I’ve been away from the blog for awhile, but thought I’d catch up a bit. I am in beautiful Madison Wisconsin (Lake Mendota! 90 degrees! Rain! Fried cheese curds!) for the NASA LP DAAC User Working Group meeting. This is a cool deal where imagery and product users meet with NASA team leaders to review products and tools. Since this UWG process is new to me, I am highlighting some of the key fun things I learned.
What is a DAAC?
A DAAC is a Distributed Active Archive Center, run by NASA Earth Observing System Data and Information System (EOSDIS). These are discipline-specific facilities located throughout the United States. These institutions are custodians of EOS mission data and ensure that data will be easily accessible to users. Each of the 12 EOSDIS DAACs process, archive, document, and distribute data from NASA's past and current Earth-observing satellites and field measurement programs. For example, if you want to know about snow and ice data, visit the National Snow and Ice Data Center (NSIDC) DAAC. Want to know about social and population data? Visit the Socioeconomic Data and Applications Data Center (SEDAC). These centers of excellence are our taxpayer money at work collecting, storing, and sharing earth systems data that are critical to science, sustainability, economy, and well-being.
What is the LP DAAC?
The Land Processes Distributed Active Archive Center (LP DAAC) is one of several discipline-specific data centers within the NASA Earth Observing System Data and Information System (EOSDIS). The LP DAAC is located at the USGS Earth Resources Observation and Science (EROS) Center in Sioux Falls, South Dakota. LP DAAC promotes interdisciplinary study and understanding of terrestrial phenomena by providing data for mapping, modeling, and monitoring land-surface patterns and processes. To meet this mission, the LP DAAC ingests, processes, distributes, documents, and archives data from land-related sensors and provides the science support, user assistance, and outreach required to foster the understanding and use of these data within the land remote sensing community.
Why am I here?
Each NASA DAAC has established a User Working Group (UWG). There are 18 people on the LP DAAC committee, 12 members from the land remote sensing community at large, like me! Some cool stuff going on. Such as...
Two upcoming launches are super interesting and important to what we are working on. First, GEDI (Global Ecosystem Dynamics Investigation) will produce the first high resolution laser ranging observations of the 3D structure of the Earth. Second, ECOSTRESS (The ECOsystem Spaceborne Thermal Radiometer Experiment on Space Station), will measure the temperature of plants: stressed plants get warmer than plants with sufficient water. ECOSTRESS will use a multispectral thermal infrared radiometer to measure surface temperature. The radiometer will acquire the most detailed temperature images of the surface ever acquired from space and will be able to measure the temperature of an individual farmer's field. Both of these sensors will be deployed on the International Space Station, so data will be in swaths, not continuous global coverage. Also, we got an update from USGS on the USGS/NASA plan for the development and deployment of Landsat 10. Landsat 9 comes 2020, Landsat 10 comes ~2027.
Other Data Projects
We heard from other data providers, and of course we heard from NEON! Remember I posted a series of blogs about the excellent NEON open remote sensing workshop I attended last year. NEON also hosts a ton of important ecological data, and has been thinking through the issues associated with cloud hosting. Tristin Goulden was here to give an overview.
NASA staff gave us a series of demos on their WebGIS services; AppEEARS; and their data website. Their webGIS site uses ArcGIS Enterprise, and serves web image services, web coverage services and web mapping services from the LP DAAC collection. This might provide some key help for us in IGIS and our REC ArcGIS online toolkits. AppEEARS us their way of providing bundles of LP DAAC data to scientists. It is a data extraction and exploration tool. Their LP DAAC data website redesign (website coming soon), which was necessitated by the requirement for a permanent DOI for each data product.
LP DAAC is going full-force in user engagement: they do workshops, collect user testimonials, write great short pieces on “data in action”, work with the press, and generally get the story out about how NASA LP DAAC data is used to do good work. This is a pretty great legacy and they are committed to keep developing it. Lyndsey Harriman highlighted their excellent work here.
Grand Challenges for remote sensing
Some thoughts about our Grand Challenges: 1) Scaling: From drones to satellites. It occurs to me that an integration between the ground-to-airborne data that NEON provides and the satellite data that NASA provides had better happen soon; 2) Data Fusion/Data Assimilation/Data Synthesis, whatever you want to call it. Discovery through datasets meeting for the first time; 3) Training: new users and consumers of geospatial data and remote sensing will need to be trained; 4) Remote Sensible: Making remote sensing data work for society.
A primer on cloud computing
We spent some time on cloud computing. It has been said that cloud computing is just putting your stuff on “someone else’s computer”, but it is also making your stuff “someone else’s problem”, because cloud handles all the painful aspects of serving data: power requirements, buying servers, speccing floor space for your servers, etc. Plus, there are many advantages of cloud computing. Including: Elasticity. Elastic in computing and storage: you can scale up, or scale down or scale sideways. Elastic in terms of money: You pay for only what you use. Speed. Commercial clouds CPUs are faster than ours, and you can use as many as you want. Near real time processing, massive processing, compute intensive analysis, deep learning. Size. You can customize this; you can be fast and expensive or slow and cheap. You use as much as you need. Short-term storage of large interim results or long-term storage of data that you might use one day.
Image courtesy of Chris Lynnes
We can use the cloud as infrastructure, for sharing data and results, and as software (e.g. ArcGIS Online, Google Earth Engine). Above is a cool graphic showing one vision of the cloud as a scaled and optimized workflow that takes advantage of the cloud: from pre-processing, to analytics-optimized data store, to analysis, to visualization. Why this is a better vision: some massive processing engines, such as SPARC or others, require that data be organized in a particular way (e.g. Google Big Table, Parquet, or DataCube). This means we can really crank on processing, especially with giant raster stacks. And at each step in the workflow, end-users (be they machines or people) can interact with the data. Those are the green boxes in the figure above. Super fun discussion, leading to importance of training, and how to do this best. Tristan also mentioned Cyverse, a new NSF project, which they are testing out for their workshops.
Image attribution: Corey Coyle
Super fun couple of days. Plus: Wisconsin is green. And warm. And Lake Mendota is lovely. We were hosted at the University of Wisconsin by Mutlu Ozdogan. The campus is gorgeous! On the banks of Lake Mendota (image attribution: Corey Coyle), the 933-acre (378 ha) main campus is verdant and hilly, with tons of gorgeous 19th-century stone buildings, as well as modern ones. UW was founded when Wisconsin achieved statehood in 1848, UW–Madison is the flagship campus of the UW System. It was the first public university established in Wisconsin and remains the oldest and largest public university in the state. It became a land-grant institution in 1866. UW hosts nearly 45K undergrad and graduate students. It is big! It has a med school and a law school on campus. We were hosted in the UW red-brick Romanesque-style Science Building (opened in 1887). Not only is it the host building for the geography department, it also has the distinction of being the first buildings in the country to be constructed of all masonry and metal materials (wood was used only in window and door frames and for some floors), and may be the only one still extant. How about that! Bye Wisconsin!
First of all, Pearl Street Mall is just as lovely as I remember, but OMG it is so crowded, with so many new stores and chains. Still, good food, good views, hot weather, lovely walk.
Welcome to Day 2! http://neondataskills.org/data-institute-17/day2/
Our morning session focused on reproducibility and workflows with the great Naupaka Zimmerman. Remember the characteristics of reproducibility - organization, automation, documentation, and dissemination. We focused on organization, and spent an enjoyable hour sorting through an example messy directory of misc data files and code. The directory looked a bit like many of my directories. Lesson learned. We then moved to working with new data and git to reinforce yesterday's lessons. Git was super confusing to me 2 weeks ago, but now I think I love it. We also went back and forth between Jupyter and python stand alone scripts, and abstracted variables, and lo and behold I got my script to run.
The afternoon focused on Lidar (yay!) and prior to coding we talked about discrete and waveform data and collection, and the opentopography (http://www.opentopography.org/) project with Benjamin Gross. The opentopography talk was really interesting. They are not just a data distributor any more, they also provide a HPC framework (mostly TauDEM for now) on their servers at SDSC (http://www.sdsc.edu/). They are going to roll out a user-initiated HPC functionality soon, so stay tuned for their new "pluggable assets" program. This is well worth checking into. We also spent some time live coding with Python with Bridget Hass working with a CHM from the SERC site in California, and had a nerve-wracking code challenge to wrap up the day.
Fun additional take-home messages/resources:
- ISO International standard for dates = YYYY-MM-DD
- Missing values in R = NA, in Python = -9999
- For cleaning messy data - check out OpenRefine - a FOS tool for cleaning messy data http://openrefine.org/
- Excel is cray-cray, best practices for spreadsheets: http://www.datacarpentry.org/spreadsheet-ecology-lesson/
- Morpho (from DataOne) to enter metadata: https://www.dataone.org/software-tools/morpho
- Pay attention to file size with your git repositories - check out: https://git-lfs.github.com/. Git is good for things you do with your hands (like code), not for large data.
- Funny how many food metaphors are used in tech teaching: APIs as a menu in a restaurant; git add vs git commit as a grocery cart before and after purchase; finding GIS data is sometimes like shopping for ingredients in a specialty grocery store (that one is mine)...
- Markdown renderer: http://dillinger.io/
- MIT License, like Creative Commons for code: https://opensource.org/licenses/MIT
- "Jupyter" means it runs with Julia, Python & R, who knew?
- There is a new project called "Feather" that allows compatibility between python and R: https://blog.rstudio.org/2016/03/29/feather/
- All the NEON airborne data can be found here: http://www.neonscience.org/data/airborne-data
- Information on the TIFF specification and TIFF tags here: http://awaresystems.be/, however their TIFF Tag Viewer is only for windows.
Thanks for everyone today! Megan Jones (our fearless leader), Naupaka Zimmerman (Reproducibility), Tristan Goulden (Discrete Lidar), Keith Krause (Waveform Lidar), Benjamin Gross (OpenTopography), Bridget Hass (coding lidar products).
Our home for the week
Just trying to get my head around some of the new big raster processors out there, in addition of course to Google Earth Engine. Bear with me (bare?) while I sort through these. Thanks for raster sleuth Stefania Di Tomasso for the leg work.
1. Geotrellis (https://geotrellis.io/)
Geotrellis is a Scala-based raster processing engine, and it is one of the first geospatial libraries on Spark. Geotrellis is able to process big datasets. Users can interact with geospatial data and see results in real time in an interactive web application (for regional, statewide dataset). For larger raster datasets (eg. US NED). GeoTrellis performs fast batch processing using Akka clustering to distribute data across the cluster. GeoTrellis was designed to solve three core problems, with a focus on raster processing:
- Creating scalable, high performance geoprocessing web services;
- Creating distributed geoprocessing services that can act on large data sets; and
- Parallelizing geoprocessing operations to take full advantage of multi-core architecture.
- GeoTrellis is designed to help a developer create simple, standard REST services that return the results of geoprocessing models.
- GeoTrellis will automatically parallelize and optimize your geoprocessing models where possible.
- In the spirit of the object-functional style of Scala, it is easy to both create new operations and compose new operations with existing operations.
2. GeoPySpark - in synthesis GeoTrellis for Python community
Geopyspark provides python bindings for working with geospatial data on PySpark (PySpark is the Python API for Spark). Spark is open source processing engine originally developed at UC Berkeley in 2009. GeoPySpark makes Geotrellis (https://geotrellis.io/) accessible to the python community. Scala is a difficult language so they have created this Python library.
We had a great day today exploring ESRI open tools in the GIF. We had a full class of 30 participants, and two great ESRI instructors (leaders? evangelists?) John Garvois and Allan Laframboise, and we worked through a range of great online mapping (data, design, analysis, and 3D) examples in the morning, and focused on using ESRI Leaflet API in the afternoon. Here are some of the key resources out there.
- Main ESRI Open Information: http://www.esri.com/software/open
- Slide deck from today: http://slides.com/alaframboise/geodev-hackerlabs#/
- Afternoon example using Leaflet: http://esri.github.io/esri-leaflet/
- ESRI's developer toolkits: https://developers.arcgis.com/en/, including
Great Stuff! Thanks Allan and John
The researchers built the cropland database by combining information from several sources, such as satellite images, regional maps, video and geotagged photos, which were shared with them by groups around the world. Combining all that information would be an almost-impossible task for a handful of scientists to take on, so the team turned the project into a crowdsourced, online game. Volunteers logged into "Cropland Capture" on a computer or a phone and determined whether an image contained cropland or not. Participants were entered into weekly prize drawings./span>