Satellite Imagery Processing
============================

Who am I?
---------

I am [David Hoese](http://github.com/djhoese) and I work as a software developer at the [Space Science and Engineering Center (SSEC)](https://www.ssec.wisc.edu/) at the University of Wisconsin - Madison. My job is to create software that makes meteorological data more accessible to scientists and researchers. I am also a member of the open source [PyTroll](http://pytroll.github.io/) community, where I act as a core developer on the [SatPy](http://satpy.readthedocs.io/en/latest/) library. I use SatPy in my SSEC projects, Polar2Grid and Geo2Grid, which provide a simple command line interface on top of the features provided by SatPy.

The Problem I'm trying to solve
-------------------------------

Satellite imagery data is often hard to read and use because of the many different formats and structures it can come in. To make satellite imagery more useful, the SatPy library wraps common operations performed on satellite data in simple interfaces. Typically, meteorological satellite data needs to go through some or all of the following steps:

- **Read:** Read the observed scientific data from one or more data files while keeping track of geolocation and other metadata so the data stays as useful and descriptive as possible.
- **Composite:** Combine one or more different "channels" of data to bring out certain features. This is typically shown as RGB images.
- **Correct:** Sometimes data has artifacts, from the instrument hardware or the atmosphere for example, that can be removed or adjusted.
- **Resample:** Visualization tools often support only a small subset of Earth projections, and satellite data is usually not in those projections. Resampling is also useful for intercomparisons between different instruments (on the same satellite or not).
- **Enhancement:** Data can be normalized or scaled in ways that make certain atmospheric conditions more apparent. This can also be used to better fit data into other data types (8-bit integers, etc.).
- **Write:** Visualization tools typically support only a few specific file formats. Some of these formats are difficult to write or have small differences depending on what application they are destined for.

As satellite instrument technology advances, scientists have to learn how to handle more channels per instrument, at spatial and temporal resolutions that were unheard of when they first learned how to use satellite data. If they are lucky, scientists may have access to a high performance computing system, while the rest may have to settle for long execution times on their desktop or laptop machines. By optimizing the parts of the processing that take a lot of time and memory, we hope that scientists can worry about the science and leave the annoying parts to SatPy.

How Dask helps
--------------

SatPy's use of Dask makes it possible to do calculations on laptops that used to require high performance server machines. We were originally drawn to [Xarray](http://xarray.pydata.org/en/stable/)'s `DataArray` objects for their metadata handling and their support for Dask arrays. We knew that our original use of Numpy masked arrays would not scale to the new satellite data being produced.
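For illustration (this is not SatPy code), here is a minimal sketch of what an Xarray `DataArray` backed by a Dask array looks like; the array shape, chunk size, and attributes are made-up placeholders:

```python
import dask.array as da
import xarray as xr

# Hypothetical satellite "channel": a large 2D array split into
# 2048x2048 chunks. No data is generated or loaded yet.
lazy_data = da.random.random((10000, 10000), chunks=2048)
channel = xr.DataArray(
    lazy_data,
    dims=("y", "x"),
    attrs={"platform_name": "Example-Sat", "sensor": "example"},
)

# Operations only build a task graph instead of allocating full-size
# intermediate arrays in memory.
corrected = channel * 1.05 - 0.02  # placeholder "correction"
print(corrected.data)              # still a lazy dask array
result = corrected.compute()       # the work happens here, chunk by chunk
```

Nothing is read or computed until that final `compute()` call, which is what makes the lazy evaluation and chunked processing described below possible.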
SatPy has now switched to `DataArray` objects backed by Dask and leverages Dask's ability to do the following:

- **Lazy evaluation:** Software development is so much easier when you don't have to remove intermediate results from memory to process the next step.
- **Task caching:** Our processing involves a lot of intermediate results that can be shared between different processes. When things are optimized in the Dask graph, it saves us as developers from having to code the "reuse" logic ourselves. It also means that intermediate results that are no longer needed can be disposed of and their memory freed.
- **Parallel workers and array chunking:** Satellite data is usually compared by geographic location, so a pixel at one index is often compared with the pixel of another array at that same index. Splitting arrays into chunks and processing them separately gives us a great performance improvement, and not having to manage which worker gets what chunk of the array makes development effortless.

Benefiting from all of the above lets us create amazing high resolution RGB images in 6-8 minutes on 3-year-old laptops; with SatPy's old Numpy implementation the same processing would have run for 35+ minutes before crashing from memory limitations.

Pain points when using Dask
---------------------------

1. Dask arrays are not Numpy arrays. Almost everything is supported, or is close enough that you get used to it, but not everything. With most things you can get away with it and get perfectly good performance; with others you may end up computing your arrays multiple times in just a couple of lines of code without knowing it. Sometimes I wish there was a Dask feature to raise an exception if your array is computed without you explicitly saying it was ok.
2. Common satellite data formats, like GeoTIFF, can't always be written to by multiple writers (multiple nodes on a cluster), and some aren't even thread-safe. Opening a file object and using it with `dask.array.store` may work with some schedulers and not others (a sketch of one partial workaround appears near the end of this post).
3. Dimension changes are a pain. Satellite data processing sometimes involves lookup tables, for example to work around bandwidth limitations when sending data from the satellite to the ground, or other similar situations. Having to use lookup tables, or something like a KDTree, can be really difficult and confusing to code with Dask and get right. It typically involves using `atop`, `map_blocks`, or sometimes suffering the penalty of passing things to a `Delayed` function where the entire data array is passed as one complete memory-hungry array.
4. A lot of satellite processing seems to perform better with the default threaded Dask scheduler than with the distributed scheduler due to the nature of the problems being solved. A lot of processing, especially the creation of RGB images, requires comparing multiple arrays in different ways and can suffer from the amount of communication between distributed workers. There isn't an easy way that I know of to control where things are processed and which scheduler to use without requiring users to know detailed information about the internals of Dask.

Technology I use around Dask
----------------------------

As mentioned earlier, SatPy uses [Xarray](http://xarray.pydata.org/en/stable/) to wrap most of our Dask operations when possible. We have other useful tools that we've created in the PyTroll community to help support deploying satellite processing tools on servers, but they are not specific to Dask.
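To tie the read/composite/resample/enhance/write steps described earlier together, here is a hedged sketch of a typical SatPy workflow based on its documented `Scene` interface; the file glob, reader name, composite name, and target area are placeholders that would change for your own data:

```python
from glob import glob
from satpy import Scene

# Read: the reader name ("abi_l1b") and the file glob are placeholders
# for whatever instrument/format is actually being processed.
scn = Scene(filenames=glob("/data/goes16/*.nc"), reader="abi_l1b")

# Composite: loading a composite pulls in the channels it needs; everything
# underneath is a dask-backed xarray DataArray, so nothing is computed yet.
scn.load(["true_color"])

# Resample: remap to a predefined output area (placeholder name) so the
# data fits the projections that downstream tools understand.
new_scn = scn.resample("my_output_area")

# Enhance + Write: the geotiff writer applies the configured enhancement
# and writes one GeoTIFF per loaded dataset.
new_scn.save_datasets(writer="geotiff")
```

Most of this pipeline is just building a Dask task graph; the heavy computation happens when the results are written, which is what keeps memory use manageable on a laptop.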
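Returning to the GeoTIFF pain point above, one partial workaround when staying on the default threaded scheduler is to hand `dask.array.store` a lock so that a target that is not thread-safe is only touched by one thread at a time. This is a standalone, hypothetical sketch, not how SatPy's writers are implemented, and it does not address the multi-node case:

```python
import threading

import dask.array as da
import numpy as np

# Stand-in for a non-thread-safe target: anything supporting __setitem__,
# e.g. a rasterio dataset opened for writing. A numpy array is used here
# so the sketch runs without any files.
output = np.empty((8192, 8192), dtype=np.float32)

data = da.random.random((8192, 8192), chunks=2048).astype(np.float32)

# With an explicit lock, dask serializes the per-chunk writes while the
# chunk computations themselves still run in parallel.
da.store(data, output, lock=threading.Lock())
```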
Links
-----

- [PyTroll Community](http://pytroll.github.io/)
- [SatPy](http://satpy.readthedocs.io/en/latest/)
- [SatPy Examples](http://satpy.readthedocs.io/en/latest/examples.html)
- [PyResample](http://pyresample.readthedocs.io/en/latest/)