Pangeo: Earth Science
=====================

Who Am I?
---------

I am [Ryan Abernathey](http://rabernat.github.io), a physical oceanographer and
professor at [Columbia University](http://columbia.edu) /
[Lamont Doherty Earth Observatory](http://ldeo.columbia.edu).

I am a founding member of the [Pangeo Project](http://pangeo.io), an
initiative aimed at coordinating and supporting the development of open source
software for the analysis of very large geoscientific datasets such as
satellite observations or climate simulation outputs.  Pangeo is funded by
[National Science Foundation Grant
1740648](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1740648&HistoricalAwards=false),
of which I am the principal investigator.

What Problem are We Trying to Solve?
------------------------------------

Many oceanographic and atmospheric science datasets consist of multi-dimensional
arrays of numerical data, such as temperature sampled on a regular latitude,
longitude, depth, time grid. These can be real data, observed by instruments
like weather balloons, satellites, or other sensors; or they can be "virtual"
data, produced by simulations. Scientists in these fields perform an extremely
wide range of different analyses on these datasets. For example:

- simple statistics like mean and standard deviation
- principal component analysis of spatio-temporal variability
- intercomparison of datasets with different spatio-temporal sampling
- spectral analysis (Fourier transforms) over various space and time dimensions
- budget diagnostics (e.g. calculating terms in the equation for heat conservation)
- machine learning for pattern recognition and prediction

Scientists like to work interactively and iteratively, trying out calculations,
visualizing the results, and tweaking their code until they eventually settle on
a result that is worthy of publication.

The traditional workflow is to download datasets to a personal laptop or
workstation and peform all analysis there. As sensor technology and computer
power continue to develop, the volume of our datasets is growing exponentially.
This workflow is not feasible or efficient with multi-terabyte datasets, and it
is impossible with petabyte-scale datasets. The fundamental problem we are
trying to solve in Pangeo is **how do we maintain the ability to perform
rapid, interactive analysis in the face of extremely large datasets?**
Dask is an essential part of our solution.

How Dask Helps
--------------

Our large multi-dimensional arrays map very well to Dask's `array` model. Our
users tend to interact with Dask via [Xarray](http://xarray.pydata.org), which
adds additional label-aware operations and group-by / resample capabilities.
The Xarray data model is explicitly inspired by the Common Data Model format
widely used in geosciences. Xarray has incorporated dask from very early in its
development, leading to close integration between these packages.

Pangeo provides configurations for deploying Jupyter, Xarray and Dask on
high-performance computing clusters and cloud platforms. On these platforms,
our users load data lazily using xarray from a variety of different storage
formats and perform analysis inside Jupyter notebooks. Working closely with
the Dask development team, we have tried to simplify the process of launching
Dask clusters interactively by using packages such as
[dask-kubernetes](https://github.com/dask/dask-kubernetes) and
[dask-jobqueue](https://github.com/dask/dask-jobqueue).
Users employ those packages to interactively launch their own Dask clusters
across many nodes of the compute system. Dask then automatically parallelizes
the xarray-based computations without users having to write much specialized
parallel code. Users appreciate the Dask dashboard, which provides a visual
indication of the progress and efficiency of their ongoing analysis. When
everything is working well, Dask is largely transparent to the user.

Why We Chose Dask Originally
----------------------------

Pangeo emerged from the Xarray development group, so Dask was a natural choice.
Beyond this, Dask's flexibility is a good fit for our applications; as
described above, scientists in this domain perform a huge range of different
types of analysis. We need a parallel computing engine which does not strongly
constrain the type of computations that can be performed nor require the user
to engage with the details of parallelization.

Pain Points
-----------

Dask's flexibility comes with some overhead.
I have the impression that the size of the graphs our users generate, which
can easily exceed a million tasks, is pushing the limits of the dask scheduler.
It is not uncommon for the scheduler to crash, or to take an uncomfortably long
time to process, when these tasks are submitted. Our workaround is mostly to
fall back on the sort of loop-based iteration over large datasets that we had
to do pre-Dask. All of this undermines the interactive experience we are trying
to achieve.

However, the first year of this project has made me optimistic about the future.
I think the interaction between Pangeo users and Dask developers has been
pretty successful. Our use cases have helped identify several performance
bottlenecks that have been fixed at the Dask level. If this trend can continue,
I'm confident we will be able to reach our desired scale (petabytes) and speed.

A broader issue relates to onboarding of new users. While I said above that
Dask operates transparently to the users, this is not always the case. Users
used to writing loop-based code to process datasets have to be retrained around
the delayed-evaluation paradigm. It can be a challenge to translate legacy code
into a Dask-friendly format. Some sort of "cheat sheet" might be able to help
with this.

Technology around Dask
----------------------

[Xarray](https://xarray.pydata.org) is the main way we interact with Dask. We use the
[`dask-jobqueque`](https://jobqueue.dask.org) and
[`dask-kubernetes`](https://kubernetes.dask.org) projects heavily.

We also use [Zarr](http://zarr.readthedocs.io) extensively for storage,
especially on the cloud, where we also employ
[`gcsfs`](https://gcsfs.readthedocs.io) and
[`s3fs`](https://s3fs.readthedocs.io) to interface with cloud storage.


Copyright and License
---------------------

Copyright 2020 Ryan Abernathey. I license this work under a [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) license.