Capturing ideas and snippets of information about cloud analytics.
Building blocks
(Open to suggestions about better categories or names of categories!)
Data Access
- WCS 2.0 - multi-dimensional coverage data access over the Internet
- OPeNDAP - discipline-neutral means of requesting and providing data across the World Wide Web
Data Processing Services
- Common Data Model - Unidata's abstract data model for scientific datasets, merges netCDF, HDF5, and OPeNDAP data models
- Cloud Optimized GeoTIFF - GeoTIFF with internal organization that enables more efficient workflows on the cloud via HTTP GET range requests
- EO JSON - a number of efforts to develop JSON specs for coverage data
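The HTTP GET range requests mentioned for Cloud Optimized GeoTIFF can be sketched with the Python standard library. This is a minimal illustration, not a COG reader: the URL is hypothetical, the helper name build_range_request is made up for this page, and the 16 KiB figure is only a guess at how many leading bytes would cover the TIFF header and image directories.

```python
import urllib.request


def build_range_request(url: str, start: int, length: int) -> urllib.request.Request:
    """Build (but do not send) a GET request for `length` bytes starting at `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    end = start + length - 1  # HTTP byte ranges are inclusive on both ends
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


# Hypothetical URL; 16 KiB is an assumed size for the header region, not a COG rule.
req = build_range_request("https://example.com/scene.tif", 0, 16384)
print(req.get_method(), req.get_header("Range"))
```

Sending the request with urllib.request.urlopen(req) against a server that supports ranges should return status 206 (Partial Content) with only the requested bytes, which is what lets a client read one tile or overview without downloading the whole file.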
Data Libraries
- xarray - toolkit for analytics on labeled multi-dimensional arrays, built on NumPy and inspired by pandas
- dask - flexible parallel computing library for analytic computing
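To show how these two libraries fit together, here is a small sketch that assumes numpy, xarray, and dask are all installed. The data cube is synthetic (the values, coordinate labels, and the name "temperature" are made up for illustration); a real workflow would more likely open a dataset from files or object storage.

```python
import numpy as np
import xarray as xr

# Synthetic (time, lat, lon) cube; values are invented for illustration.
data = np.arange(4 * 3 * 2, dtype="float64").reshape(4, 3, 2)
cube = xr.DataArray(
    data,
    dims=("time", "lat", "lon"),
    coords={"time": np.arange(4), "lat": [10.0, 20.0, 30.0], "lon": [100.0, 110.0]},
    name="temperature",
)

# Split along time into dask-backed chunks; nothing is computed yet.
lazy = cube.chunk({"time": 2})

# The label-based reduction only builds a task graph; .compute() executes it.
time_mean = lazy.mean(dim="time").compute()
print(time_mean.sel(lat=10.0, lon=100.0).item())
```

The point of the pattern is that the same xarray expression works whether the array fits in memory or is split into chunks that dask schedules in parallel.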
Workflow and orchestration
- dask.distributed - lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.
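Because dask.distributed builds on the concurrent.futures interface, the submit-then-collect pattern can be sketched with the standard library alone. The example below uses ThreadPoolExecutor on one machine; the function tile_sum and the tiles data are hypothetical. With dask.distributed, the executor would be replaced by a Client connected to a cluster (assuming one is available), whose submit method likewise returns futures with a result method.

```python
from concurrent.futures import ThreadPoolExecutor


def tile_sum(tile):
    """Stand-in for per-tile work, e.g. reducing one chunk of a raster."""
    return sum(tile)


# Hypothetical input: three small "tiles" of values.
tiles = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

with ThreadPoolExecutor(max_workers=2) as pool:
    # submit() returns immediately with a future for each tile.
    futures = [pool.submit(tile_sum, t) for t in tiles]
    # result() blocks until that task has finished.
    partials = [f.result() for f in futures]

total = sum(partials)
print(partials, total)
```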
Visualization & Interaction
- SpatioTemporal Asset Catalog - expose Earth observation data as spatiotemporal asset catalogs (possible on-the-fly catalog for cloud pipelines?)
- OpenAPI Initiative - standardizing how to describe REST APIs (based on Swagger)
References / Links
Other work
NASA
- Cumulus - Cloud-based data ingest, archive, distribution and management system for EOSDIS
Other
- Pangeo - an experimental deployment of JupyterHub, Dask, and xarray on Google Kubernetes Engine (GKE, formerly Google Container Engine) to support atmospheric and oceanographic data analysis on large datasets
Tutorials & Articles
Tutorials
Articles
- Fostering Cross-Disciplinary Earth Science Through Datacube Analytics, P Baumann et al, 2018 - Abstract, Chapter PDF
- Archive Management of NASA Earth Observation Data to Support Cloud Analysis, C Lynnes, K Baynes, M McInerney, 2017 - PDF
Questions
- How does one do interprocess communication in the cloud? In the old days there were shared memory, pipes, sockets, and files. In the cloud it seems there's primarily HTTP (and maybe sockets) or files.
- How do you decide when it's better to write out a file so you can stop running a process that's costing per minute vs. holding things in memory so you don't incur storage charges?
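One way to start on the second question is a back-of-envelope comparison. The sketch below is only that: every rate is a hypothetical placeholder rather than a real provider price, and it deliberately ignores the time and request charges for writing the file out and reading it back in.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumes a 30-day billing month


def cost_to_hold(idle_minutes: float, compute_rate_per_min: float) -> float:
    """Cost of keeping the process (and its memory) alive while idle."""
    return idle_minutes * compute_rate_per_min


def cost_to_store(idle_minutes: float, size_gb: float, storage_rate_per_gb_month: float) -> float:
    """Cost of parking the state in storage for the same period."""
    return size_gb * storage_rate_per_gb_month * (idle_minutes / MINUTES_PER_MONTH)


# Hypothetical numbers: 1 hour idle, $0.01/min compute, 100 GB at $0.023 per GB-month.
hold = cost_to_hold(60, 0.01)
store = cost_to_store(60, 100, 0.023)
print(f"hold: ${hold:.4f}  store: ${store:.4f}  write it out: {store < hold}")
```

Under these made-up rates, storage is far cheaper than an idle process, so the decision would mostly hinge on how long the write and reload take compared with the idle period.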