Capturing ideas and snippets of information about cloud analytics.

Building blocks

(Open to suggestions about better categories or names of categories!)

Data Access

  • WCS 2.0 - multi-dimensional coverage data access over the Internet
  • OPeNDAP - discipline-neutral means of requesting and providing data across the World Wide Web

Data Processing Services

  • WPS 2.0 - rules for standardizing inputs and outputs for geospatial processing services
  • WCPS 1.0 - protocol-independent language for the extraction, processing, and analysis of multi-dimensional coverages representing sensor, image, or statistics data

Data Models & Formats

  • Common Data Model - Unidata's abstract data model for scientific datasets, merges netCDF, HDF5, and OPeNDAP data models
  • Cloud Optimized GeoTIFF - GeoTIFF with internal organization that enables more efficient workflows on the cloud via HTTP GET range requests
  • EO JSON - a number of efforts to develop JSON specs for coverage data
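
The range-request idea behind Cloud Optimized GeoTIFF can be sketched with the standard library alone. This is a minimal illustration, not a COG reader: the URL and the byte counts are hypothetical, and the request is only constructed, never sent. A real client would first fetch the file header, read the tile offsets from it, and then issue requests like this one for just the tiles it needs.

```python
from urllib.request import Request


def build_range_request(url: str, start: int, length: int) -> Request:
    """Build an HTTP GET request for `length` bytes starting at byte `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    end = start + length - 1  # the end of an HTTP byte range is inclusive
    return Request(url, headers={"Range": f"bytes={start}-{end}"}, method="GET")


# Hypothetical URL and size: ask for the first 16 KiB, where a COG keeps its header.
req = build_range_request("https://example.com/data/scene.tif", 0, 16384)
print(req.get_header("Range"))  # bytes=0-16383
```

A server that honors the header answers with status 206 (Partial Content) and only the requested bytes, which is what lets a client avoid downloading the whole file.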

Data Libraries

  • xarray - pandas-like toolkit for analytics on labeled multi-dimensional arrays
  • dask - flexible parallel computing library for analytic computing

Workflow and orchestration

  • dask.distributed - lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters
  • daskernetes - deploys Dask workers on Kubernetes clusters using native Kubernetes APIs. It is designed to dynamically launch short-lived deployments of workers during the lifetime of a Python process.
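
To make the concurrent.futures connection concrete, here is a small sketch of the submit/result pattern using only the standard library. The chunk data and the `chunk_mean` helper are made up for illustration. With dask.distributed the shape of the code is similar: a `Client` also offers `submit`, and the futures it returns also have `result()`, but the work runs on cluster workers rather than local threads.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_mean(chunk):
    """Mean of one chunk of values (hypothetical stand-in for real per-chunk work)."""
    return sum(chunk) / len(chunk)


# Made-up data, already split into chunks that can be processed independently.
chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

with ThreadPoolExecutor(max_workers=3) as executor:
    # submit() returns a future immediately; result() blocks until the value is ready.
    futures = [executor.submit(chunk_mean, c) for c in chunks]
    means = [f.result() for f in futures]

print(means)  # [2.0, 5.0, 8.0]
```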

Visualization & Interaction

  • PyTables - package built on top of the HDF5 library for interactively browsing, processing, and searching very large amounts of data
  • Jupyter Notebooks - interactive code execution and visualization

Metadata & Catalogs

  • SpatioTemporal Asset Catalog - expose Earth observation data as spatiotemporal asset catalogs (possible on-the-fly catalog for cloud pipelines?)

Interoperability Tools

  • OpenAPI Initiative - standardizing how to describe REST APIs (based on Swagger)

Other work

NASA

  • Cumulus - Cloud-based data ingest, archive, distribution and management system for EOSDIS

Other

  • Pangeo - an experimental deployment of JupyterHub, Dask, and xarray on Google Container Engine (GKE) to support atmospheric and oceanographic data analysis on large datasets

Tutorials & Articles

Tutorials

Articles

  • Fostering Cross-Disciplinary Earth Science Through Datacube Analytics, P Baumann et al, 2018 - Abstract, Chapter PDF
  • Archive Management of NASA Earth Observation Data to Support Cloud Analysis, C Lynnes, K Baynes, M McInerney, 2017 - PDF

Questions

  • How does one do interprocess communication in the cloud? In the old days there was Shared Memory, Pipes, Sockets, and Files. In the cloud it seems there's primarily HTTP (and maybe sockets) or files.
  • How do you decide when it's better to write out a file so you can stop running a process that's costing per minute vs. holding things in memory so you don't incur storage charges?
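
One way to frame the second question is as a break-even comparison. The sketch below is a rough model under stated assumptions, not a pricing tool: every rate is a hypothetical placeholder, storage is assumed to be billed per GB-month of 30 days, and `restart_overhead_min` stands for the extra compute minutes spent writing the file and reading it back.

```python
MINUTES_PER_MONTH = 60 * 24 * 30  # assumes a 30-day billing month


def compare_costs(compute_rate_per_min, idle_minutes,
                  storage_rate_per_gb_month, size_gb,
                  restart_overhead_min=0.0):
    """Return (write_out_is_cheaper, hold_cost, store_cost) for one idle period."""
    # Option A: keep the process running and the data in memory while idle.
    hold_cost = compute_rate_per_min * idle_minutes
    # Option B: write the data out, stop the process, and pay storage plus
    # the compute time needed to write and later reload the data.
    storage_cost = storage_rate_per_gb_month * size_gb * idle_minutes / MINUTES_PER_MONTH
    store_cost = storage_cost + compute_rate_per_min * restart_overhead_min
    return store_cost < hold_cost, hold_cost, store_cost


# Hypothetical rates: 0.01 per compute minute, 0.03 per GB-month, 100 GB of state.
long_idle = compare_costs(0.01, 600, 0.03, 100, restart_overhead_min=10)
short_idle = compare_costs(0.01, 5, 0.03, 100, restart_overhead_min=10)
print(long_idle[0], short_idle[0])  # True False
```

Under these made-up numbers, writing out wins for a ten-hour pause and loses for a five-minute one, because the write-and-reload overhead dominates short pauses.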