This wiki page captures developer documentation for the Harmony GDAL Adapter (HGA). It should help us understand the features of HGA, including the viable input and output types and the available transformations, and can also capture known issues and anticipated development efforts for the service.

Service Overview:

HGA is a Harmony backend service offering a number of transformation features that leverage the common geospatial toolkit GDAL. The service has been managed and developed by several teams since its inception, including the Harmony core team, the Alaska Satellite Facility (ASF), the Physical Oceanography DAAC (PO.DAAC) and now the Data Services team. The source code currently exists in an open-source repository within the NASA GitHub organisation.

Service capabilities:

The HGA regression tests are a good starting point for understanding these capabilities; the bullet points below summarise them:

  • Variable subsetting.
  • Bounding box spatial subsetting.
  • Shape file spatial subsetting (present in the code, untested).
  • Temporal subsetting (regression tests exist for this, but there does not appear to be any code implementing it).
  • Projection.
  • Reformatting.
  • "Resizing" (projection and/or regridding).
  • "Recolouring".
  • Snapshot imagery generation.
  • Inputs: NetCDF-4, GeoTIFF (or a zip-file containing either of those file types).
  • Outputs: GeoTIFF, NetCDF-4, .png, .img (needs work).
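
As an illustration of how these capabilities are exercised, the sketch below assembles a Harmony OGC API Coverages request combining variable subsetting, bounding-box subsetting, reprojection and PNG output. The collection concept ID and variable name are placeholders, not a real HGA-backed collection.

```python
# Hypothetical example of a Harmony coverages request exercising several HGA
# capabilities. The collection concept ID and variable name are placeholders.
harmony_root = 'https://harmony.earthdata.nasa.gov'
collection = 'C1234567890-PROV'   # placeholder collection concept ID
variable = 'example_variable'     # placeholder variable name

request_url = (
    f'{harmony_root}/{collection}/ogc-api-coverages/1.0.0'
    f'/collections/{variable}/coverage/rangeset'
    '?subset=lat(-10:10)'
    '&subset=lon(-20:20)'
    '&outputCrs=EPSG%3A4326'
    '&format=image%2Fpng'
)
print(request_url)
```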

Overall architecture:

The steps below are a rough description of the path taken through transform.py, the primary source of service code for HGA.

General note: GDAL commands are mostly constructed as strings that would be used in a shell or other terminal, and are then executed using Python's subprocess.check_output function (see HarmonyAdapter.cmd). It would be great to use the Python GDAL bindings that allow these transformations to be written in Python-native code. (Some operations, such as masking for a polygon spatial subset, already adopt this Python-based approach.)
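
The minimal sketch below (not the exact HGA implementation; file names and bounding-box values are placeholders) contrasts the current shell-string style with the equivalent call through the osgeo.gdal Python bindings.

```python
import subprocess

from osgeo import gdal

input_path = 'input.tif'    # placeholder file names
output_path = 'output.tif'

# Current style: build a gdal_translate command string and run it, similar to
# what HarmonyAdapter.cmd does via subprocess.
command = f'gdal_translate -projwin -10 10 10 -10 {input_path} {output_path}'
subprocess.check_output(command.split())

# Python-native style: the same bounding-box subset via gdal.Translate, which
# avoids shelling out and returns a Dataset object directly.
dataset = gdal.Translate(output_path, input_path, projWin=[-10, 10, 10, -10])
dataset = None  # close the output file
```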

  • transform.py contains HarmonyAdapter, a large subclass of BaseHarmonyAdapter from harmony-service-lib-py.
  • All requests go through the HarmonyAdapter.process_item method.
  • Granule information is initially derived from STAC items.
  • The type of the input file is determined from the extension of the downloaded input file (expecting a GeoTIFF, a NetCDF-4 file, or a zip file containing a GeoTIFF or NetCDF-4 file).
  • The appropriate function from HarmonyAdapter.process_geotiff, HarmonyAdapter.process_netcdf or HarmonyAdapter.process_zip is called.
    • HarmonyAdapter.process_zip extracts files from the zip file and then calls either HarmonyAdapter.process_geotiff or HarmonyAdapter.process_netcdf.
    • HarmonyAdapter.process_geotiff calls HarmonyAdapter.combin_transfer, which in turn calls the following class methods, in order: HarmonyAdapter.subset, HarmonyAdapter.reproject, HarmonyAdapter.resize and HarmonyAdapter.recolor.
    • HarmonyAdapter.process_netcdf uses HarmonyAdapter.nc2tiff to convert the contents of the NetCDF-4 into GeoTIFF files using gdal_translate. Then it calls HarmonyAdapter.combin_transfer.
  • HarmonyAdapter.combin_transfer in turn calls four processing methods that perform various subsetting, projection, gridding and colouring operations on the input (a sketch of the equivalent GDAL commands follows this list):
    • HarmonyAdapter.subset:
      • Combines spatial subsetting, bounding box or (allegedly) shape file, with variable subsetting.
      • Spatial subsetting occurs in HarmonyAdapter.subset2, which uses gdal_translate to extract either the requested bounding box or the minimally encompassing bounding box of a shape file; masking is then applied in the shape-file case.
      • Variable subsetting occurs using HarmonyAdapter.varsubset, and again uses gdal_translate.
    • HarmonyAdapter.reproject:
      • Builds a gdalwarp command that uses request parameters: crs, interpolation, scaleSize, scaleExtent.
    • HarmonyAdapter.resize:
      • Builds a gdal_translate command that uses request parameters: interpolation (defaults to "bilinear"), width and height.
    • HarmonyAdapter.recolor:
      • Builds a gdaldem command.
      • Checks the input Harmony message for a colour map or defaults to grey (PCESA-2456). The colour map information is stored in the UMM-Var record associated with the variable being transformed.
      • The colour maps are stored locally in gdal_subsetter/colormaps.
  • The HarmonyAdapter.add_to_results function compiles the output files and stacks them if appropriate. Stacking occurs via HarmonyAdapter.stack_multi_file_with_metadata (the logic in these methods might relate to the issues with DAS-1479).
  • Back in HarmonyAdapter.process_item, the output STAC item is updated and the outputs are staged to the Harmony S3 bucket.
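
The sketch below is illustrative only: the flags, file names and parameter values are examples rather than the exact strings built in transform.py. It approximates the chain of GDAL commands that combin_transfer assembles and runs for a GeoTIFF, in the order subset → reproject → resize → recolor.

```python
import subprocess


def run(command: str) -> bytes:
    """Run a GDAL command string, mirroring the HarmonyAdapter.cmd pattern."""
    return subprocess.check_output(command.split())


# 1. Subset: extract a bounding box (and/or selected bands) with gdal_translate.
run('gdal_translate -projwin -20 60 20 20 -b 1 input.tif subsetted.tif')

# 2. Reproject: gdalwarp driven by the request's crs, interpolation,
#    scaleExtent and scaleSize parameters.
run('gdalwarp -t_srs EPSG:4326 -r bilinear -te -20 20 20 60 -tr 0.1 0.1 '
    'subsetted.tif reprojected.tif')

# 3. Resize: gdal_translate with explicit output dimensions (width and height).
run('gdal_translate -outsize 400 400 -r bilinear reprojected.tif resized.tif')

# 4. Recolor: gdaldem color-relief with a colour-map text file (defaults live
#    in gdal_subsetter/colormaps).
run('gdaldem color-relief resized.tif colormap.txt recolored.tif')
```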

CI/CD:

Future development:

  • DAS-1478: Adding regression tests for the PO.DAAC MUR collection.
  • DAS-1479: Ensuring a request for NetCDF-4 output and "all" variables includes a STAC item for the output.
  • DAS-1480: Ensuring a request for GeoTIFF output and "all" variables includes links to all bands, not just the colour bands for the first variable.
  • DAS-XXXX: General code clean-up (e.g., breaking down the transform.py monolith, and the HarmonyAdapter class within it).
    • A potential separate module could contain all the GDAL command construction, within which we could gradually iterate away from the current subprocess architecture.
    • Another separate module could handle coordinate-related utility functions (e.g., deriving minimally encompassing bounding boxes, etc.).
  • ASF to remove their GitHub repository and Docker images in DockerHub.
  • Any Bamboo CI/CD relating to the ASF artefacts should be removed.
  • Uninformative band names in .png output, currently "Band 1", "Band 2", etc. Is it sufficient to assume that the output name will explicitly include the variable name, and what happens for an all-variable request? (Discovered during DAS-1478)
  • Requests for MUR data with a whole-Earth bounding box (36,000 x 17,999 array elements) lead to containers running out of memory and therefore request failures (a rough memory estimate follows this list). Do we need to assign more memory to the HGA containers via Harmony environment variables? Matt Savoie was able to get a request to run locally with 16 GB of memory; UAT was failing with 8 GB. (Discovered during DAS-1478)
  • Requests for image/png output with multiple variables fail to return multiple PNG files. Owen took a cursory look and noted: "So from the quickest of grokking. I think that recolor gets called for all variables (via combin_transfer from this loop or this loop). But... there's only ever one call to stage the output (here). There's a separate call to stage the .wld file, but nothing to loop through multiple outputs."
  • Reconstruct NetCDF-4 file output to more closely match the input structure; this includes: variable names, file structure, persistence of metadata variables and array structure (e.g., recombining banded variables into a single variable output, rather than "Band 1", "Band 2" etc. sub-variables). This will require consultation with ASF and PO.DAAC stakeholders to ensure they agree this is an improvement.
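
A rough, back-of-the-envelope estimate for the memory issue above, assuming 64-bit floating point data and that GDAL holds at least one full copy of the array in memory:

```python
# Rough estimate only: why a whole-Earth MUR request strains an 8 GB container.
rows, columns = 17_999, 36_000     # whole-Earth MUR grid dimensions
bytes_per_element = 8              # assuming 64-bit floating point data

single_copy_gb = rows * columns * bytes_per_element / 1024**3
print(f'{single_copy_gb:.1f} GB per in-memory copy of one band')
# ≈ 4.8 GB per copy, so a couple of intermediate copies comfortably exceed
# 8 GB, which is consistent with failures at 8 GB and local success at 16 GB.
```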