This page contains recommendations adopted by the NASA Earth Science Data System (ESDS) Dataset Interoperability Working Group (DIWG). They are intended to enhance the interoperability of Earth Science data product files by improving their compliance with relevant metadata conventions, as well as their discoverability and extensibility. These recommendations are extracts from the recommendations documents published by the ESDIS Standards Coordination Office (ESCO) and have been copied verbatim with only formatting changes. If there are any differences between the two versions, the ESCO-published documents should be considered authoritative.

This page will be updated as new recommendations are adopted.

Glossary

ACDD: Attribute Convention for Dataset Discovery
ASCII: American Standard Code for Information Interchange
attribute: HDF5 attribute or netCDF attribute
API: Application Program Interface
CCB: Configuration Change Board
CCR: Configuration Change Request
CDM: Unidata Common Data Model
CF: Climate and Forecast Metadata Conventions
CRS: Coordinate Reference System
COARDS: Cooperative Ocean/Atmosphere Research Data Service
DAAC: Distributed Active Archive Center
DIWG: Dataset Interoperability Working Group
EOSDIS: Earth Observing System Data and Information System
ESDS: Earth Science Data Systems
ESDIS: Earth Science Data and Information System
ESDSWG: Earth Science Data System Working Groups
ESIP: Federation of Earth Science Information Partners
ESCO: ESDIS Standards Coordination Office
group: HDF5 group or netCDF group
GSFC: Goddard Space Flight Center
HDF: Hierarchical Data Format
HDF4: Hierarchical Data Format, version 4
HDF5: Hierarchical Data Format, version 5
HDF-EOS: Hierarchical Data Format - Earth Observing System
HDF-EOS5: Hierarchical Data Format - Earth Observing System, version 5
HPD: HDF Product Designer
IEEE: Institute of Electrical and Electronics Engineers
ISO: International Organization for Standardization
JPL: Jet Propulsion Laboratory
MCC: Metadata Compliance Checker
NASA: National Aeronautics and Space Administration
NaN: Not-a-Number
ncks: netCDF Kitchen Sink
NCO: netCDF Operator
ncpdq: netCDF Permute Dimensions Quickly
ncwa: netCDF Weighted Averager
netCDF: Network Common Data Form
netCDF3: Network Common Data Form, version 3
netCDF4: Network Common Data Form, version 4
NUG: NetCDF User Guide
OGC: Open Geospatial Consortium
OPeNDAP: Open-source Project for a Network Data Access Protocol
RFC: Request For Comments
THREDDS: Thematic Real-time Environmental Distributed Data Services
UTC: Universal Time Coordinated (also: Coordinated Universal Time)
UTF-8: Unicode Transformation Format, 8-bit
variable: HDF5 dataset or netCDF variable
WGS-84: World Geodetic System 84
WKT: Well-Known Text
XML: Extensible Markup Language

Recommendations Approved by ESCO

  • Page:
    Adopt Semantically Rich Dataset Release Identifiers

    We recommend that dataset release (version) identifiers be able to represent at least the following information:

    • Changes in the file content that only improve its compliance with the already published documentation ("patch" release).
  • Page:
    Character Set for User-Defined Group, Variable, and Attribute Names

    We recommend that user-defined group, variable, and attribute names follow the Climate and Forecast (CF) convention's specification. The names shall comply with this regular expression: [A-Za-z][A-Za-z0-9_]*. System-defined names for any of these objects that are required by various APIs or conventions are exempt.
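
    For illustration, a minimal Python sketch (the function and sample names are hypothetical) that tests candidate names against the CF pattern:

        import re

        # Anchored CF name pattern: a letter followed by letters,
        # digits, or underscores.
        CF_NAME = re.compile(r'[A-Za-z][A-Za-z0-9_]*')

        def is_cf_name(name):
            return CF_NAME.fullmatch(name) is not None

        assert is_cf_name('sea_surface_temperature')
        assert not is_cf_name('2m_temperature')   # leading digit
        assert not is_cf_name('temp-raw')         # hyphen not allowed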

  • Page:
    Consider “balanced” chunking for 3-D datasets in grid structures

    We recommend that "balanced" chunking be considered for three-dimensional datasets in grid structures.

    Recommendation Details: If a dataset is exceptionally large, it is often more useful to break it up into manageable parts. This process is known as chunking and is used on data in datasets that are part of a grid structure. Exactly how the data chunking is done can greatly affect performance for the end user. Because the precise access pattern employed by the end user is usually unknown until the distributor analyzes sufficient requests to discern a pattern, it is difficult to determine the most effective way to chunk.
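
    A hedged sketch of one "balanced" chunk shape using the netCDF4-python API (the dimension sizes and chunk shape below are illustrative assumptions, not prescriptions):

        from netCDF4 import Dataset

        nc = Dataset('example_grid.nc', 'w', format='NETCDF4')
        nc.createDimension('time', None)
        nc.createDimension('lat', 1800)
        nc.createDimension('lon', 3600)

        # A balanced chunk shape makes time-series reads (all times at
        # one point) and map reads (one time, whole grid) comparably
        # cheap, rather than optimizing one at the other's expense.
        tas = nc.createVariable('tas', 'f4', ('time', 'lat', 'lon'),
                                chunksizes=(32, 90, 180))
        nc.close()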

  • Page:
    Consistent Units Attribute Value for Variables Across One Data Collection

    We recommend that the units attribute’s value for a particular variable, if given, be the same across all files in a data collection.

    Recommendation Details: Knowing the physical units of data in a variable is vital for proper use. While the presence of the units attribute satisfies that requirement, its value may vary from one file to another for the same variable. For example, the units values mm and μm both represent length, yet software processing that variable’s data from different files may not take the change in the attribute’s value into account. Using the same units attribute value for one variable throughout a data collection decreases the chance of errors.
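
    One way to spot-check a collection, sketched in Python with the netCDF4 library (the file list and variable name are hypothetical):

        from netCDF4 import Dataset

        files = ['granule_001.nc', 'granule_002.nc', 'granule_003.nc']
        units_seen = {}

        for path in files:
            with Dataset(path) as nc:
                units = getattr(nc.variables['precipitation'], 'units', None)
                units_seen.setdefault(units, []).append(path)

        if len(units_seen) > 1:
            print('Inconsistent units across the collection:', units_seen)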

  • Page:
    Date-Time Information in Granule Filenames

    We recommend that date-time information in granule file names adhere to the following guidelines:

    • Adopt the ISO 8601 [11] standard for date-time information representation.
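
    For instance, a short Python sketch (the product name and timestamp are invented) that embeds an ISO 8601 basic-format timestamp in a granule file name:

        from datetime import datetime, timezone

        start = datetime(2024, 3, 17, 6, 30, 0, tzinfo=timezone.utc)

        # ISO 8601 "basic" format (no colons) keeps the name
        # filesystem-safe across platforms.
        stamp = start.strftime('%Y%m%dT%H%M%SZ')
        print(f'MYPRODUCT_L2_{stamp}_v1.0.nc')
        # MYPRODUCT_L2_20240317T063000Z_v1.0.nc
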
  • Page:
    Distinguish clearly between HDF and netCDF packing conventions

    We recommend that datasets with non-netCDF packing be clearly distinguished from datasets that use the netCDF packing convention.

    Recommendation Details: Earth Science observers and modelers often employ a technique called “packing” (a.k.a. “scaling”) to make their product files smaller. "Packed" datasets must be correctly "unpacked" before they can be used properly. Confusingly, non-netCDF (e.g., HDF4_CAL) and netCDF algorithms both store their parameters in attributes with the same or similar names, and unpacking data packed by one algorithm with the other's parameters will result in incorrect conversions. Many netCDF-based tools are equally unaware of the non-netCDF (e.g., HDF4_CAL) packing cases and so interpret all readable data using the netCDF convention. Unfortunately, few users are aware that their datasets may be packed, and fewer know the details of the packing algorithm employed. This is an interoperability issue because it hampers data analysis performed on heterogeneous systems.
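
    To make the distinction concrete, a Python sketch of the two unpacking formulas (the parameter values are invented; with these particular pairs both formulas yield the same physical values, but applying one convention's formula to the other's parameters corrupts the data):

        import numpy as np

        packed = np.array([100, 200, 300], dtype=np.int16)

        # netCDF/CF convention: unpacked = packed * scale_factor + add_offset
        scale_factor, add_offset = 0.01, 273.15
        netcdf_values = packed * scale_factor + add_offset

        # HDF4 calibration convention: physical = cal * (stored - cal_offset)
        cal, cal_offset = 0.01, -27315.0
        hdf4_values = cal * (packed - cal_offset)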

  • Page:
    Ensure Granule's Filename Uniqueness Across Different Dataset Releases

    We recommend that each granule belonging to an Earth Science dataset in a public archive have a unique file name across different dataset releases (versions, collections) to improve interoperability and avoid confusion. The minimum content needed to ensure a unique granule file name consists of:

  • Page:
    Include Basic CF Attributes

    We recommend that, at minimum, the following basic Climate and Forecast (CF) Convention attributes be included in future NASA Earth Science data products where applicable.

    Recommendation Details: The Climate and Forecast (CF) Conventions are widely employed guidelines for Earth Science data and metadata storage. Included in the conventions is a comprehensive list of metadata attributes that are available for use by dataset producers. Because the list of metadata attributes is so extensive, dataset producers often struggle to decide which metadata attributes to attach to a variable.

  • Page:
    Include Datum Attributes for Data in Grid Structures

    We recommend that Horizontal and Vertical (as necessary) Datum attributes be included for data in grid structures.

    Recommendation Details: Locations on Earth are specified using coordinates, which are tuples of numbers that describe the horizontal and vertical distances from a fixed point, thereby pinpointing a particular place on the map at some level of precision. But knowing the coordinates is very different from being able to interpret them.

  • Page:
    Include Georeference Information with Geospatial Coordinates

    This recommendation expands on the Include Datum Attributes for Data in Grid Structures recommendation.

    We recommend that Earth Science dataset granules be produced with complete georeferencing information for all their geospatial coordinates. This georeference information should be encoded in an interoperable way based on the CF convention and the following specific guidelines:

  • Page:
    Include Time Coordinate in Swath Structured Data

    We recommend that swath dataset files always include a time coordinate even when there is only one value.

    Recommendation Details: A time coordinate is required for a swath file when it contains data from many time instances. Swath files with data from a single time instance are sometimes produced without a time coordinate, because each such file records that specific time value in the file name, in file-level attributes, or both. A time coordinate should nevertheless be defined and used in all data variables that vary in time, regardless of the number of time instances. This will allow downstream users to more easily and efficiently aggregate data across separate files.
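
    A minimal netCDF4-python sketch of a degenerate (size-1) time coordinate in a single-time swath granule (all names, sizes, and values are illustrative):

        import numpy as np
        from netCDF4 import Dataset

        nc = Dataset('swath_granule.nc', 'w', format='NETCDF4')
        nc.createDimension('time', 1)        # degenerate: one value
        nc.createDimension('scan', 400)
        nc.createDimension('pixel', 270)

        time = nc.createVariable('time', 'f8', ('time',))
        time.units = 'seconds since 1970-01-01 00:00:00'
        time[:] = [1710657000.0]             # the granule's single time

        # Data variables carry the time dimension even though its size is 1.
        bt = nc.createVariable('brightness_temperature', 'f4',
                               ('time', 'scan', 'pixel'))
        nc.close()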

  • Page:
    Include Time Dimension in Grid Structured Data

    We recommend that datasets in grid structures include a Time dimension, even if Time is degenerate (i.e., includes only one value) for the cases when the entire grid has one time range or time stamp.

    Recommendation Details: A Time dimension is required for a single grid product file that contains many time intervals of daily, weekly, or monthly averages. In contrast, grid product files that are distributed as daily, weekly, or monthly granules that have one time range or stamp for the entire grid could be defined without a Time dimension because each file records the specific time interval being provided in both the file name and file-level attributes. We nevertheless recommend that a Time dimension be defined and used in all data fields that vary in time, regardless of whether multiple time slices are stored in the file. More specifically, we recommend that Time be defined as a record dimension, not a fixed-length dimension. This allows downstream users to more easily and efficiently aggregate data across separate files because HDF5 and netCDF4 storage geometries and APIs are designed to more easily extend record dimensions than fixed-length dimensions.
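
    A hedged Python sketch of Time as a record (unlimited) dimension with the netCDF4 library (names and sizes are illustrative):

        from netCDF4 import Dataset

        nc = Dataset('grid_product.nc', 'w', format='NETCDF4')

        # None makes 'time' a record dimension, so downstream tools can
        # extend it when aggregating granules.
        nc.createDimension('time', None)
        nc.createDimension('lat', 180)
        nc.createDimension('lon', 360)

        time = nc.createVariable('time', 'f8', ('time',))
        time.units = 'days since 2000-01-01 00:00:00'

        precip = nc.createVariable('precip', 'f4', ('time', 'lat', 'lon'))
        nc.close()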

  • Page:
    Keep Coordinate Values in Coordinate Variables

    We recommend that all coordinate values be stored in coordinate variables. No coordinate values, or any part thereof, should be stored in attributes, variable names or group names.

    Recommendation Details: Coordinate values are essential for putting all the other data in the proper physical domain context. Storing coordinate values in coordinate variables improves the consistency of data access, especially by software, which is directly related to data interoperability. Storing any part of coordinate values in attributes, variable names or group names is strongly discouraged. For example, avoid encoding time coordinate data in group hierarchies like 2017/01/30 (these are three groups named 2017, 01, and 30, respectively).
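
    As a sketch of the preferred pattern (Python, netCDF4; names and dates invented), the date belongs in the time coordinate's data, not in a /2017/01/30 group path:

        from datetime import datetime
        from netCDF4 import Dataset, date2num

        nc = Dataset('timeseries.nc', 'w', format='NETCDF4')
        nc.createDimension('time', None)

        time = nc.createVariable('time', 'f8', ('time',))
        time.units = 'days since 2017-01-01 00:00:00'
        time.calendar = 'standard'

        # Store the coordinate value itself, keeping group and variable
        # names free of encoded dates.
        time[:] = date2num([datetime(2017, 1, 30)], time.units, time.calendar)
        nc.close()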

  • Page:
    Make HDF5 files netCDF4-Compatible and CF-compliant within Groups

    We recommend that all HDF5 Earth Science product files be made netCDF4-compatible and CF-compliant within groups.

    Recommendation Details:

  • Page:
    Mapping between ACDD and ISO

    We recommend use of the existing mapping between ACDD and ISO developed by ESIP.

    Recommendation Details: The ESIP Community supports a vast array of systems that are accessed and utilized by a diverse group of users. Historically, groups within the community have approached metadata differently in order to effectively describe their data. As a result, similar dialects have emerged to address specific user requirements. This multi-dialect approach hinders interoperability, as it results in different terminology being used to describe the same concepts. By clearly depicting fundamental documentation needs and concepts and mapping to them in the different dialects, confusion is minimized and interoperability is facilitated. Thus, demonstrating connections between dialects increases discoverability, accessibility, and reusability of data via consistent, compatible metadata.

  • Page:
    Maximize HDF5/netCDF4 interoperability via API accessibility

    We recommend that Earth Science data product files in HDF5 be designed to maximize netCDF4 interoperability by making such HDF5 files accessible from the netCDF4 API to the extent possible.

    Recommendation Details: NASA data products based on Earth Science observations are typically in HDF, trending to HDF5, while Earth Science modelers generally prefer to produce data in netCDF, trending to netCDF4. It is not possible to make HDF4 files look exactly like netCDF3 files. On the other hand, netCDF4 is built on HDF5 (the netCDF4 data model is essentially a subset of HDF5's), so it is possible to construct HDF5 files that are accessible from the netCDF4 API, which is a tremendous opportunity for interoperability. While using the netCDF4 API ensures this, the recommendation also provides guidance for those using the HDF5 API to ensure netCDF4 interoperability.
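
    One way to approach this from the HDF5 side, sketched with h5py (the file, variable, and coordinate names are hypothetical): one ingredient of netCDF4 compatibility is representing dimensions as HDF5 dimension scales, which h5py can create directly.

        import numpy as np
        import h5py

        with h5py.File('product.h5', 'w') as f:
            lat = f.create_dataset('lat', data=np.linspace(-90, 90, 180))
            lon = f.create_dataset('lon', data=np.linspace(-180, 180, 360))

            # Register the coordinates as HDF5 dimension scales, which is
            # what the netCDF4 API expects to find for named dimensions.
            lat.make_scale('lat')
            lon.make_scale('lon')

            tas = f.create_dataset('tas', shape=(180, 360), dtype='f4')
            tas.dims[0].attach_scale(lat)
            tas.dims[1].attach_scale(lon)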

  • Page:
    Not-a-Number (NaN) Value

    We recommend that Earth Science data products avoid using Not-a-Number (NaN) in any field values or as an indicator of missing or invalid data.

    Recommendation Details: The Institute of Electrical and Electronics Engineers (IEEE) floating-point standard defines the NaN (Not-a-Number) bit patterns to represent the results of illegal or undefined operations. Unless a program is carefully written, arithmetic operations involving NaN values can halt it. Furthermore, relational comparisons with a NaN operand evaluate to False (the sole exception being "not equal", which evaluates to True). These properties make NaN values difficult to handle in numerical software and reduce the interoperability of datasets that contain NaN.
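
    A short Python demonstration of the comparison pitfalls (the values are invented):

        import numpy as np

        x = np.array([1.0, np.nan, 3.0])

        print(np.nan == np.nan)   # False: NaN never compares equal
        print(np.nan < 1.0)       # False
        print(np.nan != np.nan)   # True: the one comparison that holds
        print(x.max())            # nan: NaN propagates through reductions

        # A sentinel fill value declared via a _FillValue attribute,
        # rather than NaN, keeps comparisons and reductions predictable.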

  • Page:
    Order Dimensions to Facilitate Readability of Grid Structure Datasets

    We recommend that the dimensions in grid structure datasets be ordered in a manner that facilitates readability for the anticipated end users.

    Recommendation Details: The order of the dimensions in a grid structure should be carefully considered, since it can have a significant impact on the ease with which an end user can read the data. While tools such as NCO's ncpdq can re-order dataset dimensions, thereby permitting dataset designers to test the effects of their ordering choices against common access patterns, re-ordering large datasets can be time-consuming and is best avoided.

  • Page:
    Standardize File Extensions for HDF5/netCDF Files

    We recommend using standardized file name extensions for HDF5 and netCDF files, as follows:

    • .h5 for files created with the HDF5 API;
    • .nc for files created with the netCDF API; and
  • Page:
    Use CF Bounds Attributes

    We recommend that spatio-temporal and other coordinate boundaries be specified by adding CF "bounds" attributes.

    Recommendation Details: The CF conventions are widely employed guidelines for Earth Science data and metadata storage. The purpose of the CF conventions is to require conforming datasets to contain sufficient metadata that they are self-describing in the following ways: Each variable in the file has an associated description of what it represents, including physical units if appropriate; and each value can be located in space (relative to Earth-based coordinates) and time. Thus, adhering to CF guidelines will increase completeness, consistency, and interoperability of conforming datasets.
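
    A hedged netCDF4-python sketch of attaching a CF bounds variable to a time coordinate (names and values are illustrative):

        from netCDF4 import Dataset

        nc = Dataset('bounded.nc', 'w', format='NETCDF4')
        nc.createDimension('time', None)
        nc.createDimension('nv', 2)       # the two bound endpoints

        time = nc.createVariable('time', 'f8', ('time',))
        time.units = 'days since 2020-01-01'
        time.bounds = 'time_bnds'         # CF pointer to the bounds variable

        time_bnds = nc.createVariable('time_bnds', 'f8', ('time', 'nv'))

        time[:] = [0.5]                   # midpoint of the first day
        time_bnds[0, :] = [0.0, 1.0]      # the interval it represents
        nc.close()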

  • Page:
    Use the Units Attribute Only for Variables with Physical Units

    We recommend adhering to the CF convention's guidance on the use of the units attribute with the following clarifications:

    • A unitless (dimensionless in the physical sense) property of the data in a variable is indicated by the absence of a units attribute, unless:
  • Page:
    Verify CF compliance

    We recommend that CF compliance of NASA-distributed HDF/netCDF files be verified.

    Recommendation Details: The Climate and Forecast (CF) Conventions are widely employed guidelines for Earth Science data and metadata storage, which help tool developers find and/or interpret data values, coordinates, units, and measurements in data files. Thus, it is increasingly important to adhere to CF conventions in order to benefit from analysis tools, web services, and middleware that exploit them.

  • Page:
    When to Employ Packing Attributes

    We recommend that packing attributes (i.e., scale_factor and add_offset) be employed only when data are packed as integers.

    Recommendation Details: Packing refers to a lossy means of data compression that typically works by converting floating point data to an integer representation that requires fewer bytes for storage. The packing attributes scale_factor and add_offset are the netCDF (and CF) standard names for the parameters of the packing and unpacking algorithms. If scale_factor is 1.0 and add_offset is 0.0, the packed value and the unpacked value are identical, although their datatypes (float or integer) may differ. Unfortunately, many datasets annotate floating point variables with these attributes, apparently for completeness, even though the variables have not been packed and remain floating point values. Incorporating packing attributes on data that have not been packed is a misuse of the packing standard and should be avoided. Data analysis software that encounters packing attributes on data that are not packed is liable to be confused and perform in unexpected ways. Packed data must be represented as integers, and only integer types should have packing attributes.
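
    A sketch in Python of packing floats into int16 with the netCDF/CF attributes (the variable name, values, and parameters are invented):

        import numpy as np
        from netCDF4 import Dataset

        data = np.array([273.15, 280.42, 301.77])   # Kelvin, float64

        nc = Dataset('packed.nc', 'w', format='NETCDF4')
        nc.createDimension('obs', data.size)

        # Integer storage plus the two CF packing attributes.
        var = nc.createVariable('t', 'i2', ('obs',))
        var.scale_factor = 0.01     # unpacked = packed * scale_factor + add_offset
        var.add_offset = 273.15

        # netCDF4-python packs float input automatically when the packing
        # attributes are present (and unpacks again on read).
        var[:] = data
        nc.close()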

Additional Recommendations Awaiting Formal ESCO Approval

  • Page:
    Attach the CF flag_values or flag_masks Attributes Along With the CF flag_meanings Attribute to Each Flag Variable

    Attach the CF flag_values or flag_masks attributes along with the CF flag_meanings attribute to each flag variable in an Earth Science data product.  The choice of which to use depends on the use case.
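
    A minimal sketch of the flag_values pattern (Python, netCDF4; the flag scheme is invented):

        import numpy as np
        from netCDF4 import Dataset

        nc = Dataset('flags.nc', 'w', format='NETCDF4')
        nc.createDimension('pixel', 5)

        qc = nc.createVariable('quality_flag', 'b', ('pixel',))
        qc.flag_values = np.array([0, 1, 2], dtype=np.int8)
        qc.flag_meanings = 'good cloud_contaminated missing'
        qc[:] = np.array([0, 0, 1, 2, 0], dtype=np.int8)
        nc.close()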

    Recommendation Details:

  • Page:
    Avoid Use of the missing_value Attribute

    Avoid use of the missing_value attribute in new Earth Science data products.

    Recommendation Details: The missing_value attribute has been semi-deprecated, and so should not be used in new Earth Science data products.

  • Page:
    Define the Projection Ellipsoid to Match the Reference Datum

    Define the projection ellipsoid to match the reference datum to minimize potential errors in geolocation and reprojection.

    Recommendation Details: When producing geolocated image data derived from satellite-based or airborne remote sensing instruments, we recommend defining the projection ellipsoid to be the same as the datum used by the remote sensing system to define geodetic latitude and longitude. Specific details depend on the selected file format and metadata conventions. Examples provided in this recommendation use GeoTIFF terminology, but the recommendation also applies to other formats and metadata conventions. For example, when using GeoTIFF to represent content in a Projected Coordinate Reference System (PCRS), the projection ellipsoid should be the same as the Geodetic Reference Frame (datum) used by the remote sensing system. For many currently operating satellite instruments, the reported geolocation is referenced to the World Geodetic System (WGS) 1984 datum. Airborne instruments that are geolocated using GPS are likewise referenced to WGS84. When geolocated data from one of these instruments are used to create derived geophysical products, data producers may choose a PCRS that includes a map projection based on a reference ellipsoid. To ensure maximum interoperability when transforming such data products, we recommend choosing the PCRS map projection ellipsoid to match the underlying Geodetic Reference Frame (datum). This will minimize the potential for geolocation errors in overlays of related geolocated information such as coastlines or comparison data products.
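
    As a hedged illustration with pyproj (the projection choice here is hypothetical), a PCRS whose ellipsoid comes from the WGS84 datum:

        from pyproj import CRS

        # Lambert azimuthal equal-area on the WGS84 ellipsoid, matching a
        # sensor whose geolocation is referenced to the WGS84 datum.
        crs = CRS.from_proj4('+proj=laea +lat_0=90 +lon_0=0 +datum=WGS84 +units=m')
        print(crs.ellipsoid.name)   # WGS 84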

  • Page:
    Document Missing Granules for Instruments That Acquire Data on a Regular Basis

    A list of all missing granules should be provided on a permanent Web site (e.g., the DOI landing page) for each Earth Science data product.

    Recommendation Details: It is not uncommon for an Earth Science product to be missing some data granules.  For example, data gaps can occur because of missing L0/1 data (e.g., the instrument was not operating, the instrument was in a mode that did not produce useful observations, or a permanent loss of telemetry occurred).

  • Page:
    Include Only One Variable per GeoTIFF File

    Include only one variable per GeoTIFF File.

    Recommendation Details: The GeoTIFF format was initially developed during the early 1990s with the objective of leveraging a mature, platform-independent file format (TIFF) by adding the metadata required for describing and using geographic image data (OGC, 2019).

  • Page:
    Indicate in CRS Metadata the Order of Elements in Horizontal Coordinate Pairs

    Indicate in CRS metadata the order of latitude and longitude in coordinate pairs in Earth Science data products.

    Recommendation Details: There is no universal agreement regarding the order of horizontal coordinate pairs (i.e., (longitude, latitude) vs. (latitude, longitude)) in Earth Science data products.  Axis ordering may be specified in the full description of the Coordinate Reference System (CRS) as given in a registry such as EPSG.  If the order is not specified in a registered CRS, or the CRS is not in a registry, we recommend using the optional axis order keyword in the well-known text (WKT) representation of a CRS (ISO 19162:2019).  The order keyword can be added after the mandatory direction keyword as shown in this example:
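
    A hedged WKT (ISO 19162) fragment (the CRS details are invented) showing the ORDER keyword following the mandatory direction keyword:

        CS[ellipsoidal, 2],
          AXIS["geodetic latitude (Lat)", north, ORDER[1]],
          AXIS["geodetic longitude (Lon)", east, ORDER[2]],
          ANGLEUNIT["degree", 0.0174532925199433]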

  • Page:
    Make a Variable's Valid Data Range Consistent Within Each Product Release

    The valid data range for each variable in an Earth Science data product should be made consistent within each product release, and should not vary file-to-file within a given product release.

    Recommendation Details: There are cases of published Earth Science data products whose valid data ranges for some variables (specified via the CF valid_min and valid_max attributes, or via the CF valid_range attribute) vary file-to-file, based on the actual data range of each particular variable within each particular product file; we do not agree with this approach.

  • Page:
    Make a Variable's Valid Data Range Useful

    The valid range for each variable in an Earth Science data product should put useful constraints on the data.

    Recommendation Details: Declaring the valid range of a variable's data according to the CF metadata conventions is part of an earlier DIWG recommendation (see Recommendation 2.1 of ESDS-RFC-028). The data value range can be specified either by two CF attributes, valid_min and valid_max, or via the valid_range CF attribute. Only one of these approaches should be used for a given variable.
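
    A sketch of the two equivalent forms (Python, netCDF4; the limits are invented); note that only one form should be used for a given variable:

        import numpy as np
        from netCDF4 import Dataset

        nc = Dataset('ranges.nc', 'w', format='NETCDF4')
        nc.createDimension('obs', 10)

        # Form 1: separate min and max attributes.
        sst = nc.createVariable('sst', 'f4', ('obs',))
        sst.valid_min = np.float32(271.0)
        sst.valid_max = np.float32(310.0)

        # Form 2: a single two-element valid_range attribute.
        chlor = nc.createVariable('chlor_a', 'f4', ('obs',))
        chlor.valid_range = np.array([0.0, 100.0], dtype=np.float32)
        nc.close()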

  • Page:
    Use a Number Outside of the Valid Data Range for a Variable's Fill Value

    The fill value of a variable should be a number outside its valid data range.

    Recommendation Details: The CF _FillValue attribute is used to indicate missing or invalid data for a variable.  Also, the value of the CF _FillValue attribute should match the actual fill value used for the variable in the file.
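
    A minimal sketch (Python, netCDF4; the values are invented) of a fill value chosen outside the valid range:

        import numpy as np
        from netCDF4 import Dataset

        nc = Dataset('filled.nc', 'w', format='NETCDF4')
        nc.createDimension('obs', 4)

        # -9999 lies safely outside [0, 60], so no real observation
        # can collide with the fill value.
        wind = nc.createVariable('wind_speed', 'f4', ('obs',),
                                 fill_value=np.float32(-9999.0))
        wind.valid_min = np.float32(0.0)
        wind.valid_max = np.float32(60.0)

        wind[:] = np.array([3.2, 7.5, -9999.0, 12.1], dtype=np.float32)
        nc.close()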

  • Page:
    Use DOIs for Referencing Documentation

    A space-separated list of documentation DOIs should be used in the CF references attribute in Earth Science data products, both globally and for specific variables.

    Recommendation Details: The CF references attribute is useful for storing information regarding documentation in Earth Science data products.  The CF references attribute can exist both globally and at the variable level.  The most concise way to reference a document is via its DOI.  We suggest that a space-separated list of documentation DOIs should be used in the CF references attribute in Earth Science data products.  Use of the URL form of the DOI is strongly recommended.  Also, URLs of relevant documents that do not have DOIs can be used in the CF references attribute.
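
    A minimal sketch of setting the attribute (Python, netCDF4; the DOIs shown are placeholders, not real documents):

        from netCDF4 import Dataset

        nc = Dataset('documented.nc', 'a')
        nc.references = ('https://doi.org/10.xxxx/example-atbd '
                         'https://doi.org/10.xxxx/example-user-guide')
        nc.close()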

  • Page:
    Use Double Precision When Archiving Time in Seconds Since a Specific Epoch

    Use double precision when archiving time in seconds since a specific epoch.

    Recommendation Details: Earth Science data products must preserve time-related information with sufficient precision to resolve all timescales relevant to the data itself, to other data with which it may be intercompared, and to conventions for the numeric representation of time, such as Coordinated Universal Time (UTC). Geoscientific datasets commonly report time in intervals (such as seconds) measured from a particular epoch. Resolving one second on the 50+-year timescale from the UNIX/POSIX epoch (00:00:00 UTC on 1 January 1970) to the present day can require up to ten significant digits of temporal resolution, whereas the IEEE-754 single-precision (32-bit) floating point representation preserves at most seven significant digits. Resolving time to the nearest microsecond can require up to six more digits, for a total of sixteen digits, approximately the maximum precision of an IEEE-754 double-precision (64-bit) floating point number. Therefore, preserving sufficient temporal precision to label, store, and intercompare geoscientific data requires double-precision storage.
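
    A quick numpy demonstration of the precision loss (the epoch arithmetic is invented for illustration):

        import numpy as np

        # Seconds from the UNIX epoch to a date in 2024, with microseconds.
        t = 1_710_657_000.123456

        single = np.float32(t)
        double = np.float64(t)

        print(f'{single:.6f}')   # 1710657024.000000: float32 spacing near
                                 # this magnitude is 128 s, so even whole
                                 # seconds are lost
        print(f'{double:.6f}')   # 1710657000.123456: microseconds preserved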

  • Page:
    Use Only Officially Supported Compression Filters on NetCDF-4 and NetCDF-4-Compatible HDF5 Data

    Only compression filters that are officially supported by a default installation of the current netCDF-4 software distribution should be used in Earth Science data products in netCDF-4 or netCDF-4-compatible HDF5 formats.

    Recommendation Details: NetCDF4 has enabled access to non-default (i.e., non-DEFLATE) HDF5 compression filters starting from version 4.7.0. However, filter identification and access are currently obscure (filters are referenced by opaque, roughly five-digit IDs) and non-portable (there is no guarantee that client software will be able to decompress them). DEFLATE is currently the only compression filter that is guaranteed to work with default (non-customized) netCDF4 installations, and so DEFLATE is the only compression filter that should be used in interoperable Earth Science data products in netCDF-4 or netCDF-4-compatible HDF5 formats. Use of the shuffle filter is not prohibited, since it is not a compression filter and is supported by the default netCDF4 installation. Combining the shuffle and DEFLATE filters can noticeably improve the data compression ratio.
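
    A sketch (Python, netCDF4; names and sizes invented) of creating a variable with the portable DEFLATE and shuffle filters:

        from netCDF4 import Dataset

        nc = Dataset('compressed.nc', 'w', format='NETCDF4')
        nc.createDimension('time', None)
        nc.createDimension('lat', 180)
        nc.createDimension('lon', 360)

        # zlib=True selects DEFLATE, the only filter guaranteed to work
        # with default netCDF4 installations; shuffle often improves
        # its compression ratio.
        var = nc.createVariable('tas', 'f4', ('time', 'lat', 'lon'),
                                zlib=True, complevel=4, shuffle=True)
        nc.close()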