
This page contains recommendations adopted by the NASA Earth Science Data System (ESDS) Dataset Interoperability Working Group (DIWG). They are meant to enhance the interoperability of Earth Science data product files by improving their compliance with relevant metadata conventions and, in turn, their discoverability and extensibility. These recommendations are extracted from the recommendation documents published by the ESDIS Standards Office (ESO) and have been copied verbatim with only formatting changes. If there are any differences between the two versions, the ESO-published documents should be considered authoritative.

This page will be updated as new recommendations are adopted.

Glossary

ACDD: Attribute Convention for Dataset Discovery
ASCII: American Standard Code for Information Interchange
attribute: HDF5 attribute or netCDF attribute
API: Application Program Interface
CCB: Configuration Change Board
CCR: Configuration Change Request
CDM: Unidata Common Data Model
CF: Climate and Forecast Metadata Conventions
CRS: Coordinate Reference System
COARDS: Cooperative Ocean/Atmospheric Research Data Service
DAAC: Distributed Active Archive Center
DIWG: Dataset Interoperability Working Group
EOSDIS: Earth Observing System Data and Information System
ESDS: Earth Science Data Systems
ESDIS: Earth Science Data and Information System
ESDSWG: Earth Science Data System Working Groups
ESIP: Federation of Earth Science Information Partners
ESO: ESDIS Standards Office
group: HDF5 group or netCDF group
GSFC: Goddard Space Flight Center
HDF: Hierarchical Data Format
HDF4: Hierarchical Data Format, version 4
HDF5: Hierarchical Data Format, version 5
HDF-EOS: Hierarchical Data Format - Earth Observing System
HDF-EOS5: Hierarchical Data Format - Earth Observing System, version 5
HPD: HDF Product Designer
IEEE: Institute of Electrical and Electronics Engineers
ISO: International Organization for Standardization
JPL: Jet Propulsion Laboratory
MCC: Metadata Compliance Checker
NASA: National Aeronautics and Space Administration
NaN: Not-a-Number
ncks: netCDF Kitchen Sink
NCO: netCDF Operator
ncpdq: netCDF Permute Dimensions Quickly
ncwa: netCDF Weighted Averager
netCDF: Network Common Data Form
netCDF3: Network Common Data Form, version 3
netCDF4: Network Common Data Form, version 4
NUG: NetCDF User Guide
OGC: Open Geospatial Consortium
OPeNDAP: Open-source Project for a Network Data Access Protocol
RFC: Request For Comments
THREDDS: Thematic Real-time Environmental Distributed Data Services
UTC: Universal Time Coordinated (also: Coordinated Universal Time)
UTF-8: Unicode Transformation Format, 8-bit
variable: HDF5 dataset or netCDF variable
WGS-84: World Geodetic System 84
WKT: Well-Known Text
XML: Extensible Markup Language

Recommendations Approved by ESO

  • Page:
    Adopt Semantically Rich Dataset Release Identifiers

    We recommend that dataset release (version) identifiers be able to represent at least the following information:

    • Changes in the file content that only improve its compliance with the already published documentation ("patch" release).
  • Page:
    Character Set for User-Defined Group, Variable, and Attribute names

    We recommend that user-defined group, variable, and attribute names follow the Climate and Forecast (CF) convention's specification: names shall comply with the regular expression [A-Za-z][A-Za-z0-9_]*. System-defined names for any of these objects that are required by various APIs or conventions are exempt.
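
A minimal compliance check, sketched in Python (is_valid_name is a hypothetical helper, not part of any cited API):

```python
import re

# CF-compliant name: a letter followed by letters, digits, or underscores,
# matching the regular expression given in this recommendation.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

def is_valid_name(name: str) -> bool:
    """Return True if a user-defined group/variable/attribute name complies."""
    return NAME_RE.fullmatch(name) is not None
```

For example, is_valid_name("sea_surface_temperature") is True, while names that start with a digit ("2m_temperature") or contain a hyphen ("lat-lon") fail the check.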

  • Page:
    Consider “balanced” chunking for 3-D datasets in grid structures

    We recommend that "balanced" chunking be considered for three-dimensional datasets in grid structures.

    Recommendation Details: If a dataset is exceptionally large, it is often more useful to break it up into manageable parts. This process is known as chunking and is used on data in datasets that are part of a grid structure. Exactly how the data chunking is done can greatly affect performance for the end user. Because the precise access pattern employed by the end user is usually unknown until the distributor analyzes sufficient requests to discern a pattern, it is difficult to determine the most effective way to chunk.
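
One way to compute such a chunk shape is sketched below (a hedged Python illustration of the "balanced" idea; balanced_chunks and the value-count target are assumptions, not part of any published API):

```python
import math

def balanced_chunks(shape, target_values):
    """Suggest a balanced chunk shape for a (time, lat, lon) grid.

    'Balanced' here means the number of chunks read for a full time
    series at one point roughly equals the number read for one full
    spatial slice. Illustrative only; real datasets need clamping,
    rounding to divisors, and a byte-size (not value-count) target.
    """
    t_len, y_len, x_len = shape
    # With N total chunks, the two access patterns balance when each
    # touches about sqrt(N) chunks; c is that common chunk count.
    c = math.sqrt(t_len * y_len * x_len / target_values)
    t = max(1, round(t_len / c))
    y = max(1, round(y_len / math.sqrt(c)))
    x = max(1, round(x_len / math.sqrt(c)))
    return (min(t, t_len), min(y, y_len), min(x, x_len))
```

For a 1000 x 100 x 100 grid with a target of about 10,000 values per chunk, this yields 32 x 18 x 18 chunks, so a time series and a spatial slice each touch a comparable number of chunks.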

  • Page:
    Consistent Units Attribute Value for Variables Across One Data Collection

    We recommend that the units attribute's value for a particular variable, if given, be the same across all files in a data collection.

    Recommendation Details: Knowing the physical units of a variable's data is vital for proper use. While the presence of the units attribute satisfies that requirement, its value may vary from one file to another for the same variable. For example, both mm and μm represent length, but software processing that variable's data from different files may not take the change in the attribute's value into account. Using the same units attribute value for one variable throughout a data collection decreases the chance of errors.

  • Page:
    Date-Time Information in Granule Filenames

    We recommend that date-time information in granule file names adhere to the following guidelines:

    • Adopt the ISO 8601 [11] standard for date-time information representation.
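
A minimal sketch in Python (the granule_timestamp helper and the choice of the ISO 8601 "basic" form, which avoids colons that are unsafe in file names, are illustrative assumptions):

```python
from datetime import datetime, timezone

def granule_timestamp(dt: datetime) -> str:
    """Format a granule file-name timestamp using the ISO 8601 basic form.

    The basic form (YYYYMMDDThhmmssZ) omits colons, which many file
    systems do not allow in file names.
    """
    return dt.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
```

For example, 2020-01-30 12:00:00 UTC becomes "20200130T120000Z", which sorts chronologically as a plain string.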
  • Page:
    Distinguish clearly between HDF and netCDF packing conventions

    We recommend that datasets with non-netCDF packing be clearly distinguished from datasets that use the netCDF packing convention.

    Recommendation Details: Earth Science observers and modelers often employ a technique called "packing" (a.k.a. "scaling") to make their product files smaller. "Packed" datasets must be correctly "unpacked" before they can be used properly. Confusingly, non-netCDF (e.g., HDF4_CAL) and netCDF algorithms both store their parameters in attributes with the same or similar names, and unpacking one algorithm's data with the other algorithm produces incorrect conversions. Many netCDF-based tools are equally unaware of the non-netCDF (e.g., HDF4_CAL) packing cases and so interpret all readable data using the netCDF convention. Unfortunately, few users are aware that their datasets may be packed, and fewer know the details of the packing algorithm employed. This is an interoperability issue because it hampers data analysis performed on heterogeneous systems.
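
The contrast can be sketched as follows (function names are illustrative; the netCDF/CF convention unpacks as packed * scale_factor + add_offset, while the HDF4 calibration convention computes cal * (packed - offset), to the best of our reading of the two specifications):

```python
def unpack_netcdf(packed, scale_factor, add_offset):
    """netCDF/CF convention: unpacked = packed * scale_factor + add_offset."""
    return packed * scale_factor + add_offset

def unpack_hdf4(packed, cal, cal_offset):
    """HDF4 calibration convention: unpacked = cal * (packed - cal_offset)."""
    return cal * (packed - cal_offset)
```

With the same numeric parameters (say scale/cal = 0.5 and offset = 2.0, packed value 10), the two formulas yield 7.0 and 4.0 respectively, so applying the wrong convention silently corrupts the data whenever the offset is nonzero.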

  • Page:
    Ensure Granule's Filename Uniqueness Across Different Dataset Releases

    We recommend that each granule belonging to an Earth Science dataset in a public archive have a unique file name across different dataset releases (versions, collections) to improve interoperability and avoid confusion. The minimum content needed to ensure a unique granule file name consists of:

  • Page:
    Include Basic CF Attributes

    We recommend that, at minimum, the following basic Climate and Forecast (CF) Convention attributes be included in future NASA Earth Science data products where applicable.

    Recommendation Details: The Climate and Forecast (CF) Conventions are widely employed guidelines for Earth Science data and metadata storage. Included in the conventions is a comprehensive list of metadata attributes that are available for use by dataset producers. Because the list is so extensive, dataset producers often struggle to decide which metadata attributes to attach to a variable.

  • Page:
    Include Datum Attributes for Data in Grid Structures

    We recommend that Horizontal and Vertical (as necessary) Datum attributes be included for data in grid structures.

    Recommendation Details: Locations on Earth are specified using coordinates, which are tuples of numbers that describe the horizontal and vertical distances from a fixed point, thereby pinpointing a particular place on the map at some level of precision. But knowing the coordinates is very different from being able to interpret them.

  • Page:
    Include Georeference Information with Geospatial Coordinates

    This recommendation expands on the Include Datum Attributes for Data in Grid Structures recommendation.

    We recommend that Earth Science dataset granules be produced with complete georeferencing information for all their geospatial coordinates. This georeference information should be encoded in an interoperable way based on the CF convention and the following specific guidelines:

  • Page:
    Include Time Coordinate in Swath Structured Data

    We recommend that swath dataset files always include a time coordinate even when there is only one value.

    Recommendation Details: A time coordinate is required for a swath file when it contains data from many time instances. Swath files with data from a single time instance are sometimes written without a time coordinate because each file records that specific time value in the file name, in file-level attributes, or both. Nevertheless, a time coordinate should be defined and used in all data variables that vary in time, regardless of the number of time instances. This allows downstream users to more easily and efficiently aggregate data across separate files.

  • Page:
    Include Time Dimension in Grid Structured Data

    We recommend that datasets in grid structures include a Time dimension, even if Time is degenerate (i.e., includes only one value) for the cases when the entire grid has one time range or time stamp.

    Recommendation Details: A Time dimension is required for a single grid product file that contains many time intervals of daily, weekly, or monthly averages. In contrast, grid product files that are distributed as daily, weekly, or monthly granules that have one time range or stamp for the entire grid could be defined without a Time dimension because each file records the specific time interval being provided in both the file name and file-level attributes. We nevertheless recommend that a Time dimension be defined and used in all data fields that vary in time, regardless of whether multiple time slices are stored in the file. More specifically, we recommend that Time be defined as a record dimension, not a fixed-length dimension. This allows downstream users to more easily and efficiently aggregate data across separate files because HDF5 and netCDF4 storage geometries and APIs are designed to more easily extend record dimensions than fixed-length dimensions.
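
As an illustrative sketch in CDL (the notation used by ncdump), a grid file with a degenerate, record-style Time dimension might be declared as follows; the variable names and units are hypothetical:

```cdl
netcdf example_grid {
dimensions:
        time = UNLIMITED ; // record dimension, even with only one value
        lat = 180 ;
        lon = 360 ;
variables:
        double time(time) ;
                time:units = "days since 2000-01-01" ;
        float precip(time, lat, lon) ;
                precip:units = "mm" ;
}
```

Declaring time as UNLIMITED rather than fixed-length is what lets tools append further time steps, or aggregate many single-time granules, without rewriting the file layout.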

  • Page:
    Keep Coordinate Values in Coordinate Variables

    We recommend that all coordinate values be stored in coordinate variables. No coordinate values, or any part thereof, should be stored in attributes, variable names or group names.

    Recommendation Details: Coordinate values are essential for putting all the other data in the proper physical domain context. Storing coordinate values in coordinate variables improves the consistency of data access, especially by software, which is directly related to data interoperability. Storing any part of coordinate values in attributes, variable names or group names is strongly discouraged. For example, avoid encoding time coordinate data in group hierarchies like 2017/01/30 (these are three groups named 2017, 01, and 30, respectively).

  • Page:
    Make HDF5 files netCDF4-Compatible and CF-compliant within Groups

    We recommend that all HDF5 Earth Science product files be made netCDF4-compatible and CF-compliant within groups.

    Recommendation Details:

  • Page:
    Mapping between ACDD and ISO

    We recommend use of existing mapping between ACDD and ISO developed by ESIP.

    Recommendation Details: The ESIP Community supports a vast array of systems that are accessed and utilized by a diverse group of users. Historically, groups within the community have approached metadata differently in order to effectively describe their data. As a result, similar dialects have emerged to address specific user requirements. This multi-dialect approach hinders interoperability, as it results in different terminology being used to describe the same concepts. By clearly depicting fundamental documentation needs and concepts and mapping them to the different dialects, confusion is minimized and interoperability is facilitated. Thus, demonstrating connections between dialects increases the discoverability, accessibility, and reusability of data via consistent, compatible metadata.

  • Page:
    Maximize HDF5/netCDF4 interoperability via API accessibility

    We recommend that Earth Science data product files in HDF5 be designed to maximize netCDF4 interoperability by making such HDF5 files accessible from the netCDF4 API to the extent that this is possible.


    Recommendation Details: NASA data products based on Earth Science observations are typically in HDF, trending to HDF5, while Earth Science modelers generally prefer to produce data in netCDF, trending to netCDF4. It is not possible to make HDF4 files look exactly like netCDF3 files. On the other hand, netCDF4 is built on HDF5 (netCDF4 is essentially a subset of HDF5), and so it is possible to construct HDF5 files that are accessible from the netCDF4 API, which is a tremendous opportunity for interoperability. While using the netCDF4 API ensures this, the recommendation also provides guidance for those using the HDF5 API to ensure netCDF4 interoperability.

  • Page:
    Not-a-Number (NaN) Value

    We recommend that Earth Science data products avoid using Not-a-Number (NaN) in any field values or as an indicator of missing or invalid data.

    Recommendation Details: The Institute of Electrical and Electronics Engineers (IEEE) floating-point standard defines the NaN (Not-a-Number) bit patterns to represent results of illegal or undefined operations. Unless carefully written, any arithmetic operation involving NaN values can halt a program. Furthermore, any ordered comparison (<, <=, >, >=, ==) with at least one NaN operand evaluates to False, while != evaluates to True. These properties make NaN values difficult to handle in numerical software and reduce the interoperability of datasets that contain NaN.
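
These pitfalls are easy to demonstrate in any IEEE-754 language; in Python:

```python
import math

nan = float("nan")

# Ordered comparisons with NaN are all False, so a range check like
# lo <= x <= hi silently treats NaN as "not out of range".
assert not (nan < 0.0)
assert not (nan > 0.0)
assert not (nan == nan)   # NaN is not even equal to itself
assert nan != nan         # ...while != is True, another common surprise

# NaN propagates through arithmetic, contaminating aggregates.
assert math.isnan(sum([1.0, 2.0, nan]))
```

A numeric fill value declared via _FillValue, by contrast, compares and masks predictably in downstream software.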

  • Page:
    Order Dimensions to Facilitate Readability of Grid Structure Datasets

    We recommend that the dimensions in grid structure datasets be ordered in a manner that facilitates readability for the anticipated end users.

    Recommendation Details: The order of the dimensions in a grid structure should be carefully considered, since it can have a significant impact on the ease with which an end user can read the data. While tools such as NCO's ncpdq can re-order dataset dimensions, thereby permitting dataset designers to test the effects of their ordering choices against common access patterns, re-ordering large datasets can be time consuming and is best avoided.

  • Page:
    Standardize File Extensions for HDF5/netCDF Files

    We recommend using standardized file name extensions for HDF5 and netCDF files, as follows:

    • .h5 for files created with the HDF5 API;
    • .nc for files created with the netCDF API; and
  • Page:
    Use CF Bounds Attributes

    We recommend that spatio-temporal and other coordinate boundaries be specified by adding CF "bounds" attributes.

    Recommendation Details: The CF conventions are widely employed guidelines for Earth Science data and metadata storage. The purpose of the CF conventions is to require conforming datasets to contain sufficient metadata that they are self-describing in the following ways: Each variable in the file has an associated description of what it represents, including physical units if appropriate; and each value can be located in space (relative to Earth-based coordinates) and time. Thus, adhering to CF guidelines will increase completeness, consistency, and interoperability of conforming datasets.

  • Page:
    Use the Units Attribute Only for Variables with Physical Units

    We recommend adhering to the CF convention's guidance on the use of the units attribute with the following clarifications:

    • A unitless (dimensionless in the physical sense) variable is indicated by the lack of a units attribute, unless:
  • Page:
    Verify CF compliance

    We recommend that CF compliance of NASA-distributed HDF/netCDF files be verified.

    Recommendation Details: The Climate and Forecast (CF) Conventions are widely employed guidelines for Earth Science data and metadata storage, which help tool developers find and/or interpret data values, coordinates, units, and measurements in data files. Thus, it is increasingly important to adhere to CF conventions in order to benefit from analysis tools, web services, and middleware that exploit them.

  • Page:
    When to Employ Packing Attributes

    We recommend that packing attributes (i.e., scale_factor and add_offset) be employed only when data are packed as integers.

    Recommendation Details: Packing refers to a lossy means of data compression that typically works by converting floating-point data to an integer representation that requires fewer bytes for storage. The packing attributes scale_factor and add_offset are the netCDF (and CF) standard names for the parameters of the packing and unpacking algorithms. If scale_factor is 1.0 and add_offset is 0.0, the packed value and the unpacked value are identical, although their datatypes (float or integer) may differ. Unfortunately, many datasets annotate floating-point variables with these attributes, apparently for completeness, even though the variables have not been packed and remain floating-point values. Attaching packing attributes to data that have not been packed is a misuse of the packing standard and should be avoided. Data analysis software that encounters packing attributes on unpacked data is liable to be confused and perform in unexpected ways. Packed data must be represented as integers, and only integer types should have packing attributes.
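
A sketch of correct use (numpy-based; the pack/unpack helper names are illustrative): floating-point values are stored as int16, and the packing attributes belong only on that integer variable:

```python
import numpy as np

def pack(unpacked, scale_factor, add_offset, dtype=np.int16):
    """Pack floats as integers: packed = round((unpacked - add_offset) / scale_factor)."""
    return np.round((np.asarray(unpacked) - add_offset) / scale_factor).astype(dtype)

def unpack(packed, scale_factor, add_offset):
    """netCDF/CF unpacking: unpacked = packed * scale_factor + add_offset."""
    return packed * scale_factor + add_offset

# Example: pack Kelvin temperatures into int16 with ~0.01 K resolution.
temps = np.array([273.15, 280.0, 300.0])
packed = pack(temps, scale_factor=0.01, add_offset=273.15)
restored = unpack(packed, 0.01, 273.15)
```

The round trip recovers the data to within the chosen resolution; a floating-point variable that skips the pack step should carry no scale_factor or add_offset at all.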

Additional Recommendations Awaiting Formal ESO Approval

  • Page:
    Document Missing Granules for Instruments that Acquire Data on a Regular Basis

    A list of all missing granules should be provided on a permanent Web site (e.g., the DOI landing page) for each Earth Science data product.

    Recommendation Details: It is not uncommon for an Earth Science product to be missing some data granules. For example, data gaps can occur because of missing L0/1 data (e.g., the instrument was not operating, the instrument was in a mode that did not produce useful observations, or a permanent loss of telemetry occurred).

  • Page:
    Make a Variable's Valid Data Range Useful

    The valid range for each variable in an Earth Science data product should put useful constraints on the data.

    Recommendation Details: Declaring the valid range of a variable's data according to the CF metadata conventions is part of an earlier DIWG recommendation (see Recommendation 2.1 of ESDS-RFC-028). The data value range can be specified either by two CF attributes, valid_min and valid_max, or via the valid_range CF attribute. Only one of these approaches should be used for a given variable.
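
As a sketch of how a range-aware reader applies these attributes (using numpy masked arrays; mask_invalid and the sample values are illustrative assumptions):

```python
import numpy as np

def mask_invalid(data, valid_min, valid_max):
    """Mask values falling outside [valid_min, valid_max], as a CF-aware
    reader would after consulting valid_min/valid_max (or valid_range)."""
    return np.ma.masked_outside(data, valid_min, valid_max)

# Hypothetical sea-surface temperatures in Kelvin; -9999.0 is a fill value.
sst = np.array([271.3, 305.2, -9999.0])
good = mask_invalid(sst, valid_min=250.0, valid_max=320.0)
```

A useful range (here 250-320 K) excludes the fill value and physically impossible readings; a range like the full extent of the datatype would constrain nothing.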

  • Page:
    Use a Number Outside of the Valid Data Range for a Variable's Fill Value

    The fill value of a variable should be a number outside its valid data range.

    Recommendation Details: The CF _FillValue attribute is used to indicate missing or invalid data for a variable. The value of the _FillValue attribute should also match the actual fill value used for the variable in the file.

  • Page:
    Use DOIs for Referencing Documentation

    A space-separated list of documentation DOIs should be used in the CF references attribute in Earth Science data products, both globally and for specific variables.

    Recommendation Details: The CF references attribute is useful for storing information about documentation in Earth Science data products. The references attribute can exist both globally and at the variable level. The most concise way to reference a document is via its DOI, and use of the URL form of the DOI is strongly recommended. URLs of relevant documents that do not have DOIs can also be used in the references attribute.

  • Page:
    Use Only Officially Supported Compression Filters on NetCDF4 and NetCDF4-Compatible HDF5 Data

    Only compression filters that are officially supported by a default installation of the current netCDF4 software distribution should be used in Earth Science data products in netCDF4 or netCDF4-compatible HDF5 formats.

    Recommendation Details: NetCDF4 has enabled access to non-default (i.e., non-DEFLATE) HDF5 compression filters starting from version 4.7.0. However, filter identification and access are currently obscure (filters are identified by numeric IDs of roughly five digits) and non-portable (there is no guarantee that client software will be able to decompress them). DEFLATE is currently the only compression filter guaranteed to work with default (non-customized) netCDF4 installations, and so it is the only compression filter that should be used in interoperable Earth Science data products in netCDF4 or netCDF4-compatible HDF5 formats. Use of the shuffle filter is not prohibited, since it is not a compression filter and is supported by the default netCDF4 installation. Combining the shuffle and DEFLATE filters can noticeably improve the data compression ratio.
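
To illustrate why shuffle pairs well with DEFLATE, the sketch below reproduces the byte-shuffle idea in plain Python (shuffle and unshuffle are hypothetical helpers; this is a conceptual model of HDF5's shuffle filter, not its implementation):

```python
import struct
import zlib

def shuffle(data: bytes, width: int) -> bytes:
    """Byte-shuffle: emit byte 0 of every value, then byte 1, and so on."""
    return bytes(data[j] for i in range(width) for j in range(i, len(data), width))

def unshuffle(data: bytes, width: int) -> bytes:
    """Invert shuffle(): re-interleave the byte groups into whole values."""
    n = len(data) // width
    return bytes(data[i * n + k] for k in range(n) for i in range(width))

# 1000 float32 values of similar magnitude: their high-order (sign/exponent)
# bytes repeat, so grouping like bytes typically gives DEFLATE longer runs.
values = struct.pack("<1000f", *[x * 0.001 for x in range(1000)])
plain_size = len(zlib.compress(values))
shuffled_size = len(zlib.compress(shuffle(values, 4)))
```

The transform is lossless and order-preserving, which is why it is safe to combine with DEFLATE; for data with correlated high-order bytes, shuffled_size is typically smaller than plain_size.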