This page contains recommendations adopted by the NASA Earth Science Data System (ESDS) Dataset Interoperability Working Group (DIWG). They are meant to enhance the interoperability of Earth Science data product files by improving their compliance with relevant metadata conventions and, in turn, their discoverability and extensibility. These recommendations are extracts from the recommendations documents published by the ESDIS Standards Office (ESO) and have been copied verbatim with only formatting changes. If the two versions differ, the ESO-published documents should be considered authoritative.
This page will be updated as new recommendations are adopted.
| Acronym | Meaning |
| --- | --- |
| ACDD | Attribute Convention for Dataset Discovery |
| ASCII | American Standard Code for Information Interchange |
| attribute | HDF5 attribute or netCDF attribute |
| API | Application Program Interface |
| CCB | Configuration Change Board |
| CCR | Configuration Change Request |
| CDM | Unidata Common Data Model |
| CF | Climate and Forecast Metadata Conventions |
| CRS | Coordinate Reference System |
| COARDS | Cooperative Ocean/Atmosphere Research Data Service |
| DAAC | Distributed Active Archive Center |
| DIWG | Dataset Interoperability Working Group |
| EOSDIS | Earth Observing System Data and Information System |
| ESDS | Earth Science Data Systems |
| ESDIS | Earth Science Data and Information System |
| ESDSWG | Earth Science Data System Working Groups |
| ESIP | Federation of Earth Science Information Partners |
| ESO | ESDIS Standards Office |
| group | HDF5 group or netCDF group |
| GSFC | Goddard Space Flight Center |
| HDF | Hierarchical Data Format |
| HDF4 | Hierarchical Data Format, version 4 |
| HDF5 | Hierarchical Data Format, version 5 |
| HDF-EOS | Hierarchical Data Format - Earth Observing System |
| HDF-EOS5 | Hierarchical Data Format - Earth Observing System, version 5 |
| HPD | HDF Product Designer |
| IEEE | Institute of Electrical and Electronics Engineers |
| ISO | International Organization for Standardization |
| JPL | Jet Propulsion Laboratory |
| MCC | Metadata Compliance Checker |
| NASA | National Aeronautics and Space Administration |
| ncks | netCDF Kitchen Sink |
| ncpdq | netCDF Permute Dimensions Quickly |
| ncwa | netCDF Weighted Averager |
| netCDF | Network Common Data Form |
| netCDF3 | Network Common Data Form, version 3 |
| netCDF4 | Network Common Data Form, version 4 |
| NUG | NetCDF User Guide |
| OGC | Open Geospatial Consortium |
| OPeNDAP | Open-source Project for a Network Data Access Protocol |
| RFC | Request For Comments |
| THREDDS | Thematic Real-time Environmental Distributed Data Services |
| UTC | Universal Time Coordinated (also: Coordinated Universal Time) |
| UTF-8 | Unicode Transformation Format—8-bit |
| WGS-84 | World Geodetic System 84 |
| XML | Extensible Markup Language |
Recommendations Approved by ESO
Consider “balanced” chunking for 3-D datasets in grid structures
Recommendation Details: If a dataset is exceptionally large, it is often more useful to break it up into manageable parts. This process is known as chunking and is used on data in datasets that are part of a grid structure. Exactly how the data chunking is done can greatly affect performance for the end user. Because the precise access pattern employed by the end user is usually unknown until the distributor analyzes sufficient requests to discern a pattern, it is difficult to determine the most effective way to chunk.
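The access-pattern tradeoff can be made concrete by counting how many chunks each common pattern touches. The sketch below is illustrative (the dataset and chunk shapes are hypothetical, not from the recommendation):

```python
import math

def chunk_reads(dims, chunk):
    """Count chunks touched by two common access patterns on a
    (time, lat, lon) dataset stored with the given chunk shape."""
    t, y, x = dims
    ct, cy, cx = chunk
    timeseries = math.ceil(t / ct)                       # one point, all times
    spatial_map = math.ceil(y / cy) * math.ceil(x / cx)  # one time, full map
    return timeseries, spatial_map

dims = (7300, 1800, 3600)  # ~20 years daily on a 0.1-degree grid (illustrative)

# Chunking by whole spatial slices favors maps but punishes time series:
print(chunk_reads(dims, (1, 1800, 3600)))  # (7300, 1)
# A more "balanced" chunk shape serves both patterns reasonably well:
print(chunk_reads(dims, (100, 180, 360)))  # (73, 100)
```

Neither pattern is optimal under the balanced shape, but neither is catastrophically slow, which is the point of the recommendation.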
Consistent Units Attribute Value for Variables Across One Data Collection
Recommendation Details: Knowing the physical units of data in a variable is vital for proper use. While the presence of the `units` attribute satisfies that requirement, its value may vary from one file to another for the same variable. For example, even when two values of the `units` attribute represent the same length unit (e.g., micrometers written as `μm` in one file and spelled differently in another), software processing that variable's data from different files may not take the attribute's value change into account. Using the same `units` attribute value for one variable throughout a data collection decreases the chance of errors.
Distinguish clearly between HDF and netCDF packing conventions
Recommendation Details: Earth Science observers and modelers often employ a technique called "packing" (a.k.a. "scaling") to make their product files smaller. "Packed" datasets must be correctly "unpacked" before they can be used properly. Confusingly, non-netCDF (e.g., HDF4_CAL) and netCDF algorithms both store their parameters in attributes with the same or similar names, and unpacking with one algorithm data packed with the other results in incorrect conversions. Many netCDF-based tools are unaware of the non-netCDF (e.g., HDF4_CAL) packing cases and so interpret all readable data using the netCDF convention. Unfortunately, few users are aware that their datasets may be packed, and fewer know the details of the packing algorithm employed. This is an interoperability issue because it hampers data analysis performed on heterogeneous systems.
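To illustrate why the two conventions must not be confused, compare the unpacking formulas (netCDF/CF versus the HDF4 calibration convention) applied to the same attribute values; the numbers are hypothetical:

```python
def unpack_netcdf(packed, scale_factor, add_offset):
    # netCDF/CF convention: unpacked = packed * scale_factor + add_offset
    return packed * scale_factor + add_offset

def unpack_hdf4(packed, scale_factor, add_offset):
    # HDF4 calibration convention: unpacked = scale_factor * (packed - add_offset)
    return scale_factor * (packed - add_offset)

# Identical attribute values, very different results:
print(unpack_netcdf(100, 0.5, 10.0))  # 60.0
print(unpack_hdf4(100, 0.5, 10.0))    # 45.0
```

A tool that applies the netCDF formula to HDF4-calibrated data (or vice versa) silently produces wrong physical values.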
Include Basic CF Attributes
Recommendation Details: The Climate and Forecast (CF) Conventions are widely employed guidelines for Earth Science data and metadata storage. Included in the conventions is a comprehensive list of metadata attributes that are available for use by dataset producers. Because the list of metadata attributes is so extensive, dataset producers often struggle to decide which metadata attributes to attach to a variable.
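As a sketch, a minimal variable-attribute set along these lines might look like the following; the variable name and values are illustrative, and the authoritative attribute list is in the full recommendation:

```python
# A small, commonly used set of CF variable attributes (illustrative):
sst_attributes = {
    "units": "kelvin",                           # physical units of the data
    "long_name": "sea surface temperature",      # human-readable description
    "standard_name": "sea_surface_temperature",  # from the CF standard name table
    "_FillValue": -9999.0,                       # value marking missing data
}
print(sorted(sst_attributes))
```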
Include Datum Attributes for Data in Grid Structures
Recommendation Details: Locations on Earth are specified using coordinates, which are tuples of numbers that describe the horizontal and vertical distances from a fixed point, thereby pinpointing a particular place on the map at some level of precision. But knowing the coordinates is very different from being able to interpret them.
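As a sketch of what datum metadata can look like, CF expresses the datum through grid-mapping attributes. The parameter values below are the published WGS-84 constants; the container-variable name `crs` is an illustrative convention, not mandated by the source:

```python
# CF grid-mapping attributes for a latitude/longitude grid on the WGS-84 datum:
crs_attributes = {
    "grid_mapping_name": "latitude_longitude",
    "semi_major_axis": 6378137.0,        # metres (WGS-84)
    "inverse_flattening": 298.257223563,  # WGS-84
}

# Each data variable then points at this container variable, e.g. by
# carrying the attribute {"grid_mapping": "crs"}.
print(crs_attributes["grid_mapping_name"])
```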
Include Time Coordinate in Swath Structured Data
Recommendation Details: A time coordinate is required for a swath file when it contains data from many time instances. Sometimes swath files with data from a single time instance could be without a time coordinate because each file records that specific time value in either the file name or file-level attributes, or both. A time coordinate should be defined and used in all data variables that vary in time, regardless of the number of time instances. This will allow downstream users to more easily and efficiently aggregate data across separate files.
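A CF-style time coordinate stores numeric offsets from an epoch declared in its units attribute. A small illustrative sketch (the epoch, scan times, and units string are hypothetical):

```python
from datetime import datetime, timezone

# CF-style time coordinate: numeric offsets from a stated epoch.
epoch = datetime(2017, 1, 30, tzinfo=timezone.utc)
units = "seconds since 2017-01-30 00:00:00Z"  # illustrative units string

# Three swath scan times, two seconds apart:
scans = [datetime(2017, 1, 30, 0, 0, s, tzinfo=timezone.utc) for s in (0, 2, 4)]
time_values = [(t - epoch).total_seconds() for t in scans]
print(units, time_values)  # [0.0, 2.0, 4.0]
```

With the time values stored as a coordinate variable, aggregation tools can concatenate granules without parsing file names.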
Include Time Dimension in Grid Structured Data
Recommendation Details: A Time dimension is required for a single grid product file that contains many time intervals of daily, weekly, or monthly averages. In contrast, grid product files that are distributed as daily, weekly, or monthly granules that have one time range or stamp for the entire grid could be defined without a Time dimension because each file records the specific time interval being provided in both the file name and file-level attributes. We nevertheless recommend that a Time dimension be defined and used in all data fields that vary in time, regardless of whether multiple time slices are stored in the file. More specifically, we recommend that Time be defined as a record dimension, not a fixed-length dimension. This allows downstream users to more easily and efficiently aggregate data across separate files because HDF5 and netCDF4 storage geometries and APIs are designed to more easily extend record dimensions than fixed-length dimensions.
Keep Coordinate Values in Coordinate Variables
Recommendation Details: Coordinate values are essential for putting all the other data in the proper physical domain context. Storing coordinate values in coordinate variables improves the consistency of data access, especially by software, which is directly related to data interoperability. Storing any part of coordinate values in attributes, variable names, or group names is strongly discouraged. For example, avoid encoding time coordinate data in group hierarchies like `/2017/01/30` (three nested groups named `2017`, `01`, and `30`).
Mapping between ACDD and ISO
Recommendation Details: The ESIP Community supports a vast array of systems that are accessed and utilized by a diverse group of users. Historically, groups within the community have approached metadata differently in order to effectively describe their data. As a result, similar dialects have emerged to address specific user requirements. This multi-dialect approach hinders interoperability, as it results in different terminology being used to describe the same concepts. By clearly depicting fundamental documentation needs and concepts and mapping to them in the different dialects, confusion is minimized and interoperability is facilitated. Thus, demonstrating connections between dialects increases the discoverability, accessibility, and reusability of data via consistent, compatible metadata.
Maximize HDF5/netCDF4 interoperability via API accessibility
Recommendation Details: NASA data products based on Earth Science observations are typically in HDF, trending to HDF5, while Earth Science modelers generally prefer to produce data in netCDF, trending to netCDF4. It is not possible to make HDF4 files look exactly like netCDF3 files. On the other hand, netCDF4 is built on HDF5 (netCDF4 is essentially a subset of HDF5), and so it is possible to construct HDF5 files that are accessible from the netCDF4 API, which is a tremendous opportunity for interoperability. While using the netCDF4 API ensures this, the recommendation also provides guidance for those using the HDF5 API to ensure netCDF4 interoperability.
Not-a-Number (NaN) Value
Recommendation Details: The Institute of Electrical and Electronics Engineers (IEEE) floating-point standard defines the NaN (Not-a-Number) bit patterns to represent the results of illegal or undefined operations. Unless software is carefully written, arithmetic operations involving NaN values silently propagate NaNs or, where floating-point exceptions are trapped, can halt a program. Furthermore, every ordered comparison with at least one NaN operand evaluates to False, while the inequality operator evaluates to True; in particular, a NaN is not equal to itself. These properties make NaN values difficult to handle in numerical software and reduce the interoperability of datasets that contain NaN.
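A quick demonstration of these IEEE comparison semantics in Python:

```python
import math

nan = float("nan")

print(nan == nan)            # False: NaN is not equal even to itself
print(nan != nan)            # True: the one comparison that yields True
print(nan < 0.0, nan > 0.0)  # False False: ordered comparisons are False

# The only reliable detection is an explicit test:
print(math.isnan(nan))       # True
```

Code that filters missing data with `value == fill_value` therefore fails silently when the sentinel is NaN, which is one reason NaN sentinels harm interoperability.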
Order Dimensions to Facilitate Readability of Grid Structure Datasets
Recommendation Details: The order of the dimensions in a grid structure should be carefully considered, since it can have a significant impact on the ease with which an end user can read the data in the grid structure. While tools such as NCO's ncpdq can re-order dataset dimensions, thereby permitting dataset designers to test the effects of their ordering choices against common access patterns, re-ordering large datasets can be time-consuming and is best avoided.
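Why ordering matters follows from row-major (C-order) addressing, which HDF5 and netCDF use: elements adjacent in the last dimension are adjacent on disk. An illustrative sketch (the shape is hypothetical):

```python
def c_order_offset(t, y, x, shape):
    """Flat offset of element (t, y, x) in a C-order (row-major) array."""
    T, Y, X = shape
    return (t * Y + y) * X + x

shape = (10, 180, 360)
# Varying the LAST dimension reads adjacent elements (fast, sequential I/O):
print([c_order_offset(0, 0, x, shape) for x in range(3)])  # [0, 1, 2]
# Varying the FIRST dimension jumps Y*X elements per step (slow, strided I/O):
print([c_order_offset(t, 0, 0, shape) for t in range(3)])  # [0, 64800, 129600]
```

So the dimension that users iterate over most finely is usually best placed last.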
Use CF Bounds Attributes
Recommendation Details: The CF conventions are widely employed guidelines for Earth Science data and metadata storage. The purpose of the CF conventions is to require conforming datasets to contain sufficient metadata that they are self-describing in the following ways: Each variable in the file has an associated description of what it represents, including physical units if appropriate; and each value can be located in space (relative to Earth-based coordinates) and time. Thus, adhering to CF guidelines will increase completeness, consistency, and interoperability of conforming datasets.
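As a sketch, a CF bounds variable pairs each coordinate cell center with its cell edges; the coordinate values and variable name below are illustrative:

```python
# CF cell bounds for a 1-degree latitude coordinate: each center value
# gets a (lower, upper) pair stored in a companion bounds variable.
lat = [-89.5, -88.5, -87.5]                   # cell centers
lat_bnds = [(c - 0.5, c + 0.5) for c in lat]  # cell edges
print(lat_bnds)  # [(-90.0, -89.0), (-89.0, -88.0), (-88.0, -87.0)]

# The coordinate variable would then carry the attribute {"bounds": "lat_bnds"}.
```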
Verify CF compliance
Recommendation Details: The Climate and Forecast (CF) Conventions are widely employed guidelines for Earth Science data and metadata storage, which help tool developers find and/or interpret data values, coordinates, units, and measurements in data files. Thus, it is increasingly important to adhere to CF conventions in order to benefit from analysis tools, web services, and middleware that exploit them.
When to Employ Packing Attributes
Recommendation Details: Packing refers to a lossy means of data compression that typically works by converting floating-point data to an integer representation that requires fewer bytes for storage. The packing attributes `scale_factor` and `add_offset` are the netCDF (and CF) standard names for the parameters of the packing and unpacking algorithms. If `scale_factor` is 1.0 and `add_offset` is 0.0, the packed value and the unpacked value are identical, although their datatypes (float or integer) may differ. Unfortunately, many datasets annotate floating-point variables with these attributes, apparently for completeness, even though the variables have not been packed and remain floating-point values. Attaching packing attributes to data that have not been packed is a misuse of the packing standard and should be avoided. Data analysis software that encounters packing attributes on unpacked data is liable to be confused and perform in unexpected ways. Packed data must be represented as integers, and only integer types should have packing attributes.
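A minimal sketch of packing done correctly: the stored values are integers, and a power-of-two scale factor is chosen here only so the round trip is exact (the data values are hypothetical):

```python
def pack(values, scale_factor, add_offset):
    # Store as integers: packed = round((value - add_offset) / scale_factor)
    return [round((v - add_offset) / scale_factor) for v in values]

def unpack(packed, scale_factor, add_offset):
    # netCDF/CF unpacking: unpacked = packed * scale_factor + add_offset
    return [p * scale_factor + add_offset for p in packed]

scale, offset = 0.25, 0.0
data = [12.25, 13.5, 271.75]
packed = pack(data, scale, offset)
print(packed)                         # [49, 54, 1087] -- integers, as required
print(unpack(packed, scale, offset))  # [12.25, 13.5, 271.75]
```

A floating-point variable carrying these attributes without having gone through the `pack` step is exactly the misuse the recommendation warns against.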
Additional Recommendations Awaiting Formal ESO Approval
Document Missing Granules for Instruments that Acquire Data on a Regular Basis
Recommendation Details: It is not uncommon for an Earth Science product to be missing some data granules. For example, data gaps can occur because of missing L0/1 data (e.g., the instrument was not operating, the instrument was in a mode that did not produce useful observations, or a permanent loss of telemetry occurred).
Make a Variable's Valid Data Range Useful
Recommendation Details: Declaring the valid range of a variable's data according to the CF metadata conventions is part of an earlier DIWG recommendation (see Recommendation 2.1 of ESDS-RFC-028). The data value range can be specified either by two CF attributes, `valid_min` and `valid_max`, or via the `valid_range` CF attribute. Only one of these approaches should be used for a given variable.
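Either form reduces to the same check when reading. An illustrative sketch (the attribute values and variable are hypothetical):

```python
def mask_invalid(values, valid_range):
    """Treat values outside [valid_min, valid_max] as missing (None)."""
    lo, hi = valid_range
    return [v if lo <= v <= hi else None for v in values]

# e.g. valid_range = [0.0, 100.0] for a percent-cloud-cover variable:
print(mask_invalid([5.0, 42.0, -9999.0, 101.0], (0.0, 100.0)))
# [5.0, 42.0, None, None]
```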
Use a Number Outside of the Valid Data Range for a Variable's Fill Value
Recommendation Details: The CF `_FillValue` attribute is used to indicate missing or invalid data for a variable. The value of the CF `_FillValue` attribute should also match the actual fill value used for the variable in the file.
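A sketch of the idea, with a hypothetical valid range and fill value: because the fill value lies outside the valid range, it can never collide with a legitimate measurement.

```python
valid_min, valid_max = 0.0, 50.0
fill_value = -9999.0  # deliberately outside [valid_min, valid_max]

# The fill value must never look like real data:
assert not (valid_min <= fill_value <= valid_max)

data = [3.2, fill_value, 17.8]
real = [v for v in data if v != fill_value]
print(real)  # [3.2, 17.8]
```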
Use DOIs for Referencing Documentation
Recommendation Details: The CF `references` attribute is useful for storing information regarding documentation in Earth Science data products. The CF `references` attribute can exist both globally and at the variable level. The most concise way to reference a document is via its DOI. We suggest that a space-separated list of documentation DOIs be used in the CF `references` attribute in Earth Science data products. Use of the URL form of the DOI is strongly recommended. URLs of relevant documents that do not have DOIs can also be used in the CF `references` attribute.
Use Only Officially Supported Compression Filters on NetCDF4 and NetCDF4-Compatible HDF5 Data
Recommendation Details: NetCDF4 has enabled access to non-default (i.e., non-DEFLATE) HDF5 compression filters starting from version 4.7.0. However, the filter identification and access are currently obscure (~five digit IDs) and non-portable (no guarantees client software will be able to decompress them). DEFLATE is currently the only compression filter that is guaranteed to work with default (non-customized) netCDF4 installations, and so DEFLATE is the only compression filter that should be used in interoperable Earth Science data products in netCDF4 or netCDF4-compatible HDF5 formats. Use of the shuffle filter is not prohibited since it is not a compression filter and is supported by the netCDF4 default installation. Combining the shuffle and the DEFLATE filters can noticeably improve the data compression ratio.
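The benefit of combining shuffle with DEFLATE can be approximated with Python's zlib (which implements DEFLATE) and a hand-rolled byte shuffle; the data below are illustrative, and real files would apply these filters per chunk via the HDF5/netCDF4 library:

```python
import zlib

def shuffle(raw: bytes, itemsize: int) -> bytes:
    """Byte-shuffle: gather byte 0 of every item, then byte 1, and so on
    (this is what HDF5's shuffle filter does before compression)."""
    return b"".join(raw[i::itemsize] for i in range(itemsize))

# Smoothly varying 4-byte little-endian integers, a common data pattern:
raw = b"".join(v.to_bytes(4, "little") for v in range(10000))

plain = zlib.compress(raw, 9)                  # DEFLATE alone
shuffled = zlib.compress(shuffle(raw, 4), 9)   # shuffle + DEFLATE
print(len(plain), len(shuffled))  # shuffled compresses noticeably better here
```

Shuffling groups the slowly varying high-order bytes into long runs that DEFLATE compresses very efficiently.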