CMR Harvesting Best Practices

There are many use cases for harvesting CMR collection-level and granule-level metadata. The CMR has added several features to support these use cases while meeting the following criteria:

Large result sets can be retrieved
While iterating through result sets, the results remain consistent
Performance of other queries in the system is unaffected
Changes to inventory are easily discoverable

After introducing the scrolling concept, we will walk through three categories of use cases related to harvesting:

Populating External Systems
Capturing Inventory Changes
Synchronizing External Systems with the CMR

CMR Search Basics

This document assumes that the reader is familiar with basic usage of the CMR Search API. More information about the CMR Search API is available in the CMR Client Partner User Guide, and the API documentation is at https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html.

Scrolling

Before jumping into the use cases, we should first introduce the scrolling feature of the CMR Search API. Many of the use cases involve iterating through large result sets, and the CMR APIs support this through the scroll query parameter.

Those familiar with the CMR API may have used the paging parameters page_num and page_size, or offset. Paging has two downsides that make it poorly suited to harvesting:

Between search calls, data can be ingested or deleted, which means that the same result may show up in multiple calls, or data can be missed, when going through the pages.
Deep paging is inefficient for our Elasticsearch system. Queries for a high offset take longer and can impact the performance of other queries in the system. As a result, the CMR rejects requests for paging beyond the one millionth result for a search.

Scrolling addresses both of these issues and is the recommended way to harvest CMR metadata.

The scroll query parameter can be used on both the collections and granules search endpoints.

scroll - A boolean flag (true/false) that allows all results to be retrieved efficiently. page_size is supported with scroll while page_num and offset are not. If scroll is true then the first call of a scroll session sets the page size; page_size is ignored on subsequent calls.

Scrolling is only supported for parameter queries, but all query parameters are available except page_num and offset. The response format for scrolling queries is identical to that of normal parameter queries, apart from the added CMR-Scroll-Id header. The CMR-Hits header is useful for determining the number of requests that will be needed to retrieve all the available results.

Scrolling is session based; the first search conducted with the scroll parameter set to true will return a session id in the form of a CMR-Scroll-Id header. This header should be included in subsequent searches until the desired number of results have been retrieved. Sessions time out after 10 minutes of inactivity; each new query before the timeout is reached with a given CMR-Scroll-Id header will reset the timeout to 10 minutes. Queries occurring after a session has timed out will result in an HTTP 404 status code and error message.

When all the results have been returned subsequent calls using the same CMR-Scroll-Id header will return an empty list.

Limitations: Scrolling through granules requires either a collection identifier or a provider id to be included in the query, to ensure that the performance of other queries is not impacted by a scroll query.
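The session behaviour described above can be sketched as a small Python generator. This is a minimal sketch under stated assumptions, not an official client: the names `scroll_pages` and `http_fetch` are hypothetical, and `http_fetch` assumes a `.json` endpoint whose body keeps its results under `feed.entry`. The transport is passed in as a function so that the loop itself can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request

CMR_SEARCH = "https://cmr.earthdata.nasa.gov/search"


def http_fetch(url, headers):
    """Hypothetical transport: GET url, return (selected headers, result list).

    Assumes a .json endpoint whose body keeps its results under feed.entry.
    A request made after the session has timed out raises urllib.error.HTTPError.
    """
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
        # Header lookup on the response is case-insensitive.
        wanted = {name: response.headers.get(name)
                  for name in ("CMR-Scroll-Id", "CMR-Hits")}
    return wanted, body.get("feed", {}).get("entry", [])


def scroll_pages(endpoint, params, fetch=http_fetch, page_size=2000):
    """Yield one list of results per request until an empty list comes back."""
    first = dict(params, scroll="true", page_size=page_size)
    url = f"{CMR_SEARCH}/{endpoint}?{urllib.parse.urlencode(first)}"
    headers, items = fetch(url, {})
    scroll_id = headers.get("CMR-Scroll-Id")
    while items:
        yield items
        # Later calls only need scroll=true plus the session header;
        # the page size was fixed by the first call.
        url = f"{CMR_SEARCH}/{endpoint}?scroll=true"
        headers, items = fetch(url, {"CMR-Scroll-Id": scroll_id})
```

A caller would iterate with something like `for page in scroll_pages("collections.json", {}):` and process each page before requesting the next, which also keeps the session from going idle.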

Example

As a harvesting client, I want to retrieve all collection metadata.

curl -i "https://cmr.earthdata.nasa.gov/search/collections?page_size=2000&scroll=true"

We are using a page size of 2000 because that is the maximum page size the CMR allows for a single request. This query will initiate our scrolling session and return the first 2000 results. Two of the returned headers are worth paying attention to:

CMR-Hits: 32636 tells me how many results there are in total. With a page size of 2000, I know that I will need to scroll 16 more times after the initial request to get through all of the results. I don't actually need to know this because I will just scroll until no more results are returned, but it can be useful for someone who just wants to know how many results there are.

CMR-Scroll-Id: -1261131662 gives me back the scrolling session ID that I will need to supply to all subsequent queries.
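The request count implied by CMR-Hits is a ceiling division. The following is a small worked sketch of that arithmetic; the function name `requests_needed` is made up for illustration.

```python
def requests_needed(hits, page_size=2000):
    """Number of requests that return results, counting the initial call."""
    # Integer ceiling division: a final partial page still costs one request.
    return (hits + page_size - 1) // page_size


total = requests_needed(32636)
print(total)  # 17: the initial request plus 16 scroll calls
```

One further call would return the empty list that signals the end of the session.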

I will continue scrolling through the results until no more results are returned:

curl -i -H "CMR-Scroll-Id: -1261131662" "https://cmr.earthdata.nasa.gov/search/collections?scroll=true"

My queries are all returning collection references. If I wanted to choose a different format I would use a different extension:

For example, to get the original metadata format for every collection I can call https://cmr.earthdata.nasa.gov/search/collections.native?scroll=true, or for echo10, https://cmr.earthdata.nasa.gov/search/collections.echo10?scroll=true.

The full list of supported formats is documented here: https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#supported-result-formats

Populating External Systems

This category of use cases is for clients that want to retrieve all or some subset of the metadata from the CMR in order to support an initial load of their system.

Use Cases

As a harvesting client, I want to retrieve all collection metadata.

Documented in scrolling example above.

As a harvesting client, I want to retrieve all collection metadata based on a given tag (CWIC, FedEO)

Refer to the CMR Search API Documentation for a detailed explanation of tag features.

Initiate scroll session

curl -i "https://cmr.earthdata.nasa.gov/search/collections?tag_key=org.ceos.wgiss.cwic.quality&page_size=2000&scroll=true"

Scroll until no more results are returned

curl -i -H "CMR-Scroll-Id: <scroll id>" "https://cmr.earthdata.nasa.gov/search/collections?scroll=true"

As a harvesting client, I want to retrieve all granule metadata for a given collection

Using C193529899-LPDAAC_ECS as the example collection.

Initiate scroll session

curl -i "https://cmr.earthdata.nasa.gov/search/granules?collection_concept_id=C193529899-LPDAAC_ECS&page_size=2000&scroll=true"

Scroll until no more results are returned

curl -i -H "CMR-Scroll-Id: <scroll id>" "https://cmr.earthdata.nasa.gov/search/granules?scroll=true"

Capturing Inventory Changes

Clients want to make sure they have the most recent data and have removed any data that is no longer valid.

Note that, depending on the harvesting interval, many of the collection searches will most likely return fewer than 2000 hits, so scrolling is not required most of the time. However, for consistency and to be able to handle any case, we recommend using scrolling for many of these use cases as well.

Use Cases

As a harvesting client, I want to retrieve only the collection metadata which was revised after a given date.

Initiate scroll session

curl -i "https://cmr.earthdata.nasa.gov/search/collections?updated_since=2017-07-12T20:06:38.331Z&page_size=2000&scroll=true"

Scroll until no more results are returned

curl -i -H "CMR-Scroll-Id: <scroll id>" "https://cmr.earthdata.nasa.gov/search/collections?scroll=true"
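A client that records when each harvest starts can turn that moment into the updated_since value for the next run. The sketch below formats a timezone-aware datetime in the same UTC, millisecond-precision form as the example above; the helper name `cmr_timestamp` is hypothetical, and whether a client should track harvest times this way is a design assumption rather than a CMR requirement.

```python
from datetime import datetime, timezone


def cmr_timestamp(moment):
    """Format an aware datetime as UTC with milliseconds and a Z suffix."""
    utc = moment.astimezone(timezone.utc)
    millis = utc.microsecond // 1000
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"


last_harvest = datetime(2017, 7, 12, 20, 6, 38, 331000, tzinfo=timezone.utc)
print(cmr_timestamp(last_harvest))  # 2017-07-12T20:06:38.331Z
```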

As a harvesting client, I want to retrieve only the collection metadata which was newly added to the CMR after a given date.

Initiate scroll session

curl -i "https://cmr.earthdata.nasa.gov/search/collections?created_at=2017-07-12T20:06:38.331Z&page_size=2000&scroll=true"

Scroll until no more results are returned

curl -i -H "CMR-Scroll-Id: <scroll id>" "https://cmr.earthdata.nasa.gov/search/collections?scroll=true"

As a harvesting client, I want to identify the collection metadata which was deleted after a given date.

As a harvesting client, I want to tell which collections have added granules since the last time I harvested.

Initiate scroll session

curl -i "https://cmr.earthdata.nasa.gov/search/collections?has_granules_created_at=2017-07-12T20:06:38.331Z,&page_size=2000&scroll=true"

Scroll until no more results are returned

curl -i -H "CMR-Scroll-Id: <scroll id>" "https://cmr.earthdata.nasa.gov/search/collections?scroll=true"
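Note the trailing comma in the has_granules_created_at value above: the parameter takes a range, and the example supplies only its start. The sketch below builds the same query string programmatically so the comma is not dropped by accident. The helper name `since` is hypothetical, and reading the empty end as "no upper bound" is an assumption based on the example.

```python
from urllib.parse import urlencode


def since(start):
    """Return a range value with a start and no end: 'start,'."""
    return f"{start},"


query = urlencode(
    {
        "has_granules_created_at": since("2017-07-12T20:06:38.331Z"),
        "page_size": 2000,
        "scroll": "true",
    },
    safe=":,",  # keep ':' and ',' readable, as in the curl example
)
print(query)
# has_granules_created_at=2017-07-12T20:06:38.331Z,&page_size=2000&scroll=true
```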

As a harvesting client, I want to retrieve only the granule metadata which was newly added to the CMR after a given date.

Find collections which have newly added granules (same as the use case above "As a harvesting client, I want to tell which collections have added granules since the last time I harvested").
1. Initiate scroll session to capture collection-ids
```
curl -i "https://cmr.earthdata.nasa.gov/search/collections?has_granules_created_at=2017-07-12T20:06:38.331Z,&page_size=2000&scroll=true"
```
2. Scroll until no more results are returned
```
curl -i -H "CMR-Scroll-Id: <scroll id>" "https://cmr.earthdata.nasa.gov/search/collections?scroll=true"
```

Initiate scroll session to retrieve granule results (pass in the collection IDs captured from step 1).

curl -i -g "https://cmr.earthdata.nasa.gov/search/granules?created_at=2017-07-12T20:06:38.331Z&collection_concept_id[]=C1214471197-ASF&collection_concept_id[]=C1214470533-ASF&page_size=2000&scroll=true"

Scroll until no more results are returned

curl -i -H "CMR-Scroll-Id: <scroll id>" "https://cmr.earthdata.nasa.gov/search/granules?scroll=true"
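The granule query in this flow repeats the collection_concept_id[] parameter once per collection found in the first step. A minimal sketch of assembling that URL follows; `granule_scroll_url` is a hypothetical helper, and the sketch assumes the server decodes percent-encoded brackets in the usual way.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

CMR_SEARCH = "https://cmr.earthdata.nasa.gov/search"


def granule_scroll_url(created_at, collection_ids, page_size=2000):
    """Build the URL that opens a granule scroll session for several collections."""
    pairs = [("created_at", created_at)]
    # The array-style parameter is repeated once per collection.
    pairs += [("collection_concept_id[]", cid) for cid in collection_ids]
    pairs += [("page_size", page_size), ("scroll", "true")]
    # urlencode percent-encodes the brackets, colons, and other reserved characters.
    return f"{CMR_SEARCH}/granules?{urlencode(pairs)}"


url = granule_scroll_url(
    "2017-07-12T20:06:38.331Z",
    ["C1214471197-ASF", "C1214470533-ASF"],  # IDs captured in the first step
)
# Decoding the query string recovers the repeated parameter in order.
print(parse_qs(urlsplit(url).query)["collection_concept_id[]"])
# ['C1214471197-ASF', 'C1214470533-ASF']
```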

As a harvesting client, I want to retrieve only the granule metadata which was revised after a given date.

Currently there is no way to find collections whose granule metadata was revised after a given date (only collections with newly created granules can be found). The client will need to provide either a list of collection concept-ids or a provider id to limit the search.

Example with provider-id of ASF:

Initiate scroll session

curl -i "https://cmr.earthdata.nasa.gov/search/granules?updated_since=2017-07-12T20:06:38.331Z&provider=ASF&page_size=2000&scroll=true"

Scroll until no more results are returned

curl -i -H "CMR-Scroll-Id: <scroll id>" "https://cmr.earthdata.nasa.gov/search/granules?scroll=true"

Repeat the steps for each provider the client is harvesting.

Example with a list of collection concept IDs:

Initiate scroll session passing in all of the collection concept IDs the client is interested in. Note that the CMR will also accept a POST with a Content-type of application/x-www-form-urlencoded if there are a large number of collection concept IDs to pass in.
```
curl -i -g "https://cmr.earthdata.nasa.gov/search/granules?updated_since=2017-07-12T20:06:38.331Z&collection_concept_id[]=C1214471197-ASF&collection_concept_id[]=C1214470533-ASF&page_size=2000&scroll=true"
```

Scroll until no more results are returned

curl -i -H "CMR-Scroll-Id: <scroll id>" "https://cmr.earthdata.nasa.gov/search/granules?scroll=true"
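For the POST variant mentioned above, the same parameters move from the URL into a form-encoded body. The sketch below only builds the request object and does not send it. The helper name `granule_scroll_post` is hypothetical, and it assumes that scroll and page_size may travel in the form body alongside the other parameters.

```python
from urllib.parse import urlencode
from urllib.request import Request

GRANULE_SEARCH = "https://cmr.earthdata.nasa.gov/search/granules"


def granule_scroll_post(updated_since, collection_ids, page_size=2000):
    """Build (but do not send) a form-encoded POST that opens a scroll session."""
    pairs = [("updated_since", updated_since)]
    pairs += [("collection_concept_id[]", cid) for cid in collection_ids]
    pairs += [("page_size", page_size), ("scroll", "true")]
    return Request(
        GRANULE_SEARCH,
        data=urlencode(pairs).encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = granule_scroll_post(
    "2017-07-12T20:06:38.331Z",
    ["C1214471197-ASF", "C1214470533-ASF"],
)
print(request.get_method(), len(request.data), "bytes")
# Passing the request to urllib.request.urlopen would send it.
```

Because the body has no practical length limit compared with a URL, this form suits clients that track a long list of collections.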

As a harvesting client, I want an Atom feed of new and updated granules.

Identical to the use case above "As a harvesting client, I want to retrieve only the granule metadata which was revised after a given date" except the calls to the granules endpoints should use the extension .atom.

For example:

curl -i -H "CMR-Scroll-Id: <scroll id>" "https://cmr.earthdata.nasa.gov/search/granules.atom?scroll=true"

As a harvesting client, I want to identify the granule metadata which was deleted after a given date.

Support to be added with CMR-4300.

Synchronizing External Systems with the CMR

Data providers have use cases which involve ensuring an external system is synchronized with the CMR.
