If you have any useful posted documentation on using CMR to do Granule Reconciliation, please point me to it.  If anyone (CMR or Providers) has opinions about indexing LastUpdate, or ideas on how to reconcile provider granules against CMR, please chime in.

I'd like to reconcile my provider granule records against CMR using a CMR (REST) interface.  In the past I reconciled LAADS provider holdings against ECHO using the DataManagementService API GetDatasetInformation (WSDL+FTP) for a specified collection and a LastUpdate time range covering a few days to a few weeks at a time.  In my database I have a table of products sent to ECHO that is indexed both by GranuleUR and by collection+LastUpdate.  At the moment LastUpdate is not searchable in CMR.  I'm told it could be indexed and made available for search, but whether that is worth doing might depend on whether it is useful to others.  With FTP ingest it was necessary to reconcile recently sent metadata once, which made the LastUpdate time condition critically useful; but with REST I can skip that, since I assume that if CMR gives me an accept message the granule metadata won't be dropped after that.

The most obvious alternative might be to search by revision_date.  Under normal circumstances with REST that is presumably slightly greater than the LastUpdate I supply, which is the time at which I gather metadata to send, for about 10 products at a time.  (That time is useful for me to record because, for example, if I get an associated browse product after that time, I queue the data product to be resent in 15 minutes with whatever browse URLs are then available.)  However, the REST site might be unavailable for 15 minutes or take minutes to respond, as I've seen ECHO REST do; there are always small clock skews; and until last week all LAADS granules were ingested via ECHO FTP, which sometimes took days.  All of that makes comparison of LastUpdate to revision_date difficult.  While I might be able to add revision_date to my table, I don't know whether it is even returned in a granule accept message or would require a separate query.  What I like about LastUpdate (vs revision_date) is that I supply it and you record it, so we both agree on the value.

The second obvious alternative might be to search by GranuleUR range.  However, in my system I store the numeric part of my GranuleUR, and CMR search would treat it as a character field.  I could possibly create a functional convert-to-varchar index on my (PostgreSQL) table to make the search useful - that may be the easiest solution under my control, assuming I only care about full reconciliation and not reconciliation of metadata sent in a specified (e.g. recent) time range.  (Our reprocessing usually reuses old GranuleURs, so our GranuleUR has no relation to any range of LastUpdate.)
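
For illustration, a minimal sketch of that functional index in Python/psycopg2 (the table and column names echo_files and granule_num are placeholders, not my actual schema):

import psycopg2

# Create a functional index on the numeric granule id cast to varchar, so lookups
# and joins against the character-form GranuleUR values returned by CMR can use an
# index instead of a per-row cast and scan.
conn = psycopg2.connect("dbname=provider_db")
with conn, conn.cursor() as cur:
    cur.execute(
        "CREATE INDEX IF NOT EXISTS echo_files_granule_ur_txt_idx "
        "ON echo_files ((granule_num::varchar))"
    )
conn.close()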

Finally, as an aside, I'm interested in some idea of how indexing works in your system.  Specifically, if I query on a specified collection/entry_id (short_name + version) and a range of, say, revision_date, does your system use a composite index (entry_id + revision_date), would it use only the most useful of those indexes or do some sort of bitmap join, or are all search conditions in some way useful as indexes rather than just filters?  Your CMR Client Partner User Guide#BestPracticesforQueries seems to imply that specifying collection + range is helpful - does that mean your searchable attributes are indexed across all collections and within each collection, or just within each collection for some attributes?  Is a separate collection + GranuleUR range query for each of my collections better than a single query of provider + GranuleUR range?

 


6 Comments

  1. user-7b92a

    I'd like to know what your goals are with reconciliation. Depending on your goals there are different solutions that would work. There are different kinds of problems that can happen and different guards that we can put in place to prevent them. Do you want to know if recent data has made it to the CMR? It sounds from your statement "With FTP ingest it was necessary to reconcile recently sent metadata once, ... but with REST I can skip that as I assume ... the granule metadata won't be dropped after that." like that's not something you're worried about. There are different levels of assurance that two systems are in sync.

    - All of the data matches bit for bit
    - All of the ids of items match
    - Recent data matches bit for bit
    - Recent data ids match
    - Some sort of incremented version counter is the same on all items.

    Some of those are much harder than others.

    We index our data in Elasticsearch. Each collection's granules are put into a separate index, except for smaller collections, which are grouped together in one big index. Putting a collection id in your query will target a single index in Elasticsearch, which is a much smaller set of data to query over. If you use a provider id instead of a collection id it will still choose a subset of all of the indexes, but it won't be as fast.
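
    To make that concrete, a rough sketch of the two query shapes (the collection concept id C12345-LAADS is a made-up placeholder):

    import requests

    CMR = "https://cmr.earthdata.nasa.gov/search/granules.json"

    # Scoped to one collection: the query only has to hit that collection's index.
    by_collection = requests.get(CMR, params={
        "collection_concept_id": "C12345-LAADS",  # placeholder id
        "page_size": 10,
    })

    # Scoped to a provider: still a subset of the indexes, but a larger one, so slower.
    by_provider = requests.get(CMR, params={"provider": "LAADS", "page_size": 10})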

    We have a similar sort of synchronization issue between our relational database and the ElasticSearch index. We need to make sure they are in sync. We do this by saving every update as a new immutable revision with a monotonically increasing revision id. Every new revision is indexed asynchronously using a message queue with retries on any failure. The revision id makes sure that older data doesn't accidentally overwrite newer data in our search index. Because each revision is immutable we can index it multiple times. We don't need any kind of locking.
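
    Conceptually, the guard is just a comparison on the revision id before writing to the search index; a toy sketch (not our actual code):

    # Toy model of the revision-id guard: an older revision arriving late can never
    # overwrite a newer one, and re-indexing the same revision is harmless.
    search_index = {}  # concept-id -> (revision-id, metadata)

    def index_revision(concept_id, revision_id, metadata):
        current = search_index.get(concept_id)
        if current is None or revision_id >= current[0]:
            search_index[concept_id] = (revision_id, metadata)
        # else: a stale revision from the queue; drop it

    index_revision("G1-PROV", 2, {"granule_ur": "SOME_GRAN:1"})
    index_revision("G1-PROV", 1, {"granule_ur": "stale"})  # ignored, 1 < 2
    assert search_index["G1-PROV"][0] == 2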

    You can specify a revision id on ingest and CMR will respect it. If you ingest a granule with revision id 2 and later ingest the same granule (same native id) with a revision id of 1, it won't overwrite revision 2. Depending on your goals for reconciliation and the problems you're trying to avoid, that might be one solution.

    1. user-d1d1d

      Unless you tell me I should worry about reconciling recent granules, I assume that is not necessary if REST returned an accept code when they were submitted.

      Otherwise, maybe once every few months, I want to check every granule in a specified collection (with LastUpdate not too recent), delete from CMR any GranuleUR that my records say should not be there, and ingest into CMR any GranuleUR that should be there but is not, or whose LastUpdate time does not match my records.  If the LastUpdate time matches I'm willing to assume all other metadata matches.  If I went beyond that I might double-check browse URLs and possibly the temporal start date/time, but I don't consider that necessary.
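
      In other words, something along these lines (a sketch of the intent, not working code; the two dicts and the delete/ingest callables are placeholders for my real queries and submission scripts):

      # cmr  = {granule_ur: last_update} pulled from CMR for one collection
      # mine = {granule_ur: last_update} from my provider database
      def reconcile(cmr, mine, delete_from_cmr, ingest_to_cmr):
          for ur in cmr.keys() - mine.keys():
              delete_from_cmr(ur)            # in CMR but not supposed to be there
          for ur in mine.keys() - cmr.keys():
              ingest_to_cmr(ur)              # should be there but is missing
          for ur in mine.keys() & cmr.keys():
              if mine[ur] != cmr[ur]:        # LastUpdate differs, so re-ingest
                  ingest_to_cmr(ur)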

      1. user-7b92a

        That's right: if you received an accept code from the REST API, you don't need to worry about whether the data was saved in the CMR.

        Retrieving the granule ur and last update of every granule in a collection would be a new capability within the CMR, since we're talking about millions of records. If we could sort by granule ur and had a way to return just granule ur and last update, then we could do it through our REST API as a new search response format. We currently only allow returning 2000 results, so we'd have to increase this to allow iterating through all of the granules in a collection. For each request you'd take the granule ur of the last granule in the previous call and ask for records greater than that. That would allow paging through all the granules in a collection. The max page size would be somewhere between 10,000 and 100,000, based on a reasonable amount of data to return in a request. You could go through a collection with 10 million granules in 100 requests of 100K granules each.

        The API would look something like this:

        GET https://cmr.earthdata.nasa.gov/search/granules?concept_id=C1-LAADS&sort_by=granule_ur&granule_ur=SOME_GRAN:12345
        Accept: application/vnd.nasa.cmr.reconcile+json

        [["SOME_GRAN:12346","2012-01-02T00:00:00Z"],["SOME_GRAN:12347","2012-01-02T00:00:00Z"]...]

         

        Do you think something like this would work for your needs? We'd need to consider the reconciliation needs of everyone before we go build this. We're trying to make sure that we build reusable APIs that work for everyone and avoid specific one-off solutions. This reconciliation format might be one of those cases.

         

         

  2. user-d1d1d

    The API (URL) you suggest would fit the GranuleUR approach I mentioned above and would work for reconciling an entire collection.  (I'd assumed a granule_ur range worked already, but it doesn't, although sort_key=granule_ur does.)  Sorting and paging by LastUpdate within a collection would also work.  I think you need something in there to specify the limited metadata in the result set, like ".../granules.rec-echo10?..." or ".../granules?attrs=granule_ur,last_update&...".  A means of specifying which metadata attributes should be in results, possibly in some shortened format, could be generally useful.  A result set of 2000 at a time is fine - my script would stick those into a temp table for SQL comparisons against my ECHO_Files table and would want the size to be somewhere between 1000 and 10000 max anyway.  I don't mind making several thousand calls/iterations.
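
    Once a batch is in a temp table the comparison itself is plain SQL; roughly (psycopg2; the temp table and column names are illustrative):

    import psycopg2

    conn = psycopg2.connect("dbname=provider_db")
    with conn, conn.cursor() as cur:
        cur.execute("CREATE TEMP TABLE cmr_batch (granule_ur varchar, last_update timestamptz)")
        # ... INSERT the 1000-10000 rows from one CMR response here ...
        # Granules CMR has that I don't, or whose LastUpdate disagrees with mine:
        cur.execute("""
            SELECT b.granule_ur
            FROM cmr_batch b
            LEFT JOIN echo_files e ON e.granule_ur = b.granule_ur
            WHERE e.granule_ur IS NULL OR e.last_update <> b.last_update
        """)
        mismatches = cur.fetchall()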

    Please do talk to other providers to see what they want, and then eventually add a "Granule Reconciliation" section to your data provider guide documentation.

  3. I had exactly this same issue. I need to periodically reconcile our full dataset against CMR, looking at granule_ur and last_update. I ended up doing it in 2 phases. One iterates through the source and checks to see if the granules are in CMR. Without checking my code, I believe I am also comparing the last update date. Granules are added/updated in CMR as needed. The other phase reads all the granule_urs from CMR. It checks to see if they're in the source, and if not, deletes them from CMR. Getting everything from CMR was tricky: we also navigate our source dataset by last_update date, so that's what I had hoped to use, but couldn't. Because of the 2000 result limit, I ended up iterating through CMR using the temporal range, incrementing the beginning date (to be the last value from the previous call) each time - there's a rough sketch of the loop after the example queries below. This returned some duplicate concept-ids, so I also excluded the last one of those from the previous call. I don't think that totally eliminated duplicate results, but because of how I am using the results, a few duplicates are harmless.

    https://cmr.uat.earthdata.nasa.gov/search/granules.echo10?page_size=400&page_num=1&short_name=Landsat7_ETM_Plus&version=1&provider=USGS_EROS&temporal=1950-01-01T00:00:00Z,2999-12-31T00:00:00Z&exclude[concept-id][]=G0000000-USGS_EROS&sort_key[]=start_date

    https://cmr.uat.earthdata.nasa.gov/search/granules.echo10?page_size=400&page_num=1&short_name=Landsat7_ETM_Plus&version=1&provider=USGS_EROS&temporal=1999-11-12T08:45:28.657Z,2999-12-31T00:00:00Z&exclude[concept-id][]=G1215427498-USGS_EROS&sort_key[]=start_date

    https://cmr.uat.earthdata.nasa.gov/search/granules.echo10?page_size=400&page_num=1&short_name=Landsat7_ETM_Plus&version=1&provider=USGS_EROS&temporal=1999-11-26T20:23:09.391Z,2999-12-31T00:00:00Z&exclude[concept-id][]=G3169687-USGS_EROS&sort_key[]=start_date
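
    In Python terms the loop is roughly this (a sketch; here I use the .json flavor, whose entries carry an "id" concept-id and a "time_start", instead of parsing the echo10 XML, but the iteration is the same):

    import requests

    # Each pass restarts the temporal range at the last start_date seen and excludes
    # that granule's concept-id, mirroring the example queries above.
    url = "https://cmr.uat.earthdata.nasa.gov/search/granules.json"
    params = {
        "page_size": 400,
        "short_name": "Landsat7_ETM_Plus",
        "version": 1,
        "provider": "USGS_EROS",
        "temporal": "1950-01-01T00:00:00Z,2999-12-31T00:00:00Z",
        "sort_key[]": "start_date",
    }
    while True:
        entries = requests.get(url, params=params).json().get("feed", {}).get("entry", [])
        if not entries:
            break
        # ... check each entry against the source system here ...
        last = entries[-1]
        params["temporal"] = last["time_start"] + ",2999-12-31T00:00:00Z"
        params["exclude[concept-id][]"] = last["id"]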

  4. Thanks Lisa.  When I saw your comment early this summer I realized that giving up on LastUpdateTime and using data time was a key insight.  Not too long after I saw your comment, EOSDIS released a scroll feature (see CMR Harvesting Best Practices), which could be used to eliminate duplicates, although that's actually fairly minor.  Months later I've finally had a few days to work on this, and I have something that seems to work.  I'm using a single pass, with batches ordered by start_date.  I compare a CMR batch against my database for batch.min(start_date) <= my_db.start_date < batch.max(start_date), and retain all granules with max(start_date) in the next batch, since granules with the same start_date might be split into consecutive batches.  I also check my db for granules before the first batch.min(start_date) seen, and after the last batch.max(start_date).  (I hope to add a feature to check CMR vs granules I send to other DAACs (Providers) which they send to CMR, but time comparisons would be trickier due to the way I index time, so I'd have to retain more CMR batch data to loosely match start_time, or do two passes.)
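
    The batch-boundary handling is the only subtle part; roughly (a sketch - db_granules_between stands in for my database query and compare for the actual field checks):

    def reconcile_by_start_date(cmr_batches, db_granules_between, compare):
        """cmr_batches yields lists of (granule_ur, start_date) sorted by start_date.
        db_granules_between(lo, hi) is a placeholder for my database query (None = open end).
        compare(cmr_side, db_side) does the checks for the range lo <= start_date < hi."""
        carry, prev_hi, first = [], None, True
        for batch in cmr_batches:
            batch = carry + batch
            lo, hi = batch[0][1], batch[-1][1]
            if first:
                compare([], db_granules_between(None, lo))   # my granules before the first batch
                first = False
            carry = [g for g in batch if g[1] == hi]          # same start_date may spill into the next batch
            compare([g for g in batch if g[1] < hi], db_granules_between(lo, hi))
            prev_hi = hi
        compare(carry, db_granules_between(prev_hi, None))    # the tail, plus my granules after the last batch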

    In case anyone has any interest, here's an example of my CMR queries in curl form for all granules in one collection:

    curl -i "https://cmr.earthdata.nasa.gov/search/granules.json?scroll=true&provider_short_name=LAADS&short_name=MOD03&version=6&sort_key=start_date&page_size=2000"

    followed by multiple calls to the following (assuming 2114772067 is the CMR-Scroll-Id value in the header of the response to the above call):

    curl -i -H "CMR-Scroll-Id: 2114772067" "https://cmr.earthdata.nasa.gov/search/granules.json?scroll=true"
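
    Put together, the whole scroll loop in rough Python form (same scroll=true parameter and CMR-Scroll-Id header as the curl calls above; the comparison step is left as a placeholder):

    import requests

    url = "https://cmr.earthdata.nasa.gov/search/granules.json"
    params = {
        "scroll": "true",
        "provider_short_name": "LAADS",
        "short_name": "MOD03",
        "version": 6,
        "sort_key": "start_date",
        "page_size": 2000,
    }

    # The first call returns the first batch plus a CMR-Scroll-Id response header;
    # repeating the call with that header pages through the rest until a batch is empty.
    first = requests.get(url, params=params)
    scroll_id = first.headers["CMR-Scroll-Id"]
    batch = first.json().get("feed", {}).get("entry", [])
    while batch:
        # ... compare this batch against my database here ...
        nxt = requests.get(url, params={"scroll": "true"},
                           headers={"CMR-Scroll-Id": scroll_id})
        batch = nxt.json().get("feed", {}).get("entry", [])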