What We're Trying To Do

The IceBridge Portal shows a polar stereographic projection of either the northern or southern hemisphere, depending on the user's interest. For the selected hemisphere, we show a list of collections that (a) are IceBridge datasets and (b) have a bounding box that intersects the hemisphere being viewed (e.g., for the northern hemisphere, a bounding box of [-180, 0, 180, 90]). For each collection, we show two counts: (i) the number of granules that match the user's current temporal and spatial filters, and (ii) the total number of granules in the hemisphere for that collection. So it looks something like this if the user hasn't set any temporal or spatial filters:

Dataset               Granules in constraint   Granules in hemisphere
IAKST1B Version 001   123                      123
IDCSI4 Version 001    46                       46
IDHDT4 Version 001    204                      204

When the user changes their temporal or spatial filters, we'd like to update the first count (the number of granules matching those filters), e.g.:

Dataset               Granules in constraint   Granules in hemisphere
IAKST1B Version 001   13                       123
IDCSI4 Version 001    0                        46
IDHDT4 Version 001    93                       204
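
A minimal sketch of how the two counts could be paired up per collection for display. The function name `count_rows` and the dictionary inputs are hypothetical, not the portal's actual code; the sample numbers are the ones from the example above.

```python
def count_rows(datasets, in_constraint, in_hemisphere):
    """Pair each dataset with (granules in constraint, granules in hemisphere).

    A dataset missing from `in_constraint` is given a count of 0, on the
    assumption that no granules matched the user's current filters.
    """
    return [
        (name, in_constraint.get(name, 0), in_hemisphere[name])
        for name in datasets
    ]


datasets = ["IAKST1B Version 001", "IDCSI4 Version 001", "IDHDT4 Version 001"]
in_hemisphere = {
    "IAKST1B Version 001": 123,
    "IDCSI4 Version 001": 46,
    "IDHDT4 Version 001": 204,
}
# Counts after the user narrows the temporal/spatial filters.
in_constraint = {"IAKST1B Version 001": 13, "IDHDT4 Version 001": 93}

for name, matched, total in count_rows(datasets, in_constraint, in_hemisphere):
    print(f"{name:<22}{matched:>5}{total:>6}")
```

Only the first count changes when the filters change; the hemisphere totals can be fetched once and reused.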

Along with this list view, there is a map which displays granules that they've selected to view. For now, I'm not interested in the granule queries we're doing against CMR.

What We're Doing

Currently we're doing something bad and inefficient in IceBridge Portal. This was an initial attempt to get feedback to the user more quickly so they could see a more responsive interface. So basically I'm apologizing in advance for what I'm about to say.

We do three types of queries to populate the list with counts shown above:

  • To get the list of collections, we do the following CMR query once when the user first arrives at the portal:
https://cmr.earthdata.nasa.gov/search/collections.json?keyword=icebridge&page_size=100&temporal=2009-01-01T07:00:00.000Z,2016-01-28T16:19:34.258Z&bounding_box=-180,0,180,90
  • To get the total number of granules for each collection in the hemisphere the user is currently viewing, we issue a separate query to CMR like the following:
https://cmr.earthdata.nasa.gov/search/collections.json?keyword=icebridge&page_size=100&bounding_box=-180,0,180,90&include_granule_counts=true&concept_id=C1000000341-NSIDC_ECS

This gets the number of granules for the entire northern hemisphere for one specific collection. So here's the bad part: we issue this query for each collection returned from query #1 above (!). For IceBridge, this is 50-60 queries.

Whenever a user changes temporal or spatial filters, we issue query #2 again, this time with the user's temporal and spatial filters (this is the third type of query), e.g.:

https://cmr.earthdata.nasa.gov/search/collections.json?keyword=icebridge&page_size=100&temporal=2009-01-01T07:00:00.000Z,2016-02-01T21:15:36.639Z&polygon=-54.28856753366417,70.20590793535574,-53.76512809420709,69.05817167534093,-51.035282396612224,69.18466321053025,-51.39895662736889,70.340404805645,-54.28856753366417,70.20590793535574&include_granule_counts=true&concept_id=C1000000180-NSIDC_ECS

Again, we know this is bad, but we do a separate query for each collection in the list (~50-60 collections, so ~50-60 queries). The upside of doing queries 2 and 3 this way is that, even though we issue lots of queries, users see results start coming back immediately rather than waiting 5 seconds to get all the results at once.
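
The fan-out described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the portal's actual code: `per_collection_url`, `fan_out`, and the injected `fetch` callable are hypothetical names, and the cap of 6 workers is an arbitrary stand-in for a browser's concurrent-connection limit.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.parse import urlencode

CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.json"


def per_collection_url(concept_id, filters):
    """Build a query-#2-style URL for a single collection."""
    params = {
        "keyword": "icebridge",
        "page_size": 100,
        **filters,  # e.g. bounding_box, temporal, polygon
        "include_granule_counts": "true",
        "concept_id": concept_id,
    }
    # Leave commas and colons unescaped so the URL reads like the examples above.
    return CMR_COLLECTIONS_URL + "?" + urlencode(params, safe=",:")


def fan_out(concept_ids, filters, fetch, max_workers=6):
    """Issue one request per collection; yield (concept_id, response) pairs
    in completion order so the UI can update as each one arrives.

    `fetch` is a hypothetical callable that takes a URL and returns a response.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {
            pool.submit(fetch, per_collection_url(cid, filters)): cid
            for cid in concept_ids
        }
        for future in as_completed(pending):
            yield pending[future], future.result()
```

Yielding in completion order is what gives the "results coming back immediately" effect, at the cost of one request per collection.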

What We'd Like To Do

  • Issue one CMR query to get a list of IceBridge collections in the hemisphere, along with the granule counts. E.g.:
https://cmr.earthdata.nasa.gov/search/collections.json?keyword=icebridge&page_size=100&temporal=2009-01-01T07:00:00.000Z,2016-01-28T16:19:34.258Z&bounding_box=-180,0,180,90&include_granule_counts=true

(This is just query #1 with granule counts.) We did start with this, but the query was taking 10-20s at the time, IIRC. That's why we switched to the set of queries shown above. This query now seems to run quite a bit faster than it did, so it's quite possible that we could switch back to it (see below).

  • When the user changes their filters, issue one CMR query to get a list of IceBridge collections with a specific temporal and spatial filter, along with matching granule counts. E.g.:

https://cmr.earthdata.nasa.gov/search/collections.json?keyword=icebridge&page_size=100&temporal=2009-01-01T07:00:00.000Z,2016-02-01T21:15:36.639Z&polygon=-54.28856753366417,70.20590793535574,-53.76512809420709,69.05817167534093,-51.035282396612224,69.18466321053025,-51.39895662736889,70.340404805645,-54.28856753366417,70.20590793535574&include_granule_counts=true
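
A rough sketch of the single-query approach: build one URL per filter change and read every collection's count out of the one response. The helper names are hypothetical, and the response shape (entries under `feed.entry`, each with `id` and `granule_count`) is an assumption that should be checked against an actual CMR response.

```python
from urllib.parse import urlencode

CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.json"


def collections_with_counts_url(filters):
    """Build one collections query that also asks for granule counts."""
    params = {
        "keyword": "icebridge",
        "page_size": 100,
        **filters,  # e.g. temporal plus bounding_box or polygon
        "include_granule_counts": "true",
    }
    # Leave commas and colons unescaped so the URL reads like the examples above.
    return CMR_COLLECTIONS_URL + "?" + urlencode(params, safe=",:")


def granule_counts(response):
    """Map collection concept ID -> granule count.

    ASSUMPTION: `response` is the parsed JSON body, with entries under
    response["feed"]["entry"], each carrying "id" and "granule_count".
    """
    entries = response.get("feed", {}).get("entry", [])
    return {entry["id"]: int(entry.get("granule_count", 0)) for entry in entries}


hemisphere_url = collections_with_counts_url({
    "temporal": "2009-01-01T07:00:00.000Z,2016-01-28T16:19:34.258Z",
    "bounding_box": "-180,0,180,90",
})

# Made-up sample payload, only to show the expected shape.
sample = {"feed": {"entry": [{"id": "C1000000341-NSIDC_ECS", "granule_count": 123}]}}
counts = granule_counts(sample)
```

With this shape, the hemisphere totals come from one request at page load and the filtered counts from one request per filter change, instead of one request per collection.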

Questions

  1. Are there more efficient ways of issuing these queries to CMR (e.g., other parameters, options, etc)?
  2. Are there optimizations that you can do in CMR that would improve the performance of these queries?
  3. Are there other ways of slicing and dicing this problem that you can think of?

13 Comments

  1. I executed the proposed collections query above a few times and found it taking between 2.5 seconds and 3 seconds with just about the entire time spent in the Elasticsearch query against the granule indexes. When I removed the spatial portion of the query it completes in under 200 ms, but I see you want to split it up by hemisphere. I'm having trouble coming up with something clever to effectively do the same thing.

    I find it interesting that issuing 51 separate queries for each of the 51 collections comes back faster than the single query. Are you issuing them all in parallel? Is it faster because the portal is starting to display results as soon as the first query finishes? We could look into whether internally we could improve performance by executing multiple queries.

    One change I would make to the proposed query is to add provider parameters (provider[]=NSIDC_ECS&provider[]=NSIDCV0) to guarantee you are limiting the search to only NSIDC providers. Since the keyword query for 'icebridge' is already effectively limiting the providers this does not yield a performance improvement currently. However I did find 2 collections in the SCIOPS provider which match with the keyword 'icebridge' when I dropped the spatial and temporal constraints, which I'm not sure you want to show in the portal. Also there is nothing preventing another provider from including 'icebridge' in a field like the summary in the future, which would make performance worse as additional indexes would need to be searched against.

    1. Thanks Chris.

      Yes, we're essentially executing the queries in parallel to the extent that the browser can (~6 concurrently). So the main advantage of doing this is getting quicker feedback in the browser that the portal is "doing something" based on the user changing their filters. The total elapsed time may in fact be longer, though I haven't done any controlled benchmarks.

      As I noted, I think these queries were running in the 10-15 second range, and they are now running faster than I remember them running, but I don't have any hard data on that. I think you're right that they are now in the 3 second range, though one day I ran them and I was getting in the 5-10 second range. It would be good if we could get consistent performance from them, if possible.

      Here's an idea I thought of based on your last paragraph: would there be a performance improvement by creating a tag for IceBridge datasets, and then removing our `keyword=icebridge` query parameter; replacing it with something like `?tag_namespace=org.nsidc&tag_value=icebridge`? I haven't used tags at all, but I assume there are good indexes created based on the tag namespace and values? I will try this if you think it might be a possible performance improvement.

      1. user-7b92a

        We made some spatial search performance improvements a few weeks ago so the 10-15 second time range going down to 3 seconds sounds about right. I don't think tagging will improve things as I think the bottleneck is most likely the spatial portion of the search. By the time the spatial portion of the search runs the keywords have already been used to identify the exact collections to search against.

  2. user-7b92a

    I think this is something we should investigate further to see if we can improve performance for these kinds of queries. I filed a new JIRA issue, CMR-2416.

    1. Thanks for creating that issue. I'm not able to access it though. Should I be able to?

      1. user-7b92a

        I didn't know who originally sent this to Doug. I added you to the issue now so you should be able to see it.

        1. I still don't have permission for some reason. I get redirected to the Earthdata Service Desk search page.

          1. For what it's worth, I asked Nemo in the SDPS PRB today if he could follow up to allow broader access to CMR-2416, as I, and I suspect other developers, can't access it either.

          2. user-7b92a

            Can you make sure that you're logged in as kbeam in URS and recheck one more time? I'll contact our help desk if it still doesn't work.

            1. Verified, and still unable to access.

  3. user-7b92a

    Kevin Beam, sorry that I missed your message about still being unable to access the issue. I've changed permissions and this will hopefully give you access now. I've put notes on the issue regarding testing and changes we've made. Here's a quick summary:

    • I remeasured performance of these searches and they are much better than you originally indicated. We've done some things to improve performance since then. Performance is still not at the level we want though.
    • We're currently testing a fix for rebalancing on Elasticsearch indexes that should help performance of these searches along with many other searches. That should roll out to Ops in a couple of weeks.
    • I used sample IceBridge granules to optimize our spatial intersection algorithms, which run as a plugin in Elasticsearch. I was able to achieve a 33% reduction in spatial intersection times. That should be deployed in a few weeks to our operational system.

    I'd encourage you to test our current performance now to see if it's acceptable while also knowing that we'll be pushing changes out in a few weeks to make it even better.

    I also have some ideas for additional performance improvements. One is CMR-2674 (a compiler improvement), which will potentially yield an additional 13% performance increase. Another is CMR-2681, which removes redundant points during spatial indexing (but does not modify the metadata) to reduce the number of calculations required while maintaining spatial accuracy. We'll have to prioritize those among future work. CMR-2674 is very likely to go into a future sprint soon. CMR-2681 is a bit more work, so that might take a bit longer.

    1. Jason, thank you for the updates and work on this. Subjectively, the queries do seem faster. We've also changed some of our queries to use the granule timeline search query, which has improved UI responsiveness since that query is quite fast. I look forward to the other improvements that are rolling out or that you're working on. Keep up the good work, we appreciate it.

      As a note, sometime in the next month or two, I think I'll be doing a talk on our use of the CMR API at a UTC meeting. We haven't finalized a date yet, but just wanted to let you know that may happen at some point soon. I think I'll be going into details on the queries we're running, how we're using that data in the IceBridge Portal, and related topics.

      1. user-7b92a

        Let me know when you're speaking at the UTC. I'd like to attend that one.