Help with document scoring

Created by Daine Wright, last modified by Anonymous on Apr 15, 2017

I'm exploring the CMR Search API. I'm running a keyword search restricted to ORNL DAAC data, with results returned in JSON format.

All of the entries come back with a score of 0.5. Am I using the query wrong, or is there a deeper problem?

Here's the query I'm using:

https://cmr.earthdata.nasa.gov/search/collections.json?keyword=above&data_center=ORNL_DAAC&sort_key[]=%2Bscore&sort_key[]=-start_date&pretty=true&offset=0&page_size=10
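
For anyone reproducing this, the same query can be assembled programmatically. This is only a minimal Python sketch: `build_cmr_url` is a hypothetical helper written for illustration, not part of any CMR client, and it simply mirrors the parameters in the URL above.

```python
from urllib.parse import urlencode

CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.json"


def build_cmr_url(keyword, data_center, page_size=10, offset=0):
    """Assemble the collection search URL used in this post (hypothetical helper)."""
    params = [
        ("keyword", keyword),
        ("data_center", data_center),
        ("sort_key[]", "+score"),       # urlencode turns "+" into %2B
        ("sort_key[]", "-start_date"),
        ("pretty", "true"),
        ("offset", offset),
        ("page_size", page_size),
    ]
    # safe="[]" keeps the brackets in "sort_key[]" literal instead of %5B%5D.
    return CMR_COLLECTIONS_URL + "?" + urlencode(params, safe="[]")


url = build_cmr_url("above", "ORNL_DAAC")
print(url)
```

The resulting string can be passed to `urllib.request.urlopen` and decoded with `json.load` to inspect the scores. Keywords containing spaces are encoded with `+` rather than `%20`, which is also valid in a query string.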

Here is the first entry as an example:

    "entry" : [ {
      "time_start" : "2014-08-10T00:00:00.000Z",
      "score" : 0.5,
      "boxes" : [ "68.4 -149.5 69.15 -148.7" ],
      "online_access_flag" : true,
      "id" : "C1000000540-ORNL_DAAC",
      "browse_flag" : false,
      "summary" : "This data set includes estimates of permafrost Active Layer Thickness (ALT; cm), and calculated uncertainties, derived using a ground-penetrating radar (GPR) system in the field in August 2014 near Toolik Lake and Happy Valley on the North Slope of Alaska. GPR measurements were taken along 10 transects of varying length (approx. 1 to 7 km). Traditional ALT estimates from mechanical probing every 100 to 500 m along each transect are also included. These data are suitable for future studies of how ALT varies over relatively large geological features, such as hills and valleys, wetland areas, and drained lake basins.",
      "time_end" : "2014-08-15T00:00:00.000Z",
      "coordinate_system" : "CARTESIAN",
      "original_format" : "ECHO10",
      "processing_level_id" : "3",
      "data_center" : "ORNL_DAAC",
      "archive_center" : "ORNL_DAAC",
      "links" : [ {
        "rel" : "http://esipfed.org/ns/fedsearch/1.1/data#",
        "hreflang" : "en-US",
        "href" : "https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=1265"
      }, {
        "rel" : "http://esipfed.org/ns/fedsearch/1.1/metadata#",
        "hreflang" : "en-US",
        "href" : "https://daac.ornl.gov/ABOVE/guides/ReSALT_ALT_GPR.html"
      } ],
      "dataset_id" : "Pre-ABoVE: Ground-penetrating Radar Measurements of ALT on the Alaska North Slope",
      "title" : "Pre-ABoVE: Ground-penetrating Radar Measurements of ALT on the Alaska North Slope",
      "organizations" : [ "ORNL_DAAC" ],
      "short_name" : "doi:10.3334/ORNLDAAC/1265",
      "updated" : "2015-09-25T00:00:00.000Z",
      "orbit_parameters" : { },
      "version_id" : "1"
    },
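
A quick way to confirm that every entry really shares one score is to tally the `score` values in the response. The sketch below assumes the entries sit under a top-level `feed` object (an assumption, since only the `entry` array is shown above) and uses a small made-up payload in place of a live request; `score_histogram` is a hypothetical helper name.

```python
import json
from collections import Counter


def score_histogram(payload):
    """Count how many entries share each score value."""
    # Assumption: entries live under payload["feed"]["entry"].
    entries = payload.get("feed", {}).get("entry", [])
    return Counter(entry.get("score") for entry in entries)


# Made-up sample payload standing in for a real response; the second id is a placeholder.
sample = json.loads("""
{"feed": {"entry": [
    {"id": "C1000000540-ORNL_DAAC", "score": 0.5},
    {"id": "C0000000000-EXAMPLE", "score": 0.5}
]}}
""")

hist = score_histogram(sample)
all_tied = len(hist) == 1  # True when every entry has the same score
print(hist, all_tied)
```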


14 Comments

user-7b92a
The score is based on how well the document matches your keyword, "above". The score will be the same for every document that contains "above" in the same set of fields. If you change your keyword search to "above terra", some documents come back with different scores: those with "terra" in more fields get a higher score.
That's an explanation of how it works now. Search relevancy is an area we'd like to improve, and feedback is important for that. When searching the ORNL collections with "above", would you expect some collections to have a higher relevancy than others? If so, would you mind identifying a couple that should have higher relevancy and some that should have lower relevancy?
- Permalink
- Feb 07, 2017
Daine Wright
Just to complicate things, ABoVE is a project. I would hope that any project match for 'above' is weighted higher than a match in the summary (or another field).
My issue is that all the scores are 0.5 regardless of the keyword used in the query. Surely all the records don't match equally well. If they do all match equally well, search relevancy needs more than just improvement.

Here is another example using another project name (LBA) that isn't a common word: https://cmr.earthdata.nasa.gov/search/collections.json?keyword=lba&data_center=ORNL_DAAC&pretty=true&offset=0&page_size=10
- Permalink
- Feb 07, 2017
Daine Wright
Here is a list of ABoVE datasets that I expect should have high relevancy to the 'above' search: http://daac.ornl.gov/cgi-bin/dataset_lister.pl?p=34
There are 18 collections. I expect those 18 collections to be at the top of the list of 140 results.

And for completeness, here is a list of LBA datasets: http://daac.ornl.gov/cgi-bin/dataset_lister.pl?p=11
- Permalink
- Feb 07, 2017
1. user-7b92a
  I think this is a case where a change to the metadata could improve things. The word "above" is a little generic, so it shows up in datasets that are unrelated to the project, like this CARVE collection: https://cmr.earthdata.nasa.gov:443/search/concepts/C1000000520-ORNL_DAAC/13
  I looked at the metadata for the ABOVE collections and the campaigns have "Arctic-Boreal Vulnerability Experiment" as the short name and long name. If the short name were changed to "ABOVE" then the CMR would find it in that field and give it a boost because the project matches. I'll see if I can contact someone about making that potential change.
  Permalink
  
  Feb 07, 2017
  1. Daine Wright
    We are working on an effort to improve the metadata. If that's the issue, we can update it on our end.
    But the scores are all 0.5 when I use "Arctic-Boreal Vulnerability Experiment" as the keyword.
    https://cmr.earthdata.nasa.gov/search/collections.json?keyword=Arctic-Boreal%20Vulnerability%20Experiment&data_center=ORNL_DAAC&pretty=true&offset=0&page_size=10
    
    Permalink
    
    Feb 07, 2017
Doug Newman
Ben McMurry - Is user-7b92a's suggestion an option for ORNL?
- Permalink
- Feb 07, 2017
Ben McMurry
Yes, it is an option. Not trivial, but definitely doable. But I think if we are going to do that we want to understand the impact on relevancy ahead of time. It's not something where we could afford to do a lot of trial and error.
- Permalink
- Feb 07, 2017
Ben McMurry
In my last comment I was referring to metadata changes in general, not just changing the short name. We intend to make metadata changes in the future to better describe our data, and changing things to improve relevancy at the same time makes sense. But I wouldn't want to get into a habit of making lots of changes over time to chase a high relevancy score.
- Permalink
- Feb 07, 2017
1. user-7b92a
  I agree with you. I wouldn't want to have you "chasing a high relevancy score". This case is about updating the metadata to reflect the terms in use (ABOVE is the project name) and to enable users to find the collections most relevant to them.
  Permalink
  
  Feb 07, 2017
Daine Wright
The scores are all 0.5.
Something has to be wrong other than metadata.
- Permalink
- Feb 13, 2017
1. user-7b92a
  Yes, the scoring algorithm is something that we want to improve in the CMR. These collections contain the word "above" in the following fields (using the UMM names):
  Entry Title (Aka Dataset Id)
  Abstract (aka description)
  Related URL Description
  Not all of them contain "above" in all of those fields. We don't have a score boost configured for any of those fields, so a match in one of them doesn't change the final score. Other fields, like project short name, are boosted now: if a document contains a matching term in the project short name, it will have a higher score than one that does not.
  We should definitely improve the scoring on the CMR side. For now, changing the project short name from "Arctic-Boreal Vulnerability Experiment" to "ABOVE" will help in this specific case and seems like a good metadata improvement to make.
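
The boost behaviour described in this comment can be illustrated with a toy model. This is not CMR's actual scoring code: the field names, the boost value of 1.3, and the rule of multiplying boosts are all assumptions chosen only to show why matches in unboosted fields tie while a match in a boosted field ranks higher.

```python
# Toy model only: field names and boost values are hypothetical,
# not CMR's actual configuration.
FIELD_BOOSTS = {"project_short_name": 1.3}


def toy_score(doc, keyword, boosts=FIELD_BOOSTS):
    """Multiply the boosts of every field that contains the keyword."""
    score = 1.0
    needle = keyword.lower()
    for field, text in doc.items():
        if needle in text.lower():
            # Fields without a configured boost contribute a factor of 1.0.
            score *= boosts.get(field, 1.0)
    return score


docs = {
    "title_only": {
        "entry_title": "Pre-ABoVE: Ground-penetrating Radar Measurements",
        "abstract": "Permafrost active layer thickness estimates.",
        "project_short_name": "Arctic-Boreal Vulnerability Experiment",
    },
    "title_and_abstract": {
        "entry_title": "Pre-ABoVE: Ground-penetrating Radar Measurements",
        "abstract": "Measurements taken above the tree line.",
        "project_short_name": "Arctic-Boreal Vulnerability Experiment",
    },
    "renamed_project": {
        "entry_title": "Pre-ABoVE: Ground-penetrating Radar Measurements",
        "abstract": "Permafrost active layer thickness estimates.",
        "project_short_name": "ABOVE",
    },
}

scores = {name: toy_score(doc, "above") for name, doc in docs.items()}
print(scores)
```

In this sketch the first two documents tie even though one matches in more fields, and only the document whose project short name contains the keyword scores higher.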
  Permalink
  
  Feb 13, 2017
  1. Daine Wright
    Then shouldn't a search for keyword=Arctic-Boreal Vulnerability Experiment have improved scores? That's not what I am getting:
    https://cmr.earthdata.nasa.gov/search/collections.json?keyword=Arctic-Boreal%20Vulnerability%20Experiment&data_center=ORNL_DAAC&pretty=true&offset=0&page_size=10
    
    Permalink
    
    Feb 13, 2017
    1. user-7b92a
      
      It should improve the scores, and you're right that it doesn't. I took a look at what's going on, but I'm unable to determine why that's happening. I filed a CMR bug so that we can look into it: CMR-3823.
      
      Permalink
      
      Feb 13, 2017
      1. Daine Wright
        
        Thanks for putting that in. Is there a way for me to check the status of the bug?
        
        Permalink
        
        Apr 04, 2017
