Specification Workshops
CMR-5446 - Spec Workshop for Agreement on New UMM-Var Attributes (SES for SDPS)
Across all teams affected within Data Use Train: (SES, MDQ, CMR, MMT, UVG)
CMR-5482 - EDSC-SES Agreement on SES call for SDPS Size Estimate (SES for SDPS)
Agreement is between CMR/Service-Bridge/SES and EDSC as the principal client.
As of PI-18.4:
https://cmr.sit.earthdata.nasa.gov/service-bridge/docs/current/rest-api#examples
GET: .../service-bridge/size-estimate/collection/<concept-id>
i.e. collection-id is a "parameter", but is part of the URL that precedes the query string
Note that, apparently, a short-name+vers-id is accepted in lieu of concept-id, though I'm not sure how well tested this option is.
- ?granules=<> which is a comma-separated list for two or more granules,
or &granules[]=<>, which is a repeated parameter for each granule in the query string
- &variables=<> which is a comma-separated list for two or more variables,
or &variables[]=<>, which is a repeated parameter for each variable in the query string
- &format=<> parameter is optional; if not provided, the format of nc (NetCDF3) is assumed (see proposed change to this default below)
- Currently supported format values are:
- &total-granule-input-bytes=<> is required when the format selected is compressed, such as nc4.
- This parameter represents the size of the granules when no subsetting operation is being performed.
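The parameter list above can be sketched as a small URL builder. This is a minimal sketch, not the service's own client: the function name, the `base_url` argument (the host prefix is elided in these notes and is left to the caller), the example concept IDs, and the assumption that `nc4` is the only compressed format are all hypothetical.

```python
from urllib.parse import quote, urlencode

# Assumption: only "nc4" is named in the notes as a compressed format.
COMPRESSED_FORMATS = ("nc4",)


def build_size_estimate_url(base_url, collection_id, granules, variables,
                            fmt=None, total_granule_input_bytes=None,
                            repeated=False):
    """Build a size-estimate GET URL from the parameters described above.

    base_url is the service-bridge root, supplied by the caller.
    repeated=True emits granules[]=/variables[]= once per value instead of
    a single comma-separated value.
    """
    if fmt in COMPRESSED_FORMATS and total_granule_input_bytes is None:
        raise ValueError(
            "total-granule-input-bytes is required for compressed formats")

    # The collection ID is part of the path, not the query string.
    path = "{}/size-estimate/collection/{}".format(
        base_url.rstrip("/"), quote(collection_id, safe=""))

    params = []
    if repeated:
        params += [("granules[]", g) for g in granules]
        params += [("variables[]", v) for v in variables]
    else:
        params.append(("granules", ",".join(granules)))
        params.append(("variables", ",".join(variables)))
    if fmt is not None:
        params.append(("format", fmt))
    if total_granule_input_bytes is not None:
        params.append(("total-granule-input-bytes",
                       str(total_granule_input_bytes)))

    # Keep commas and brackets literal so the URL reads like the notes.
    return path + "?" + urlencode(params, safe=",[]")


if __name__ == "__main__":
    # Hypothetical host and concept IDs, for illustration only.
    print(build_size_estimate_url(
        "https://example.invalid/service-bridge", "C1-PROV",
        ["G1-PROV", "G2-PROV"], ["V1-PROV"]))
```

Omitting `fmt` leaves the `format` parameter out entirely, so the service-side default applies.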
Proposed:
- &variable_aliases=a1,a2...
- Primarily to accommodate SDPS service handling where the Echo Form returns variable aliases and the variable-concept IDs are not known.
- Either &variables or &variable_aliases may be specified. Or, I imagine, both, though this is not likely from e.g. EDSC. I believe one or the other is required.
- &service_id=<concept_id> for the service record describing the data processing service (which will have the service-type field, either opendap or esi),
- so that we can clearly identify the request type,
- and both kinds of requests (service_types) can make use of concept-id/aliases.
- Default is current behavior, i.e., opendap, but it is expected that EDSC will include service_id in SES calls in all cases.
- &format=<> parameter accepts new values:
- &format=<> accepts new values that match previously accepted formats
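The proposed parameters can be sketched as a small validation helper. This is only an illustration of the rule as stated in these notes: the function name is hypothetical, and the "one or the other is required" check encodes a belief expressed above rather than confirmed service behavior.

```python
def proposed_query_params(variables=None, variable_aliases=None,
                          service_id=None):
    """Return (name, value) query pairs for the proposed parameters.

    Assumption from the notes: at least one of variables or
    variable_aliases must be given; both together are tolerated.
    """
    if not variables and not variable_aliases:
        raise ValueError("specify variables, variable_aliases, or both")

    params = []
    if variables:
        params.append(("variables", ",".join(variables)))
    if variable_aliases:
        params.append(("variable_aliases", ",".join(variable_aliases)))
    # When service_id is omitted, the notes say current (opendap)
    # behavior is the default, so nothing is added here.
    if service_id is not None:
        params.append(("service_id", service_id))
    return params
```

The returned pairs can be passed to `urllib.parse.urlencode` alongside the existing granule and format parameters.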
CMR-5447 - SES agreement on Automated Metrics approach (OPeNDAP Log Evaluation for SES Metrics)
- Feature and Objective: Automate Reporting Metrics on Estimates vs. Actuals
- Incorporate into Service - i.e. collect metrics as you go (if possible)
- Modular, Extensible implementation built around operational service
- Use actual Data Request Service data
- Focus on OPeNDAP Service
- However -
- SES is not yet operational, and End-to-End Services are only marginally operational, thus getting actual and operational results this PI is challenging.
- Coordinating with OPeNDAP Service providers for actual return data sizes is challenging.
- Initial Proposed Approach - Forward Evaluation of Metrics
Forward in the sense of starting from requests for size estimation and finding corresponding actuals for metrics
- SES feature to upload "Actuals" data and write to SES log
- from OPeNDAP logs, but could evolve over time.
- SES Update to log additional data - request parameters, size estimate,
- possibly also - OPeNDAP URLs using OUS.
- External feature to process logs into a report - a Splunk script.
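The forward evaluation above amounts to joining logged estimates with uploaded actuals and reporting the difference. The notes name a Splunk script for the reporting step; the Python below is only a stand-in sketch of the same join. The `request_id` join key and all field names are hypothetical, since the notes do not define the log format.

```python
def estimate_vs_actual_report(estimates, actuals):
    """Join estimates to actuals and compute the relative error per request.

    estimates: dict of request_id -> estimated bytes (from the SES log)
    actuals:   dict of request_id -> actual bytes (uploaded "Actuals")
    Requests with no actual, or a zero-byte actual, are skipped.
    """
    rows = []
    for request_id, estimated in estimates.items():
        actual = actuals.get(request_id)
        if not actual:
            continue  # nothing to compare against
        rows.append({
            "request_id": request_id,
            "estimated_bytes": estimated,
            "actual_bytes": actual,
            # Positive means the estimate was too high.
            "relative_error": (estimated - actual) / actual,
        })
    return rows
```

Starting the loop from `estimates` rather than `actuals` is what makes this the forward direction: requests without a matching actual simply drop out of the report.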
- Alternative Approach - Reverse Evaluation of Metrics
Reverse in the sense of starting from Actuals data (logs) and deriving estimates to match
- Add a functional feature to scan an OPeNDAP log to determine OPeNDAP URLs processed
- read log, determine estimates, write report.
- Reverse Engineer one or more SES requests that map to the OPeNDAP URLs found, in order to request an estimate for the given actuals.
- Note that OPeNDAP URLs are at the “Job” level, per granule, and do not reflect a request which spans multiple granules. Size Estimation is on the basis of requests.
- Note that OPeNDAP URLs may contain array subsetting requests ([]) for each variable. (Ultimately not yet relevant as SES does not support such requests).
- Use of reverse evaluation requires CMR data, including UMM-Var for SES to evaluate estimates.
- Problems of populating production data in CMR? Or of correlating production data against SIT data?
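The reverse-engineering step above can be sketched as parsing each per-granule OPeNDAP URL, dropping any array subsetting, and grouping URLs that ask for the same variables. This is a naive heuristic under stated assumptions: it presumes DAP2-style URLs whose query string starts with a comma-separated projection, the function names are hypothetical, and grouping by variable set is only a guess at which jobs belonged to one request. Mapping paths back to CMR granule concept IDs is not shown.

```python
import re
from collections import defaultdict
from urllib.parse import unquote, urlsplit


def parse_opendap_url(url):
    """Return (path, variables) for one per-granule OPeNDAP URL."""
    parts = urlsplit(url)
    constraint = unquote(parts.query)
    # The projection precedes the first '&'; selection clauses follow it.
    projection = constraint.split("&", 1)[0]
    # Drop array subsetting such as [0:1:10]; SES does not support it yet.
    projection = re.sub(r"\[[^\]]*\]", "", projection)
    names = {name.strip() for name in projection.split(",") if name.strip()}
    return parts.path, tuple(sorted(names))


def group_into_requests(urls):
    """Group per-granule URLs by requested variable set.

    Assumption: jobs with identical variable sets came from one request.
    """
    groups = defaultdict(list)
    for url in urls:
        path, variables = parse_opendap_url(url)
        groups[variables].append(path)
    return dict(groups)
```

Each resulting group approximates one multi-granule request for which an SES estimate could then be requested.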
- How to collect “Actuals”
- Original plan is to use OPeNDAP logs
- working with OPeNDAP/Hyrax team to revise logging to facilitate metrics reporting
- working with tomcat/apache logs may suffice in the short term.
- GES-DISC has reported that logs are deleted after reporting to EMS
- so back data is not available, but data going forwards might be.
- Can be demonstrated and evaluated using SDPS-EDF environment (OPeNDAP services).
- Issues:
- Are we simply trying to get more evaluation data for the current implementation of SES?
- Do we need evaluation based upon "real" data sooner rather than later?
- We can provide better evaluation results from SDPS, but not at the same large scale as e.g. GES-DISC, and only assuming we can resolve the reverse-evaluation challenges.
- Spatial and Temporal Subsetting are not yet considered in our estimation efforts.
CMR-5448 - SES agreement on Machine Learning approach
- Per Granule evaluation with Summation
- Vs. Per Request evaluation - with e.g. 1000 granules
- Per Request with large orders IS THE TARGET!
- However -
- Per Granule feature options are much richer, including clearly relevant features
- Per Granule-Set has very few features to consider (spatial/temporal request parameters)
- May have to prove performance is acceptable for e.g., 2000 granule requests.
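The "Per Granule evaluation with Summation" option can be sketched as fitting a model on per-granule data and summing its predictions over every granule in a request. This is a minimal sketch, not a chosen method: it assumes a single numeric feature per granule (hypothetical) and uses ordinary least-squares linear regression, which is only one of the candidate approaches under discussion.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # raises ZeroDivisionError if all xs are equal
    return slope, mean_y - slope * mean_x


def estimate_request_bytes(model, granule_features):
    """Sum per-granule predictions to get a per-request estimate."""
    slope, intercept = model
    return sum(slope * x + intercept for x in granule_features)
```

The summation is linear in the number of granules, so a 2000-granule request costs 2000 predictions; whether that is acceptable is the performance question raised above.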
- Collection-Specific evaluation
- Evaluating Collection-Specific Features first, i.e. features that appear relevant and useful for a specific target collection but may not apply to the next collection of interest or to all collections,
- Can generalize later
- Issue is ease with which DAACs can bring a new collection, with services, into the system
- Collection-Specific set of features and ML approach will complicate this
- Again, a generic collection approach lacks a rich set of features for evaluation
- Avoiding Spatial/Temporal aspects of the request for now
- Using Spatial/Temporal aspects of selected granules, e.g. h- & v- grid location of granules.
- Feature definition is specific to instrument/science-data being captured.
- e.g. separate regional definitions for snow, ice, soil-moisture, sea-specific, land-specific
- Choice of (clustering | classification | linear-regression | ... ) approach
- Local Efforts vs. GHRC Efforts (others?)