Size Estimate Domain

Specification Workshops

CMR-5446 - Spec Workshop for Agreement on New UMM-Var Attributes (SES for SDPS)

Across all teams affected within Data Use Train: (SES, MDQ, CMR, MMT, UVG)

Data-Use Release Train Technical Tag-up held 1/28/2019, 2:00p

(Coordinated by Data-Use Program Manager - Kathy Carr)

Attended by David Auty and Siwei Xu from the SES team.

Concurrence was reached on new UMM-Var Version and the sequence of dependency and usage of UMM-Var Versions throughout the Data Use Train.

Link to UMM-Var Version 1.4: https://cdn.earthdata.nasa.gov/umm/variable/v1.4/

Link to Kathy's minutes: 2019-01-28 Meeting notes

Link to UMM Deployment Plan: 19.1 UMM-x Progression

CMR-5482 - EDSC-SES Agreement on SES call for SDPS Size Estimate (SES for SDPS)

Agreement is between CMR/Service-Bridge/SES and EDSC as the principal client.

As of PI-18.4:

https://cmr.sit.earthdata.nasa.gov/service-bridge/docs/current/rest-api#examples

GET: .../service-bridge/size-estimate/collection/<concept-id>
i.e. collection-id is a "parameter", but is part of the URL that precedes the query string
- Note that, apparently, a short-name+vers-id is accepted in lieu of concept-id, though I'm not sure how well tested this option is.
?granules=<> which is a comma-separated list for two or more granules,
or &granules[]=<>, which is a repeated parameter for each granule in the query string
&variables=<> which is a comma-separated list for two or more variables,
or &variables[]=<>, which is a repeated parameter for each variable in the query string
&format=<> parameter is optional; if not provided, the format of nc (NetCDF3) is assumed
- (see proposed change to this default below)
Currently supported format values are:
- dods (binary)
- nc
- nc4
- ascii
&total-granule-input-bytes=<>
- this is required when the format selected is compressed, such as nc4.
- This parameter represents the size of the granules when no subsetting operation is being performed.
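
The call described above can be sketched in Python as follows. This is a minimal illustration, not the Service-Bridge client: the helper name size_estimate_url, the base URL (taken from the SIT docs link above), and the rule that only nc4 triggers the total-granule-input-bytes check are assumptions made for the example.

```python
from urllib.parse import urlencode

# Assumed base URL, inferred from the SIT docs link; adjust per environment.
BASE = "https://cmr.sit.earthdata.nasa.gov/service-bridge"


def size_estimate_url(collection_id, granules, variables, fmt=None,
                      total_granule_input_bytes=None, repeated=False):
    """Build a GET URL for the size-estimate call (hypothetical helper)."""
    # Assumption: nc4 is the only compressed format checked in this sketch.
    if fmt == "nc4" and total_granule_input_bytes is None:
        raise ValueError("total-granule-input-bytes is required for nc4")
    params = []
    if repeated:
        # Repeated-parameter style: granules[]=G1&granules[]=G2
        params += [("granules[]", g) for g in granules]
        params += [("variables[]", v) for v in variables]
    else:
        # Comma-separated style: granules=G1,G2
        params.append(("granules", ",".join(granules)))
        params.append(("variables", ",".join(variables)))
    if fmt is not None:
        # When omitted, the service assumes nc, per the notes above.
        params.append(("format", fmt))
    if total_granule_input_bytes is not None:
        params.append(("total-granule-input-bytes",
                       str(total_granule_input_bytes)))
    # Keep commas and brackets unescaped so the URL matches the notes.
    query = urlencode(params, safe=",[]")
    return f"{BASE}/size-estimate/collection/{collection_id}?{query}"
```

The concept-id sits in the URL path while everything else goes in the query string, which mirrors the "parameter, but part of the URL" note above. Pass repeated=True to get the granules[] / variables[] form.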

Proposed:

Add a new, optional, API parameter: &variable_aliases=a1,a2...
- Primarily to accommodate SDPS service handling where the Echo Form returns variable aliases and the variable-concept IDs are not known.
Either &variables or &variable_aliases may be specified. Or, I imagine, both - though this is not likely from e.g. EDSC. I believe one or the other is required.
Also add a new, optional, parameter: &service_id=<concept_id>
for the service record describing the data processing service (will have the service-type field, either opendap or esi).
- so that we can clearly identify the request type,
- and both kinds of requests (service_types) can make use of concept-id/aliases.
- Default is current behavior, i.e., opendap
  but it is expected EDSC will incorporate in SES calls in all cases.
&format=<> parameter accepts new values:
- (format comes from Echo Form for SDPS service handling)
  format         maps-to (UMM-Var)
  shapefile      esri_shapefile
  tabular_ascii  ascii
  geotiff        geotiff
  <null>         native
- SES will map a given request format value to compression information within UMM-Var per format as shown in the maps-to column above.
- If not provided, default is dependent on service type:
  - OPeNDAP => nc,
  - ESI => null (no-change, native, no-reformat)
&format=<> accepts new values that match previously accepted formats

- (format comes from Echo Form for SDPS service handling)
  format         maps-to (UMM-Var)
  NetCDF         nc
  NetCDF4-CF     nc4
  NetCDF4        nc4
  NetCDF-4       nc4
- SES will map a given request format value to compression information within UMM-Var per format as shown in the maps-to column above.
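
The proposed format handling can be sketched as a lookup plus a service-type default. This is an illustration only: the function name resolve_format, the exact-match (case-sensitive) lookup, and the pass-through of the currently supported values are assumptions, not agreed SES behavior.

```python
# Request format value -> UMM-Var key, per the two proposed tables.
FORMAT_MAP = {
    "shapefile": "esri_shapefile",
    "tabular_ascii": "ascii",
    "geotiff": "geotiff",
    "NetCDF": "nc",
    "NetCDF4-CF": "nc4",
    "NetCDF4": "nc4",
    "NetCDF-4": "nc4",
}

# Values already supported today; assumed to pass through unchanged.
EXISTING_FORMATS = {"dods", "nc", "nc4", "ascii"}


def resolve_format(fmt=None, service_type="opendap"):
    """Map a request format to the UMM-Var key (hypothetical helper)."""
    if not fmt:
        # Proposed defaults: OPeNDAP => nc, ESI => native (no reformat).
        return "nc" if service_type == "opendap" else "native"
    if fmt in EXISTING_FORMATS:
        return fmt
    if fmt in FORMAT_MAP:
        return FORMAT_MAP[fmt]
    raise ValueError(f"unsupported format: {fmt!r}")
```

For example, resolve_format("NetCDF-4") gives "nc4", and resolve_format(None, "esi") gives "native". The service_type default of "opendap" follows the proposal that the absence of &service_id keeps current behavior.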

CMR-5447 - SES agreement on Automated Metrics approach (OPeNDAP Log Evaluation for SES Metrics)

Feb. 15, 3:00p

Following discussion on this feature, it was decided to cancel development at this time. See meeting notes here: 2019-02-16 - Specification Workshops on Automated Metrics and Machine Learning

Feature and Objective: Automate Reporting Metrics on Estimates vs. Actuals

Incorporate into Service - i.e. collect metrics as you go (if possible)

Modular, Extensible implementation built around operational service

Use actual Data Request Service data
Focus on OPeNDAP Service

However -

SES is not yet operational, and End-to-End Services are only marginally operational,
thus getting actual and operational results this PI is challenging.

Is this a future goal?

Coordinating with OPeNDAP Service providers for actual return data sizes is challenging.

Initial Proposed Approach - Forward Evaluation of Metrics
Forward in the sense of starting from requests for size estimation and finding corresponding actuals for metrics

SES feature to upload "Actuals" data and write to SES log

from OPeNDAP logs, but could evolve over time.

SES Update to log additional data - request parameters, size estimate,

possibly also - OPeNDAP URLs using OUS.

External feature to process logs into a report - a Splunk script.
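
As a rough illustration of the reporting step, here is a Python stand-in for the log-processing script (not Splunk). It assumes the SES log and the uploaded "Actuals" have already been reduced to dictionaries keyed by a shared request id; that key, the function name error_report, and the choice of summary statistics are all hypothetical.

```python
from statistics import mean, median


def error_report(estimates, actuals):
    """Summarize estimate-vs-actual error over matched request ids.

    estimates, actuals: dict of request id -> size in bytes (assumed shape).
    """
    matched = sorted(set(estimates) & set(actuals))
    # Signed relative error: positive means the estimate was too high.
    rel = [(estimates[r] - actuals[r]) / actuals[r]
           for r in matched if actuals[r] > 0]
    report = {
        "matched": len(rel),
        # Estimates with no corresponding actual in the uploaded data.
        "unmatched_estimates": len(set(estimates) - set(actuals)),
    }
    if rel:
        report["mean_rel_error"] = mean(rel)
        report["median_abs_rel_error"] = median(abs(e) for e in rel)
    return report
```

The unmatched count matters for the forward approach, since a size-estimate request does not necessarily lead to a data request that shows up in the actuals.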

Alternative Approach - Reverse Evaluation of Metrics
Reverse in the sense of starting from Actuals data (logs) and deriving estimates to match

Add a functional feature to scan an OPeNDAP log to determine OPeNDAP URLs processed
- read log, determine estimates, write report.
Reverse Engineer one or more SES requests that map to the OPeNDAP URLs found, in order to request an estimate for the given actuals.

Note that OPeNDAP URLs are at the “Job” level, per granule, and do not reflect a request which spans multiple granules. Size Estimation is on the basis of requests.
Note that OPeNDAP URLs may contain array subsetting requests ([]) for each variable. These may not map readily to EDSC-level requests for services, with bbox and temporal subsetting specifications. (Ultimately not yet relevant as SES does not support such requests).
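
A sketch of the URL-scanning step, in Python. It assumes the OPeNDAP URLs have already been extracted from the log (the log format itself is not specified here) and that they use DAP2-style projections such as var[start:stride:stop],var2 in the query string. The function names and the example host are hypothetical.

```python
import re
from urllib.parse import urlsplit, unquote

# A variable name followed by zero or more [..] array subsets.
_VAR = re.compile(r"^([^\[\]]+)((?:\[[^\[\]]*\])*)$")


def parse_opendap_url(url):
    """Return (granule_path, {variable: [subset, ...]}) for one URL."""
    parts = urlsplit(url)
    constraint = unquote(parts.query)
    # Selection clauses after '&' are ignored in this sketch.
    projection = constraint.split("&", 1)[0]
    variables = {}
    for item in filter(None, projection.split(",")):
        m = _VAR.match(item)
        if m:
            variables[m.group(1)] = re.findall(r"\[([^\[\]]*)\]", m.group(2))
    return parts.path, variables


def group_by_granule(urls):
    """Merge variables per granule path, since each URL covers one granule."""
    grouped = {}
    for url in urls:
        path, variables = parse_opendap_url(url)
        grouped.setdefault(path, {}).update(variables)
    return grouped
```

Note that the returned path still carries the response suffix (for example .nc), and that regrouping granules into the original multi-granule request is not attempted; that correlation is the open problem noted above.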

Use of reverse evaluation requires CMR data, including UMM-Var for SES to evaluate estimates.

problems of populating production data in CMR? or correlating production data against SIT data?

How to collect “Actuals”

Original plan is to use OPeNDAP logs

working with OPeNDAP/Hyrax team to revise logging to facilitate metrics reporting
working with tomcat/apache logs may suffice in the short term.
GES-DISC has reported that logs are deleted after reporting to EMS
- so back data is not available, but data going forwards might be.
Can be demonstrated and evaluated using SDPS-EDF environment (OPeNDAP services).

May need to switch to using EMS vs. OPeNDAP logs.

Issues:
- Is Reverse-Evaluation a "one-off", or a basis for on-going metrics evaluation?
  - Are we simply trying to get more evaluational data for the current implementation of SES?
- Do we need evaluation based upon "real" data sooner rather than later?
  - We can provide better evaluational results from SDPS, but not the same large scale as e.g. GES-DISC.
    Assuming we can resolve reverse-evaluation challenges.
- Spatial and Temporal Subsetting are not yet considered in our estimation efforts.

CMR-5448 - SES agreement on Machine Learning approach

Feb. 15, 3:00p

Following discussion on this feature, it was decided to evaluate the potential of this objective over a shorter, near-term timeframe, and potentially cancel development. See meeting notes here: 2019-02-16 - Specification Workshops on Automated Metrics and Machine Learning

Per Granule evaluation with Summation
- Vs. Per Request evaluation - with e.g. 1000 granules
- Per Request with large orders IS THE TARGET!
- However -
  - Per Granule feature options are much richer, including clearly relevant features
  - Per Granule-Set has very few features to consider (spatial/temporal request parameters)
- May have to prove performance is acceptable for e.g., 2000 granule requests.
Collection-Specific evaluation
- Evaluating Collection-Specific Features first, i.e. features that appear relevant and useful for a specific target collection but may not apply to the next collection of interest or to all collections,
- Can generalize later
- Issue is ease with which DAACs can bring a new collection, with services, into the system
  - Collection-Specific set of features and ML approach will complicate this
- Again, a generic collection approach lacks a rich set of features for evaluation
Avoiding Spatial/Temporal aspects of the request for now
- Using Spatial/Temporal aspects of selected granules, e.g. h- & v- grid location of granules.
- Feature definition is specific to instrument/science-data being captured.
  - e.g. separate regional definitions for snow, ice, soil-moisture, sea-specific, land-specific
Choice of (clustering | classification | linear-regression | ... ) approach
Local Efforts vs. GHRC Efforts (others?)
- Collaboration?
- Review?
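
To make "Per Granule evaluation with Summation" concrete, here is a minimal Python sketch using a one-feature linear regression. The choice of linear regression, the single numeric feature per granule, and the function names are assumptions for illustration only; no approach has been selected.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (one feature, closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def estimate_request(model, granule_features):
    """Predict a size per granule, then sum over the whole request."""
    a, b = model
    return sum(a + b * x for x in granule_features)
```

Training uses per-granule (feature, actual size) pairs, while the estimate for a large order is the sum of per-granule predictions. The cost grows linearly with the number of granules, which is the performance question raised above for e.g. 2000-granule requests.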
