2019-02-16 - Specification Workshops on Automated Metrics and Machine Learning

Attendees

Action Items

Actions

David Auty - Cancel development of one of the objectives for this Program Increment: CMR-5485.
David Auty - Evaluate CMR-5486 for a shorter-term return on investment, within 4 weeks from today. Without a strong indication of promise, this objective should be cancelled. Consider cancellation at this point and adoption of an alternative objective, e.g., development of a "naive" or simplistic approach, possibly with a feedback or tuning aspect to improve performance.

Agenda

Review and disposition concerns and issues documented in Size Estimate Domain, Specification Workshop notes for Automated Metrics and Machine Learning for Size Estimation Service.

Discussion

During the discussion of these two objectives, concern was repeatedly raised about the return on investment, or lack thereof, of the increasingly complex approach to size estimation.
It was noted that there appear to be two separate use cases for a Size Estimation service:
- 1) As has been discussed and noted elsewhere (ref?): as feedback to users of, e.g., EarthData Search Client (EDSC) when preparing a request for processed data results (customized order), showing the estimated processed data size across all granules selected. This applies in particular to SDPS processing services, with increasingly large datasets, e.g. ICESat-2 at NSIDC.
  - As direct feedback to the user of the magnitude of their request
  - As a possible basis for limiting or throttling the request to protect the DAACs against excessively large and disruptive data requests.
- 2) A new use case for OPeNDAP processing services, when preparing the OPeNDAP processing URLs and showing the estimated size of a single granule - single URL.
  - Optionally, with the ability to request an estimate across all granules, while recognizing the time that estimating all URLs and summing the results might take. In this case, the user interface should allow for a longer response time for the estimation request.
- Potentially, some UI/UX evaluation of these use cases is necessary.
- The current Size Estimation Service tries to approximate the total estimated size across all granules, without having to evaluate the estimate for each granule. This makes the estimation itself much quicker, but reduces its accuracy.
- Both the discussion of Size Estimation Metrics and the Machine Learning approach to Size Estimation focused on the data processing results per granule, and on summing those results for a large request.
  - The Metrics effort is looking at OPeNDAP services and the availability of actual data per request, and at having to sum those results for a multi-granule request.
  - The Machine Learning approach favored evaluating the estimate per granule, due to the availability of a richer set of feature attributes.
- Recognizing these distinct use cases for size estimation introduces questions about
  - focusing only on the first of these, as has been the case
  - the lack of clear understanding of the context for some of our efforts.
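
The speed-versus-accuracy trade-off between the two use cases can be sketched as follows. This is only an illustration: the notes do not describe how the current Size Estimation Service approximates the total, so sampling is used here as one hypothetical way to avoid evaluating every granule. The function names, the `estimate_one` callback, and the `sample_size` parameter are all assumptions.

```python
import random


def total_by_summation(granules, estimate_one):
    """Evaluate the estimate for every granule and sum the results.

    Slower for large requests, but uses every granule's own estimate.
    """
    return sum(estimate_one(g) for g in granules)


def total_by_sampling(granules, estimate_one, sample_size, seed=0):
    """Evaluate only a random sample of granules and scale up.

    Hypothetical shortcut: quicker than per-granule summation, but the
    result is only as good as the sample is representative.
    """
    if not granules:
        return 0.0
    rng = random.Random(seed)
    sample = rng.sample(granules, min(sample_size, len(granules)))
    mean_size = sum(estimate_one(g) for g in sample) / len(sample)
    return mean_size * len(granules)


# Illustrative data: each "granule" is just its estimated size in bytes.
granule_sizes = [10, 20, 30, 40, 50, 60]
exact = total_by_summation(granule_sizes, lambda size: size)
quick = total_by_sampling(granule_sizes, lambda size: size, sample_size=3)
```

When `sample_size` covers the whole request, the two functions agree; with a smaller sample the second does less work per request at the cost of accuracy.
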
Both Bob and Marty discussed earlier evaluations of size estimation metrics
- Bob's work is presented in SITC-504 - Getting issue details... STATUS and Fixed-compression rate approach
- Marty's work is presented in Comparing OPeNDAP-Based Size Estimates to SDPS Actuals
  - Marty, in this work, first introduced the idea of a "naive" approach to size estimation, as a first-approximation result and basis for comparison of different approaches - providing a very basic baseline for comparison purposes.
  - The naive approach does not consider details of granule or variable types or attributes. It assumes an average granule size, and a set of equally sized variables within the granules, thus yielding a relatively simple formula for estimating the size of a data processing request.
- These efforts noted that none of our more complex approaches has in practice worked out to be any more accurate than a very basic, naive approach.
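
A minimal sketch of what such a naive formula could look like, assuming the estimate is simply the number of granules times the share of an average granule taken up by the requested variables. The notes do not give the actual formula used in Marty's work, so the function name, parameters, and example values below are illustrative assumptions.

```python
def naive_size_estimate(granule_count, avg_granule_bytes,
                        vars_per_granule, vars_requested):
    """Naive size estimate for a data processing request.

    Assumes every granule has the same (average) size and that every
    variable in a granule takes an equal share of that size. Granule and
    variable types and attributes are deliberately ignored.
    """
    if granule_count < 0 or avg_granule_bytes < 0:
        raise ValueError("granule_count and avg_granule_bytes must be >= 0")
    if vars_per_granule <= 0:
        raise ValueError("vars_per_granule must be positive")
    if not 0 <= vars_requested <= vars_per_granule:
        raise ValueError("vars_requested must be between 0 and vars_per_granule")
    bytes_per_variable = avg_granule_bytes / vars_per_granule
    return granule_count * vars_requested * bytes_per_variable


# Illustrative numbers only: 1000 granules of 50 MB each,
# 5 of 20 equally sized variables requested.
estimate = naive_size_estimate(1000, 50_000_000, 20, 5)
print(f"{estimate / 1e9:.1f} GB")  # prints "12.5 GB"
```

A function this small is cheap enough to serve as the baseline against which any more complex approach is compared.
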
Moving forward, work on size estimation should include:
- Size Estimation for SDPS services - currently well underway and nearly completed - for large orders of granules
- Size Estimation for OPeNDAP services -
  - for individual granules,
  - in particular for the Path-Finder collections established for End-to-End service provision by EDSC and CMR.
  - The Path-Finder collections represent a set of relatively easy collections, as a starting point both for End-to-End service provision and size estimation.
Towards the end of the discussion it was determined -
- The current Size Estimation Metrics effort, given the outstanding questions represented in the Specification Workshop notes referenced above and the newly outlined use cases (see above), seems premature and of questionable value. It was recommended that this effort be stopped immediately, with efforts redirected to more productive work.
- The current effort to develop a machine learning approach to size estimation is similarly of questionable value given its planned duration and complexity.
  - It was noted that we have been pursuing increasingly complex approaches for too long, with poor accuracy to show for it.
  - If perhaps in four weeks from now we could show the value of a machine learning approach, that would be an acceptable level of effort (vs. an entire PI).
  - Consideration should be given to whether convincing results can be achieved in four weeks to warrant continuing with this effort at this time.
- Potentially, an alternative objective could be pursued to develop the "naive" approach for size estimation in the remaining duration of the Program Increment.
  - There was some discussion of feedback and tuning of such an approach, but further definition and evaluation is necessary.
