This page outlines the algorithms that will be used for estimating the size of the output files generated when sending subsetting and re-formatting requests to the OPeNDAP server. The fundamental concept is that each processing option is considered in the estimate and contributes to the final output size. The process uses the UMM-C, UMM-Var(s) and UMM-G(s) records to obtain information about the granules being processed. It is anticipated that the algorithm will evolve as we gain more experience with the estimation process and more collections are supported.
Here is a summary of the experiments so far; this will be updated as more information becomes available. In the following, "measurements" refers to the actual number of measurements, which is controlled by the number of variables selected and the dimensions and size of those variables. We are not sure of the bounds for these measurements, but we saw a low end of one variable containing an array of a couple thousand 2-byte integers and a high end of well over 100,000 4-byte floating point values. The "metadata" refers to the granule or "inventory" metadata, sometimes called "core" metadata; some environments include the metadata in outputs and some do not. In all the test cases, if the metadata was present it was not compressed.
netCDF3
When netCDF3 output is specified, the output is approximately equal to the estimated size using the "dimensions" and "datatype" in the UMM-Var plus the size of the metadata (if present).
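A minimal sketch of that calculation, assuming illustrative field names for the UMM-Var dimensions and data type (not the actual UMM schema):

```python
from functools import reduce

DATATYPE_BYTES = {"uint8": 1, "int16": 2, "int32": 4, "float32": 4, "float64": 8}

def nc3_estimate(selected_vars, metadata_bytes=0):
    """Estimated size = sum over selected variables of (product of dimensions * datatype size) + metadata."""
    total = 0
    for var in selected_vars:
        n_values = reduce(lambda a, b: a * b, var["Dimensions"], 1)
        total += n_values * DATATYPE_BYTES[var["DataType"]]
    return total + metadata_bytes

# Example: one 1200 x 1200 grid of 16-bit integers plus roughly 60 KB of granule metadata.
print(nc3_estimate([{"Dimensions": [1200, 1200], "DataType": "int16"}], 60_000))
```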
netCDF4
When netCDF4 is specified, the output file has the "measurements" compressed. However, the "metadata" is not compressed and it may be missing altogether if the input granule is in HDF-EOS format.
If the number of measurements in a variable is large then the netCDF4 output size can be difficult to predict; it requires a prediction of the compression rate for the measurements. One approach is to compare the original file size to the sum of the sizes of all the variables to determine a compression rate. This approach has the advantage of being calculable from the UMM-G and UMM-Var metadata alone. Another approach is to modify the UMM-Var to also store a "compression rate" to be used for estimating output sizes. We tried both approaches and the results are detailed below.
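A minimal sketch of the first approach, using illustrative names (only the UMM-G granule size and the UMM-Var variable sizes are needed):

```python
def whole_file_compression_rate(granule_size_bytes, metadata_bytes, size_of_all_variables):
    """Approximate rate = (compressed granule size - metadata) / uncompressed size of all variables.
    size_of_all_variables excludes the latitude/longitude variables that OPeNDAP adds."""
    return (granule_size_bytes - metadata_bytes) / size_of_all_variables

def predict_nc4_bytes(selected_variable_bytes, rate):
    """Predicted netCDF4 output for a request: apply the rate to the selected measurements."""
    return selected_variable_bytes * rate
```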
Note: If the number of measurements in the output is small (one variable with a small grid) and they compress well, then the size of the output file can be similar to the other formats; in this case the metadata and HDF overhead account for the majority of the output size. If the original file is also netCDF4 then the output is much larger than expected and can easily be larger than the netCDF3 output.
ASCII
When ASCII is specified, the output file does not contain the metadata. The output size is controlled by the actual measurements. Unfortunately it is not easy to calculate the output size because the ASCII text can require either more or less space than the binary equivalent. For example, a binary 4-byte "real" could take 10 bytes of text ("-999.99999") or as little as one byte ("0").
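One possible way to derive an ASCII factor (a sampling idea, not something measured on this page) is to compute an average bytes-per-value from a small formatted sample:

```python
# Hypothetical sketch: format a small sample of measurements the way the server would
# (separator included) and take the average number of ASCII bytes per value.
def ascii_bytes_per_value(sample_values, separator=", "):
    text = separator.join(str(v) for v in sample_values)
    return len(text) / len(sample_values)

sample = [0, -999.99999, 1.5, 27.125]
print(ascii_bytes_per_value(sample))  # varies with the data and the formatting
```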
Binary
When binary is specified, the output is approximately equal to the estimated size using the "dimensions" and "datatype" that are in the UMM-Var.
Conclusion:
We could experiment with a formula-based approach. We could start with something like the following and add rules as new special cases are found (a code sketch of this formula appears after the definitions below):
If output = netCDF4 then
output size = SizeMBDataGranule * 1024 * 1024 * VariableCompressionRate
else
output size = (compression * measurements) + metadata
Where:
Variable Compression Rate:
Compression rate determined by sampling during UMM-Var creation
measurements:
measurements = (sum(dimensions * datatype_size)) for the selected variables
compression:
if output = [nc3 | binary] then compression = 1
if output = ascii then compression = TBD
if output = nc4 & original_data_file is hdf
then compression = "average compression ratio from the UMM-Var"
NOTE: metadata is 0 if the original_data_file is hdf-eos
NOTE: removed and left here only for continuity:
---- then compression = ((SizeMBDataGranule * 1024 * 1024) - metadata) / size_of_all_variables
if output = nc4 & original_data_file is .nc4 then compression = TBD – often larger than original
if output = nc4 & original_data_file is not .nc4 or hdf then compression = TBD – need to identify examples and come up with a workable assumption
NOTE: the XDim and YDim variables often do not compress well so it would be good to experiment with treating them separately from the other variables.
metadata:
if output = [ASCII | binary] then metadata = 0
if output = nc4 and input granule is hdf-eos then metadata = 0
else metadata will be measured from the size of the UMM-G
size_of_all_variables:
This is the sum of the dimensions times the data type size for all of the variables in the OPeNDAP "dds" and thus the UMM-Var. However, this sum should not include the sizes for variables latitude and longitude; these two variables are created by OPeNDAP when processing the granule and don't actually appear in the original granule file.
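Here is a minimal Python sketch of the formula above; the field names and the TBD placeholders are illustrative, not part of the UMM schema:

```python
import math

def estimate_output_size(output_format, selected_vars, granule,
                         variable_compression_rate=None, ascii_factor=None):
    """Sketch of the formula above (illustrative field names, not the UMM schema).

    selected_vars: list of dicts with 'Dimensions' (list of ints) and 'DataTypeSize' (bytes)
    granule:       dict with 'SizeMBDataGranule', 'MetadataBytes' and 'OriginalFormat'
    """
    measurements = sum(math.prod(v["Dimensions"]) * v["DataTypeSize"] for v in selected_vars)

    if output_format == "nc4":
        # netCDF4: apply the sampled variable compression rate to the original granule size.
        # Metadata is not added here; it is 0 when the original file is HDF-EOS.
        return granule["SizeMBDataGranule"] * 1024 * 1024 * variable_compression_rate
    if output_format == "nc3":
        return measurements + granule["MetadataBytes"]    # compression = 1
    if output_format == "binary":
        return measurements                               # compression = 1, metadata = 0
    if output_format == "ascii":
        return measurements * ascii_factor                # ascii_factor is the TBD factor, metadata = 0
    raise ValueError(f"unknown output format: {output_format}")
```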
The following table details the size estimate algorithm for each case:
| Format | Algorithm | Notes |
|---|---|---|
| NC3 | ∑ (data type size * no. of variable values) + (size of UMM-G record) | |
| NC4 (original HDF-EOS) | ∑ (VCR * (SizeMBDataGranule * 1024 * 1024)) | VCR (variable compression rate) will be variable dependent and defined in the appropriate UMM-Var. It will be the 'average' compression rate for a variable based on a sampled subset of a small number of granules. |
| NC4 (original not NC4 or HDF-EOS) | ∑ (VCR * (SizeMBDataGranule * 1024 * 1024)) + (size of UMM-G record) | Further work is required to determine whether the same variable compression rate will work for this case; for example, what if the original file is not compressed? |
| ASCII | TBD | Metadata size is zero |
| Binary | ∑ (TBD * (data type size * no. of variable values)) | Compression rate for each variable is TBD |
The following graphs show the prediction accuracy for netCDF4 extractions from three popular collections when a ratio of the "size of all variables" to "the size of the granule" is used to predict the output file size. Again, this has the advantage that it can be done using the current UMM-Var and UMM-G definitions. All data was extracted as netCDF4 output from HDF-EOS files:
1) This graph shows pulling 2 variables from MOD11B1. This is the most popular type of request at LP DAAC. The predicted size usually seems to be smaller than the actual output; we believe this is because the variables being extracted don't compress as well as other variables in the file, so our calculated compression rate is "better" than what is actually achieved in practice. Further testing should determine whether this relationship extends to other collections. Extracting all the variables in the file in one request, which is also a popular option, might offset this issue with the calculated compression rate. There is also considerable variation in the compression based on tile location within the MODIS grid.
2) This shows 2 variables extracted from MOD09GA. Again the predicted size is often about 1/2 of the actual.
3) Extracting one variable from MOD10A1 is the most popular request at NSIDC and the accuracy for predicting the extract of the Snow Cover variable is pretty good. Again the blue is the actual size and orange is the calculated size.
4) This chart shows data for extracting every variable from 5 MOD10A1 files. Each variable is extracted individually. This chart gives some insight into the difficulty of predicting the size of an extracted variable: the compression rate varies greatly from one variable to another; some compress very well and others don't compress at all. The best 5 predictions were for a variable that was only 1 byte in size, so they should be ignored. The attached spreadsheet shows the details for this chart.
NOTE: this chart is using a logarithmic scale.
The above results for predicting netCDF4 output sizes were not good enough to use in a meaningful way. As a result we considered the idea of storing a "compression ratio" for each variable for each collection. The idea is that the compression ratio can be calculated by a DAAC operator by subsetting a representative sample of granules (for each variable of each collection) when the UMM-Var is created. Hopefully this can be automated to the point where the DAAC operator is only required to pick the representative sample of granules to work with and a script can access the granules and perform the necessary calculations.
Question: could the same representative set of granules be used for a large set of collections?
To evaluate the effectiveness of this approach we picked 6 MOD11B1 granules at LP DAAC to work with. The granules were from a mixture of MODIS tiles; some tiles were all land while other tiles contained a combination of land and ocean. We tried to find a good compression ratio for the LST Day and LST Night variables separately and then tried extracting both in a single request to see if the individual compression ratios would be effective in calculating the combined output.
The two approaches we used were (a sketch of both follows the list):
1) Subset all six granules and use the compression rate achieved in each request as the basis for predicting the output size of the others. The compression rate that most accurately predicted the output sizes for all requests was picked as the winner. Accuracy was measured as a combination of average and standard deviation; for example, 90,90,90,90,10 is not as good as 73,72,73,72,73. This corresponds to the idea of finding a granule that is "representative" of all the others. The table of results below shows that the representative approach makes one output prediction 100% accurate while the others have varying degrees of accuracy.
2) Subset all six granules and use the average compression rate of the six to predict the output size of each.
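A minimal sketch of both approaches, with illustrative names (the exact accuracy score combining average and standard deviation is not spelled out on this page, so the one below is only a stand-in):

```python
import statistics

def compression_rate(output_bytes, uncompressed_bytes):
    """Observed rate for one request: compressed output size / uncompressed size of the selected measurements."""
    return output_bytes / uncompressed_bytes

def representative_rate(samples):
    """Approach 1: pick the sampled rate that best predicts all the other outputs.
    samples: list of (actual_output_bytes, uncompressed_bytes) pairs, one per sample granule."""
    rates = [compression_rate(o, u) for o, u in samples]

    def score(candidate_rate):
        # Per-granule accuracy in percent (symmetric: predicted vs actual), combined as
        # a stand-in for the "average and standard deviation" measure described above.
        accuracies = [100 * min(candidate_rate * u, o) / max(candidate_rate * u, o) for o, u in samples]
        return statistics.mean(accuracies) - statistics.stdev(accuracies)

    return max(rates, key=score)

def average_rate(samples):
    """Approach 2: the average rate across all sample granules."""
    return statistics.mean(compression_rate(o, u) for o, u in samples)
```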
We repeated the above approach for extracting the Snow Cover variable from MOD10A1 at NSIDC.
The results below show that the "average of all" approach worked well for MOD11B1 while the "representative granule" approach worked better for MOD10A1. The conclusion from this brief exercise is that it might be difficult to find a representative granule for each collection, and the improvement of "average" over "representative" for MOD11B1 is greater than the improvement of "representative" over "average" for MOD10A1.
Here are the results:
You can see that granule 5 was always chosen as the "representative granule" for each variable and that it tended to perform poorly under the "average" method; this was purely a coincidence.
2nd priority spatial
Given that our first iterations of a size estimate service will restrict themselves to,
The values for a variable are expressed in a two-dimensional array. An individual cell of that array contains a value of that variable at a lat/lon point on the earth. If we know the bounds of that array we know the spatial extent of that granule. We can then determine how many cells in that array intersect with our spatial constraint (a lat/lon bounding box). If we know the size of that variable's data type we can compute the size of the product returned.
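A minimal sketch of that intersection count, assuming a regular lat/lon grid and illustrative names:

```python
# Count the grid cells that intersect a lat/lon bounding box and multiply by the
# data type size; grid extent and shape come from the granule/variable metadata.
def spatial_subset_bytes(grid_lat_min, grid_lat_max, grid_lon_min, grid_lon_max,
                         n_rows, n_cols, bbox, datatype_size):
    """bbox = (lat_min, lat_max, lon_min, lon_max) of the request."""
    lat_min, lat_max, lon_min, lon_max = bbox
    cell_h = (grid_lat_max - grid_lat_min) / n_rows
    cell_w = (grid_lon_max - grid_lon_min) / n_cols
    rows = max(0, min(int((lat_max - grid_lat_min) / cell_h) + 1, n_rows)
                  - max(int((lat_min - grid_lat_min) / cell_h), 0))
    cols = max(0, min(int((lon_max - grid_lon_min) / cell_w) + 1, n_cols)
                  - max(int((lon_min - grid_lon_min) / cell_w), 0))
    return rows * cols * datatype_size

# Example: a 1 degree box over a global 0.5 degree grid of 2-byte integers.
print(spatial_subset_bytes(-90, 90, -180, 180, 360, 720, (10, 11, 20, 21), 2))
```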
Trivial example:
In order to compute this size we need the following,
3rd priority temporal
The temporal specification is optional and mutually exclusive with the spatial specification.
Here the approach is to create a ratio of the temporal extent of the "time of interest" to the temporal extent of the granule and use it the same way the spatial ratio is used. The ratio can be applied to the size determined by the format and variable service parameters to arrive at a final size.
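For example, a sketch of the temporal ratio (illustrative names, times as datetimes):

```python
from datetime import datetime

def temporal_ratio(granule_start, granule_end, request_start, request_end):
    """Fraction of the granule's time range covered by the time of interest."""
    overlap_start = max(granule_start, request_start)
    overlap_end = min(granule_end, request_end)
    overlap = max((overlap_end - overlap_start).total_seconds(), 0.0)
    return overlap / (granule_end - granule_start).total_seconds()

# Apply the ratio to the size already estimated from the format and variable parameters.
ratio = temporal_ratio(datetime(2019, 1, 1), datetime(2019, 1, 2),
                       datetime(2019, 1, 1, 6), datetime(2019, 1, 1, 18))
print(ratio)  # 0.5 of the granule's one-day extent
```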
Still to do:
1) Come up with suitable compression factors for ASCII outputs; hopefully this will not be collection specific, but several conditions might affect it (for example: measurement datatype, fill value, etc.).
3) Determine tools for spatial computations, for example getting the area of a granule with a 1/2 orbit spatial extent. Is it possible to use Cartesian-based calculations?
Here is a spreadsheet with some of the data supporting this page.