This page outlines the algorithms that will be used for estimating the size of the output files generated when sending subsetting and re-formatting requests to the OPeNDAP server. The fundamental concept is that each processing option is considered in the estimate and contributes to the final output size. The process uses the UMM-C, UMM-Var(s) and UMM-G(s) records to obtain information about the granules being processed. It is anticipated that the algorithm will evolve as we gain more experience with the estimation process and more collections are supported.
Here is a summary of the experiments so far; this will be updated as more information becomes available. In the following, "measurements" refers to the actual number of measurements, which is controlled by the number of variables selected and the dimensions and size of those variables. We are not sure of the bounds for these measurements, but we saw a low end of one variable containing an array of a couple thousand 2-byte integers and a high end of well over 100,000 4-byte floating point values. The "metadata" refers to the granule or "inventory" metadata, sometimes called "core" metadata; some environments include the metadata in outputs and some do not. In all the test cases, if the metadata was present it was not compressed.
netCDF3
When netCDF3 output is specified, the output is approximately equal to the estimated size using the "dimensions" and "datatype" in the UMM-Var plus the size of the metadata (if present).
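A minimal sketch of that calculation, assuming illustrative field names for the UMM-Var dimensions and data type (not the actual UMM schema):

```python
from functools import reduce

DATATYPE_BYTES = {"uint8": 1, "int16": 2, "int32": 4, "float32": 4, "float64": 8}

def nc3_estimate(selected_vars, metadata_bytes=0):
    """Estimated size = sum over selected variables of (product of dimensions * datatype size) + metadata."""
    total = 0
    for var in selected_vars:
        n_values = reduce(lambda a, b: a * b, var["Dimensions"], 1)
        total += n_values * DATATYPE_BYTES[var["DataType"]]
    return total + metadata_bytes

# Example: one 1200 x 1200 grid of 16-bit integers plus roughly 60 KB of granule metadata.
print(nc3_estimate([{"Dimensions": [1200, 1200], "DataType": "int16"}], 60_000))
```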
netCDF4
When netCDF4 is specified, the output file has the "measurements" compressed. However, the "metadata" is not compressed and it may be missing altogether if the input granule is in HDF-EOS format.
If the number of measurements in a variable is large then the netCDF4 output size can be difficult to predict; it requires a prediction of the compression rate for the measurements. One approach is to compare the original file size to the sum of the sizes of all the variables to determine a compression rate. This approach has the advantage of being calculable from the UMM-G and UMM-Var metadata alone. Another approach is to modify the UMM-Var to also store a "compression rate" to be used for estimating output sizes. We tried both approaches and the results are detailed below.
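A minimal sketch of the first approach, using illustrative names (only the UMM-G granule size and the UMM-Var variable sizes are needed):

```python
def whole_file_compression_rate(granule_size_bytes, metadata_bytes, size_of_all_variables):
    """Approximate rate = (compressed granule size - metadata) / uncompressed size of all variables.
    size_of_all_variables excludes the latitude/longitude variables that OPeNDAP adds."""
    return (granule_size_bytes - metadata_bytes) / size_of_all_variables

def predict_nc4_bytes(selected_variable_bytes, rate):
    """Predicted netCDF4 output for a request: apply the rate to the selected measurements."""
    return selected_variable_bytes * rate
```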
Note: If the number of measurements in the output is small (one variable with a small grid) and they compress well, then the size of the output file can be similar to the other formats; in this case the metadata and HDF overhead account for the majority of the output size. If the original file is also netCDF4 then the output is much larger than expected and can easily be larger than the netCDF3 output.
ASCII
When ASCII is specified, the output file does not contain the metadata. The output size is controlled by the actual measurements. Unfortunately it is not easy to calculate the output size because the ASCII text can require either more or less space than the binary equivalent. For example, a binary 4-byte "real" could take 10 bytes of text ("-999.99999") or as little as one byte ("0").
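One possible way to derive an ASCII factor (a sampling idea, not something measured on this page) is to compute an average bytes-per-value from a small formatted sample:

```python
# Hypothetical sketch: format a small sample of measurements the way the server would
# (separator included) and take the average number of ASCII bytes per value.
def ascii_bytes_per_value(sample_values, separator=", "):
    text = separator.join(str(v) for v in sample_values)
    return len(text) / len(sample_values)

sample = [0, -999.99999, 1.5, 27.125]
print(ascii_bytes_per_value(sample))  # varies with the data and the formatting
```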
Binary
When binary is specified, the output is approximately equal to the estimated size using the "dimensions" and "datatype" that are in the UMM-Var.
Conclusion:
We could experiment with a formula-based approach. We could start with something like the following and add rules as new special cases are found (a code sketch of this formula appears after the definitions below):
If output = netCDF4 then
output size = SizeMBDataGranule * 1024 * 1024 * VariableCompressionRate
else
output size = (compression * measurements) + metadata
Where:
Variable Compression Rate:
Compression rate determined by sampling during UMM-Var creation
measurements:
measurements = (sum(dimensions * datatype_size)) for the selected variables
compression:
if output = [nc3 | binary] then compression = 1
if output = ascii then compression = TBD
if output = nc4 & original_data_file is hdf
then compression = "average compression ratio from the UMM-Var"
NOTE: metadata is 0 if the original_data_file is hdf-eos
NOTE: removed and left here only for continuity:
---- then compression = ((SizeMBDataGranule * 1024 * 1024) - metadata) / size_of_all_variables
if output = nc4 & original_data_file is .nc4 then compression = TBD – often larger than original
if output = nc4 & original_data_file is not .nc4 or hdf then compression = TBD – need to identify examples and come up with a workable assumption
NOTE: the XDim and YDim variables often do not compress well so it would be good to experiment with treating them separately from the other variables.
metadata:
if output = [ASCII | binary] then metadata = 0
if output = nc4 and input granule is hdf-eos then metadata = 0
else metadata will be measured from the size of the UMM-G
size_of_all_variables:
This is the sum of the dimensions times the data type size for all of the variables in the OPeNDAP "dds" and thus the UMM-Var. However, this sum should not include the sizes for variables latitude and longitude; these two variables are created by OPeNDAP when processing the granule and don't actually appear in the original granule file.
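Here is a minimal Python sketch of the formula above; the field names and the TBD placeholders are illustrative, not part of the UMM schema:

```python
import math

def estimate_output_size(output_format, selected_vars, granule,
                         variable_compression_rate=None, ascii_factor=None):
    """Sketch of the formula above (illustrative field names, not the UMM schema).

    selected_vars: list of dicts with 'Dimensions' (list of ints) and 'DataTypeSize' (bytes)
    granule:       dict with 'SizeMBDataGranule', 'MetadataBytes' and 'OriginalFormat'
    """
    measurements = sum(math.prod(v["Dimensions"]) * v["DataTypeSize"] for v in selected_vars)

    if output_format == "nc4":
        # netCDF4: apply the sampled variable compression rate to the original granule size.
        # Metadata is not added here; it is 0 when the original file is HDF-EOS.
        return granule["SizeMBDataGranule"] * 1024 * 1024 * variable_compression_rate
    if output_format == "nc3":
        return measurements + granule["MetadataBytes"]    # compression = 1
    if output_format == "binary":
        return measurements                               # compression = 1, metadata = 0
    if output_format == "ascii":
        return measurements * ascii_factor                # ascii_factor is the TBD factor, metadata = 0
    raise ValueError(f"unknown output format: {output_format}")
```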
The following table details the size estimate algorithm for each case:
| Format | Algorithm | Notes |
|---|---|---|
| NC3 | ∑ (data type size * no. of variable values) + (size of UMM-G record) | |
| NC4 (original HDF-EOS) | ∑ (VCR * (SizeMBDataGranule * 1024 * 1024)) | VCR (variable compression rate) will be variable dependent and defined in the appropriate UMM-Var. It will be the 'average' compression rate for a variable based on a sampled subset of a small number of granules. |
| NC4 (original not NC4 or HDF-EOS) | ∑ (VCR * (SizeMBDataGranule * 1024 * 1024)) + (size of UMM-G record) | Further work is required to determine whether the same variable compression rate will work for this case; for example, what if the original file is not compressed? |
| ASCII | TBD | Metadata size is zero |
| Binary | ∑ (TBD * (data type size * no. of variable values)) | Compression rate for each variable is TBD |
The following graphs show the prediction accuracy for netCDF4 extractions from three popular collections when a ratio of the "size of all variables" to "the size of the granule" is used to predict the output file size. Again, this has the advantage that it can be done using the current UMM-Var and UMM-G definitions. All data was extracted as netCDF4 output from HDF-EOS files:
1) This graph shows pulling 2 variables from MOD11B1. This is the most popular type of request at LP DAAC. The predicted size usually seems to be smaller than the actual output; we believe this is because the variables being extracted don't compress as well as other variables in the file, so our calculated compression rate is "better" than what is actually achieved in practice. Further testing should determine whether this relationship extends to other collections. Extracting all the variables in the file in one request, which is also a popular option, might offset this issue with the calculated compression rate. There is also considerable variation in the compression based on tile location within the MODIS grid.
2) This shows 2 variables extracted from MOD09GA. Again the predicted size is often about 1/2 of the actual.
3) Extracting one variable from MOD10A1 is the most popular request at NSIDC and the accuracy for predicting the extract of the Snow Cover variable is pretty good. Again the blue is the actual size and orange is the calculated size.
4) This chart shows data for extracting every variable from 5 MOD10A1 files. Each variable is extracted individually. This chart gives some insight into the difficulty of predicting the size of an extracted variable: the compression rate varies greatly from one variable to another; some compress very well and others don't compress at all. The best 5 predictions were for a variable that was only 1 byte in size, so they should be ignored. The attached spreadsheet shows the details for this chart.
NOTE: this chart is using a logarithmic scale.
The above results for predicting netCDF4 output sizes were not good enough to use in a meaningful way. As a result we considered the idea of storing a "compression ratio" for each variable for each collection. The idea is that the compression ratio can be calculated by a DAAC operator by subsetting a representative sample of granules (for each variable of each collection) when the UMM-Var is created. Hopefully this can be automated to the point where the DAAC operator is only required to pick the representative sample of granules to work with and a script can access the granules and perform the necessary calculations.
Question: could the same representative set of granules be used for a large set of collections?
To evaluate the effectiveness of this approach we picked 6 MOD11B1 granules at LP DAAC to work with. The granules were from a mixture of MODIS tiles; some tiles were all land while other tiles contained a combination of land and ocean. We tried to find a good compression ratio for the LST Day and LST Night variables separately and then tried extracting both in a single request to see if the individual compression ratios would be effective in calculating the combined output.
The two approaches we used were (a sketch of both follows the list):
1) Subset all six granules and use the compression rate achieved in each request as the basis for predicting the output size of the others. The compression rate that most accurately predicted the output sizes for all requests was picked as the winner. Accuracy was measured as a combination of average and standard deviation; for example, 90,90,90,90,10 is not as good as 73,72,73,72,73. This corresponds to the idea of finding a granule that is "representative" of all the others. The table of results below shows that the representative approach makes one output prediction 100% accurate while the others have varying degrees of accuracy.
2) Subset all six granules and use the average compression rate of the six to predict the output size of each.
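A minimal sketch of both approaches, with illustrative names (the exact accuracy score combining average and standard deviation is not spelled out on this page, so the one below is only a stand-in):

```python
import statistics

def compression_rate(output_bytes, uncompressed_bytes):
    """Observed rate for one request: compressed output size / uncompressed size of the selected measurements."""
    return output_bytes / uncompressed_bytes

def representative_rate(samples):
    """Approach 1: pick the sampled rate that best predicts all the other outputs.
    samples: list of (actual_output_bytes, uncompressed_bytes) pairs, one per sample granule."""
    rates = [compression_rate(o, u) for o, u in samples]

    def score(candidate_rate):
        # Per-granule accuracy in percent (symmetric: predicted vs actual), combined as
        # a stand-in for the "average and standard deviation" measure described above.
        accuracies = [100 * min(candidate_rate * u, o) / max(candidate_rate * u, o) for o, u in samples]
        return statistics.mean(accuracies) - statistics.stdev(accuracies)

    return max(rates, key=score)

def average_rate(samples):
    """Approach 2: the average rate across all sample granules."""
    return statistics.mean(compression_rate(o, u) for o, u in samples)
```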
We repeated the above approach for extracting the Snow Cover variable from MOD10A1 at NSIDC.
The results below show that the "average of all" approach worked well for MOD11B1 while the "representative granule" approach worked better for MOD10A1. The conclusion from this brief exercise is that it might be difficult to find a representative granule for each collection, and the improvement of "average" over "representative" for MOD11B1 is greater than the improvement of "representative" over "average" for MOD10A1.
Here are the results:
You can see that granule 5 was always chosen as the "representative granule" for each variable and that it tended to perform poorly under the "average" method; this was purely a coincidence.
2nd priority spatial
Given that our first iterations of a size estimate service will restrict themselves to,
The values for a variable are expressed in a two-dimensional array. An individual cell of that array contains a value of that variable at a lat/lon point on the earth. If we know the bounds of that array we know the spatial extent of that granule. We can then determine how many cells in that array intersect with our spatial constraint (a lat/lon bounding box). If we know the size of that variable's data type we can compute the size of the product returned.
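A minimal sketch of that intersection count, assuming a regular lat/lon grid and illustrative names:

```python
# Count the grid cells that intersect a lat/lon bounding box and multiply by the
# data type size; grid extent and shape come from the granule/variable metadata.
def spatial_subset_bytes(grid_lat_min, grid_lat_max, grid_lon_min, grid_lon_max,
                         n_rows, n_cols, bbox, datatype_size):
    """bbox = (lat_min, lat_max, lon_min, lon_max) of the request."""
    lat_min, lat_max, lon_min, lon_max = bbox
    cell_h = (grid_lat_max - grid_lat_min) / n_rows
    cell_w = (grid_lon_max - grid_lon_min) / n_cols
    rows = max(0, min(int((lat_max - grid_lat_min) / cell_h) + 1, n_rows)
                  - max(int((lat_min - grid_lat_min) / cell_h), 0))
    cols = max(0, min(int((lon_max - grid_lon_min) / cell_w) + 1, n_cols)
                  - max(int((lon_min - grid_lon_min) / cell_w), 0))
    return rows * cols * datatype_size

# Example: a 1 degree box over a global 0.5 degree grid of 2-byte integers.
print(spatial_subset_bytes(-90, 90, -180, 180, 360, 720, (10, 11, 20, 21), 2))
```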
Trivial example:
In order to compute this size we need the following,
3rd priority temporal
The temporal specification is optional and mutually exclusive with the spatial specification.
Here the approach is to create a ratio of the temporal extent of the "time of interest" to the temporal extent of the granule and use it the same way the spatial ratio is used. The ratio can be applied to the size determined by the format and variable service parameters to arrive at a final size.
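For example, a sketch of the temporal ratio (illustrative names, times as datetimes):

```python
from datetime import datetime

def temporal_ratio(granule_start, granule_end, request_start, request_end):
    """Fraction of the granule's time range covered by the time of interest."""
    overlap_start = max(granule_start, request_start)
    overlap_end = min(granule_end, request_end)
    overlap = max((overlap_end - overlap_start).total_seconds(), 0.0)
    return overlap / (granule_end - granule_start).total_seconds()

# Apply the ratio to the size already estimated from the format and variable parameters.
ratio = temporal_ratio(datetime(2019, 1, 1), datetime(2019, 1, 2),
                       datetime(2019, 1, 1, 6), datetime(2019, 1, 1, 18))
print(ratio)  # 0.5 of the granule's one-day extent
```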
Still to do:
1) Come up with suitable compression factors for ASCII outputs; hopefully this will not be collection specific, but several conditions might affect it (for example: measurement datatype, fill value, etc.).
3) Determine tools for spatial computations, for example getting the area of a granule with a 1/2 orbit spatial extent. Is it possible to use Cartesian-based calculations?
Here is a spreadsheet with some of the data supporting this page.