It is possible to serve large HDF4 data (a 720 MB CERES granule) directly from S3 using s3fs.
Turn off the s3fs cache when you mount.
# ec2-user is uid 1000 on default Amazon Linux.
$ s3fs sdt-data -o allow_other -o uid=1000 -o mp_umask=022 -o multireq_max=5 /home/ec2-user/hyrax/build/share/hyrax/data/hdf4
# The following example mounts the MRF bucket for GeoServer.
$ s3fs ceres-mrf -o allow_other -o uid=1000 -o gid=1000 -o mp_umask=022 -o multireq_max=5 /home/ec2-user/geoserver-2.15.2/data_dir/data/ceres-mrf -o use_cache=/tmp
# You can also specify a path after the bucket name. On CentOS, uid:gid=1001 is centos:centos.
$ s3fs sdt-data:/ceres/SYN1deg-1Hour/Terra-Aqua-MODIS_Edition4A/2017/01 -o allow_other -o uid=1001 -o gid=1001 -o mp_umask=022 -o multireq_max=5 /usr/share/hyrax/data/hdf4
# Use cache to boost performance.
$ s3fs sdt-data:/ceres/SYN1deg-1Hour/Terra-Aqua-MODIS_Edition4A/2017/01 -o allow_other -o uid=1001 -o gid=1001 -o mp_umask=022 -o multireq_max=5 /usr/share/hyrax/data/hdf4 -o use_cache=/tmp
Give the right permissions on the HDF files:
$ chmod go+r /usr/share/hyrax/data/hdf4/*.hdf
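Before pointing Hyrax at the mount, it can help to verify that every granule is actually world-readable, since Hyrax runs under a different user. A minimal sketch in Python; the directory path follows the examples above and the helper name is ours:

```python
import os
import stat

def unreadable_hdf_files(directory):
    """Return HDF files that lack the world-read bit (the o+r in chmod go+r)."""
    bad = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".hdf"):
            continue
        mode = os.stat(os.path.join(directory, name)).st_mode
        if not mode & stat.S_IROTH:  # "other" read permission missing
            bad.append(name)
    return bad

if __name__ == "__main__":
    data_dir = "/usr/share/hyrax/data/hdf4"
    if os.path.isdir(data_dir):
        print(unreadable_hdf_files(data_dir))
```

Any file this reports will produce a server-side read error in Hyrax even though the s3fs mount itself looks healthy.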
Running make clean in ~/hyrax/build will free up some space.
Use CentOS 7 and RPM installation to save more space.
NcML on S3 works if permissions are set correctly, which means you can mount multiple buckets.
We put Hyrax behind a load balancer with a minimum of 1 and a maximum of 5 instances. When 10 CERES granules (1/22/2017 ~ 1/31/2017) were processed simultaneously, only 2 (1/23 and 1/31) succeeded. CMR-to-VRT generation succeeded for only 1 granule (1/23).
Some errors are Gateway errors caused by the short default timeout in the AWS load balancer, which is 60 seconds. If you increase it to 900 seconds to match Hyrax's bes.conf, the remaining errors mostly come from the Hyrax server itself.
If you turn on caching in the s3fs mount options, Hyrax simply doesn't work and fails with the following message:
context: Error { code = 500; message = "HDF4 SDstart error for the file /usr/share/hyrax/data/hdf4/CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_406406.20170120.hdf. It is very possible that this file is not an HDF4 file.";}^
If you increase the capacity of your instance, the speed-up is easy to notice. The following test slices a CERES granule by making 24 requests through the netCDF API.
Instance | Time in seconds (minutes) | Cost |
---|---|---|
t1.micro | 2998.74 (49) | $0.01 per hour |
t2.2xlarge | 399.21 (6) | $0.47 per hour |
Thus, t1.micro is not suitable for AWS Lambda, which must finish within 15 minutes. In the cloud, you get what you pay for: an 8x speed-up costs 47x more. For 10 granules, t2.2xlarge costs $0.47 and finishes in 1 hour, while t1.micro costs $0.10 and finishes in 9 hours. In other words, an extra $0.37 saves 8 hours.
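The 24-request slicing test above can be sketched as follows. The endpoint URL and the variable name are hypothetical stand-ins for the actual Hyrax address and CERES field used by test_hyrax.py; only the one-request-per-hour slicing pattern is the point:

```python
import time

# Hypothetical Hyrax endpoint and variable name; substitute the real ones.
URL = "http://localhost:8080/opendap/data/hdf4/CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_406406.20170120.hdf"
VARIABLE = "some_hourly_variable"

def slice_requests(n_hours=24):
    """One (start, stop) slice per hour, so each read becomes a separate DAP request."""
    return [(h, h + 1) for h in range(n_hours)]

def run_benchmark(url=URL, variable=VARIABLE):
    """Time 24 sliced reads of an hourly variable over OPeNDAP."""
    from netCDF4 import Dataset  # netCDF4 can open a DAP URL directly
    start = time.time()
    dataset = Dataset(url)
    var = dataset.variables[variable]
    for begin, end in slice_requests():
        _ = var[begin:end, ...]  # each slice triggers one request to Hyrax
    dataset.close()
    return time.time() - start

if __name__ == "__main__":
    print("elapsed:", run_benchmark(), "seconds")
```

Because each hourly slice is fetched separately, the elapsed time is dominated by per-request server latency, which is why the instance size matters so much in the table above.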
Since the 1/24 granule caused an error, we tested with the 1/25 ~ 1/29 granules. The test result indicates that Hyrax+s3fs is fairly reliable.
Instance | Time in seconds (minutes) |
---|---|
t1.micro | |
t2.2xlarge | 1924.73 (32) |
Instance | Time in seconds |
---|---|
t1.micro | |
t2.2xlarge | n/a |
t2.2xlarge failed after processing 4 granules from 1/20 to 1/23:
http://54.164.64.119:8080/opendap/data/ncml_s3/CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_406406.20170124.hdf.ncml
syntax error, unexpected $end, expecting ';'
context: Error { code = 500; message = "HDF4 SDstart error for the file /usr/share/hyrax/data/hdf4/CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_406406.20170124.hdf. It is very possible that this file is not an HDF4 file.";}^
Traceback (most recent call last):
  File "test_hyrax.py", line 21, in <module>
    dataset = Dataset(url)
  File "netCDF4/_netCDF4.pyx", line 2135, in netCDF4._netCDF4.Dataset.__init__
  File "netCDF4/_netCDF4.pyx", line 1752, in netCDF4._netCDF4._ensure_nc_success
OSError: [Errno -70] NetCDF: DAP server error: b'http://54.164.64.119:8080/opendap/data/ncml_s3/CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_406406.20170124.hdf.ncml'
The error above was caused by file corruption.
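A corrupted granule like this can often be caught before Hyrax touches it by checking the HDF4 magic number: every HDF4 file begins with the four bytes 0x0E 0x03 0x13 0x01. A minimal sketch:

```python
HDF4_MAGIC = b"\x0e\x03\x13\x01"  # first four bytes of every HDF4 file

def is_hdf4(path):
    """Cheap corruption screen: True only if the file starts with the HDF4 magic number."""
    with open(path, "rb") as f:
        return f.read(4) == HDF4_MAGIC
```

This only catches header damage; a file corrupted mid-stream still passes, so SDstart remains the authoritative check.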
Time was measured against a t1.micro Hyrax server. A faster network within the same region doesn't help if the server is the bottleneck.
Network | Time in seconds (minutes) | Note |
---|---|---|
Internet | 2998.74 (49) | Client: Mac OS X at The HDF Group |
Private | 3476.60 (58) | Client: t1.micro |
The EFS test below was done on a t1.micro instance.
EFS | Time in seconds (minutes) | Cost |
---|---|---|
No | 2998.74 (49) | $0.01 per hour |
Yes | 33.92 (0.56) | Standard storage: $0.30/GB per month; throughput: $6.00 per MB/s-month |