It is possible to serve large HDF4 data (720 MB CERES granules) directly from S3 using s3fs.

Step-by-step guide

  1. Turn off the s3fs cache when you mount; Hyrax does not work with -o use_cache (see "The Effect of s3fs Caching" below).

    # ec2-user is uid 1000 on the default Amazon Linux AMI.
    $s3fs sdt-data -o allow_other -o uid=1000 -o mp_umask=022 -o multireq_max=5 /home/ec2-user/hyrax/build/share/hyrax/data/hdf4
    
    # The following example mounts the MRF bucket for GeoServer.
    $s3fs ceres-mrf -o allow_other -o uid=1000 -o gid=1000 -o mp_umask=022 -o multireq_max=5 /home/ec2-user/geoserver-2.15.2/data_dir/data/ceres-mrf -o use_cache=/tmp
    
    # You can also specify a path after the bucket name. On CentOS, uid:gid=1001 is centos:centos.
    $s3fs sdt-data:/ceres/SYN1deg-1Hour/Terra-Aqua-MODIS_Edition4A/2017/01 -o allow_other -o uid=1001 -o gid=1001 -o mp_umask=022 -o multireq_max=5 /usr/share/hyrax/data/hdf4
    
    # Use cache to boost performance (note: caching breaks Hyrax; see "The Effect of s3fs Caching" below).
    $s3fs sdt-data:/ceres/SYN1deg-1Hour/Terra-Aqua-MODIS_Edition4A/2017/01 -o allow_other -o uid=1001 -o gid=1001 -o mp_umask=022 -o multireq_max=5 /usr/share/hyrax/data/hdf4 -o use_cache=/tmp
    
  2. Turn off the MDS (Metadata Store) in Hyrax.
  3. Give the right permissions on the HDF files (a quick sanity check is sketched after this list).

    $chmod go+r /usr/share/hyrax/data/hdf4/*.hdf
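
A quick way to confirm the mount and permissions before starting Hyrax (a minimal sanity check; the mount point and paths are the ones used in the examples above and may differ in your setup):

    # Confirm that the bucket is mounted through s3fs.
    $mount | grep s3fs

    # Confirm that the granules are world-readable, since the BES does not run as the file owner.
    $ls -l /usr/share/hyrax/data/hdf4/*.hdf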

Running make clean in ~/hyrax/build will free up some space.
Use CentOS 7 and the RPM installation to save more space.
NcML on S3 works if the permissions are set correctly, which means that you can mount different buckets.
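
For example, a separate bucket holding NcML files can be mounted next to the HDF4 mount (a sketch; the bucket name ncml-bucket is hypothetical, and ncml_s3 matches the directory name that appears in the test URLs below):

    # Mount a second bucket under the same Hyrax data root.
    $s3fs ncml-bucket -o allow_other -o uid=1001 -o gid=1001 -o mp_umask=022 -o multireq_max=5 /usr/share/hyrax/data/ncml_s3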

Performance Test

Test Setting

  • t2.micro: 1 vCPU and 1 GB memory
  • t2.2xlarge: 8 vCPUs and 32 GB memory
  • Hyrax 1.15.4 / CentOS 7 x86_64
  • AWS region: us-east-1

The Effect of a Load Balancer and Autoscaling

  We put Hyrax behind a load balancer with a minimum of 1 and a maximum of 5 instances. When 10 CERES granules (1/22/2017 ~ 1/31/2017) were processed simultaneously, only 2 (1/23 and 1/31) succeeded. CMR-to-VRT generation succeeded for only 1 (1/23).

  Some of the errors are Gateway Timeout errors caused by the short default idle timeout of the AWS load balancer, which is 60 seconds. If you increase it to 900 seconds to match the Hyrax bes.conf timeout, the remaining errors come mostly from the Hyrax server itself.
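
  If the load balancer is an Application Load Balancer, the idle timeout can be raised with the AWS CLI (a sketch; the load balancer ARN is a placeholder for your own):

    # Raise the ALB idle timeout from the 60-second default to 900 seconds.
    $aws elbv2 modify-load-balancer-attributes --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/hyrax-lb/0123456789abcdef --attributes Key=idle_timeout.timeout_seconds,Value=900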

The Effect of s3fs Caching

  If you turn on caching in the s3fs mount options (-o use_cache), Hyrax simply does not work and returns the following message:

context: Error {  code = 500; message = "HDF4 SDstart error for the file /usr/share/hyrax/data/hdf4/CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_406406.20170120.hdf. It is very possible that this file is not an HDF4 file.";}^
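
  To recover, unmount and remount without -o use_cache (a sketch, assuming the mount from step 1 and the /tmp cache directory; s3fs normally keeps cached objects under a bucket-named subdirectory of the cache path):

    # Unmount, drop any stale cached copies, and remount without use_cache.
    $fusermount -u /usr/share/hyrax/data/hdf4
    $rm -rf /tmp/sdt-data
    $s3fs sdt-data:/ceres/SYN1deg-1Hour/Terra-Aqua-MODIS_Edition4A/2017/01 -o allow_other -o uid=1001 -o gid=1001 -o mp_umask=022 -o multireq_max=5 /usr/share/hyrax/data/hdf4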

The Effect of Vertical Scaling

  If you increase the capacity of your instance, the speed-up is easy to notice. The following test slices a CERES granule by making 24 requests through the netCDF API; a rough shell equivalent of the request pattern is sketched below.
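
  The actual test used a small netCDF4-python script (test_hyrax.py, visible in the error log further down); the sketch below makes the same kind of hourly subset requests through Hyrax's netCDF (.nc) response, with VAR_NAME and the [hour][lat][lon] layout standing in for the real CERES variable:

    # 24 hourly subset requests against one granule; curl -g disables URL globbing
    # so that the DAP constraint brackets are passed through literally.
    $URL=http://54.164.64.119:8080/opendap/data/ncml_s3/CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_406406.20170125.hdf.ncml
    $for h in $(seq 0 23); do curl -g -s -o slice_$h.nc "$URL.nc?VAR_NAME[$h][0:179][0:359]"; done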

1 Granule 24 Slices

Instance      Time in seconds (minutes)    Cost
t2.micro      2998.74 (49)                 $0.01 per hour
t2.2xlarge    399.21 (6)                   $0.47 per hour

Thus, the t2.micro is not suitable for workflows driven by AWS Lambda, which must finish within 15 minutes. In the cloud, you get what you pay for: for an 8x speed-up, you pay 47x more per hour. For 10 granules, the t2.2xlarge costs $0.47 and finishes in about 1 hour, while the t2.micro costs about $0.10 and finishes in about 9 hours. Thus, an extra $0.37 saves about 8 hours.

5 Granules 24 Slices

Since the 1/24 granule caused an error, we tested the 1/25 ~ 1/29 granules. The result indicates that Hyrax + s3fs is fairly reliable.

Instance      Time in seconds (minutes)
t2.micro      –
t2.2xlarge    1924.73 (32)

10 Granules 24 Slices

Instance      Time in seconds
t2.micro      –
t2.2xlarge    n/a


The t2.2xlarge run failed after processing 4 granules, from 1/20 to 1/23:

http://54.164.64.119:8080/opendap/data/ncml_s3/CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_406406.20170124.hdf.ncml
syntax error, unexpected $end, expecting ';'
context: Error {  code = 500; message = "HDF4 SDstart error for the file /usr/share/hyrax/data/hdf4/CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_406406.20170124.hdf. It is very possible that this file is not an HDF4 file.";}^
Traceback (most recent call last):
  File "test_hyrax.py", line 21, in <module>
    dataset = Dataset(url)
  File "netCDF4/_netCDF4.pyx", line 2135, in netCDF4._netCDF4.Dataset.__init__
  File "netCDF4/_netCDF4.pyx", line 1752, in netCDF4._netCDF4._ensure_nc_success
OSError: [Errno -70] NetCDF: DAP server error: b'http://54.164.64.119:8080/opendap/data/ncml_s3/CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_406406.20170124.hdf.ncml'

The above error is caused by file corruption.
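
One quick way to check whether the granule seen through the mount is still readable as HDF4 is the hdp utility from the HDF4 command-line tools (a sketch, assuming the tools are installed on the server):

    # Dump the SDS headers; a corrupted or truncated file fails here as well.
    $hdp dumpsds -h /usr/share/hyrax/data/hdf4/CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_406406.20170124.hdf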

The Effect of Private Network

Time is measured against a t2.micro Hyrax server. A faster network within the same region doesn't help if the server is the bottleneck.

Network       Time in seconds (minutes)    Note
Internet      2998.74 (49)                 Mac OS X client at The HDF Group
Private       3476.60 (58)                 t2.micro client

The Effect of Elastic File System

1 Granule 24 Slices

The test was done on the t2.micro instance.

EFS    Time in seconds (minutes)    Cost
No     2998.74 (49)                 $0.01 per hour
Yes    33.92 (0.56)                 Standard Storage: $0.30/GB per month; Throughput: $6 per MB/s-month
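
For reference, EFS is mounted over NFS rather than through s3fs; a minimal sketch of mounting it at the data directory (fs-12345678 is a placeholder file system ID, and the options used in the actual test may have differed):

    # Mount EFS at the Hyrax data directory in place of the s3fs mount.
    $sudo mount -t nfs4 -o nfsvers=4.1 fs-12345678.efs.us-east-1.amazonaws.com:/ /usr/share/hyrax/data/hdf4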

References

  1. https://github.com/s3fs-fuse/s3fs-fuse