It is possible to serve large HDF4 data (720 MB CERES granules) directly from S3 using s3fs.

Step-by-step guide

  1. Turn off the s3fs cache when you mount; Hyrax does not work with -o use_cache (see "The Effect of s3fs Caching" below).

    # ec2-user is uid 1000 on the default Amazon Linux AMI.
    $s3fs sdt-data -o allow_other -o uid=1000 -o mp_umask=022 -o multireq_max=5 /home/ec2-user/hyrax/build/share/hyrax/data/hdf4
    
    # The following example mounts the MRF bucket for GeoServer.
    $s3fs ceres-mrf -o allow_other -o uid=1000 -o gid=1000 -o mp_umask=022 -o multireq_max=5 /home/ec2-user/geoserver-2.15.2/data_dir/data/ceres-mrf -o use_cache=/tmp
    
    # You can also specify a path after the bucket name. On CentOS, uid:gid=1001 is centos:centos.
    $s3fs sdt-data:/ceres/SYN1deg-1Hour/Terra-Aqua-MODIS_Edition4A/2017/01 -o allow_other -o uid=1001 -o gid=1001 -o mp_umask=022 -o multireq_max=5 /usr/share/hyrax/data/hdf4
    
    # Use cache to boost performance (note: caching breaks Hyrax; see "The Effect of s3fs Caching" below).
    $s3fs sdt-data:/ceres/SYN1deg-1Hour/Terra-Aqua-MODIS_Edition4A/2017/01 -o allow_other -o uid=1001 -o gid=1001 -o mp_umask=022 -o multireq_max=5 /usr/share/hyrax/data/hdf4 -o use_cache=/tmp
    
  2. Turn off the MDS (Metadata Store) in Hyrax.
  3. Give the right permissions on the HDF files (a quick sanity check is sketched after this list).

    $chmod go+r /usr/share/hyrax/data/hdf4/*.hdf
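
A quick way to confirm the mount and permissions before starting Hyrax (a minimal sanity check; the mount point and paths are the ones used in the examples above and may differ in your setup):

    # Confirm that the bucket is mounted through s3fs.
    $mount | grep s3fs

    # Confirm that the granules are world-readable, since the BES does not run as the file owner.
    $ls -l /usr/share/hyrax/data/hdf4/*.hdf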

Running make clean in ~/hyrax/build will free up some space.
Use CentOS 7 and the RPM installation to save more space.
NcML on S3 works if the permissions are set correctly, which means that you can mount different buckets.
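
For example, a separate bucket holding NcML files can be mounted next to the HDF4 mount (a sketch; the bucket name ncml-bucket is hypothetical, and ncml_s3 matches the directory name that appears in the test URLs below):

    # Mount a second bucket under the same Hyrax data root.
    $s3fs ncml-bucket -o allow_other -o uid=1001 -o gid=1001 -o mp_umask=022 -o multireq_max=5 /usr/share/hyrax/data/ncml_s3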

Performance Test

Test Setting

  • t2.micro: 1 vCPU and 1 GB memory
  • t2.2xlarge: 8 vCPUs and 32 GB memory
  • Hyrax 1.15.4 / CentOS 7 x86_64
  • AWS region: us-east-1

The Effect of a Load Balancer and Autoscaling

  We put Hyrax behind a load balancer with a minimum of 1 and a maximum of 5 instances. When 10 CERES granules (1/22/2017 ~ 1/31/2017) were processed simultaneously, only 2 (1/23 and 1/31) succeeded. CMR-to-VRT generation succeeded for only 1 (1/23).

  Some of the errors are Gateway Timeout errors caused by the short default idle timeout of the AWS load balancer, which is 60 seconds. If you increase it to 900 seconds to match the Hyrax bes.conf timeout, the remaining errors come mostly from the Hyrax server itself.
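
  If the load balancer is an Application Load Balancer, the idle timeout can be raised with the AWS CLI (a sketch; the load balancer ARN is a placeholder for your own):

    # Raise the ALB idle timeout from the 60-second default to 900 seconds.
    $aws elbv2 modify-load-balancer-attributes --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/hyrax-lb/0123456789abcdef --attributes Key=idle_timeout.timeout_seconds,Value=900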

The Effect of s3fs Caching

  If you turn on caching in the s3fs mount options (-o use_cache), Hyrax simply does not work and returns the following message:

context: Error {  code = 500; message = "HDF4 SDstart error for the file /usr/share/hyrax/data/hdf4/CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_406406.20170120.hdf. It is very possible that this file is not an HDF4 file.";}^
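
  To recover, unmount and remount without -o use_cache (a sketch, assuming the mount from step 1 and the /tmp cache directory; s3fs normally keeps cached objects under a bucket-named subdirectory of the cache path):

    # Unmount, drop any stale cached copies, and remount without use_cache.
    $fusermount -u /usr/share/hyrax/data/hdf4
    $rm -rf /tmp/sdt-data
    $s3fs sdt-data:/ceres/SYN1deg-1Hour/Terra-Aqua-MODIS_Edition4A/2017/01 -o allow_other -o uid=1001 -o gid=1001 -o mp_umask=022 -o multireq_max=5 /usr/share/hyrax/data/hdf4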

The Effect of Vertical Scaling

  If you increase the capacity of your instance, the speed-up is easy to notice. The following test slices a CERES granule by making 24 requests through the netCDF API; a rough shell equivalent of the request pattern is sketched below.
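
  The actual test used a small netCDF4-python script (test_hyrax.py, visible in the error log further down); the sketch below makes the same kind of hourly subset requests through Hyrax's netCDF (.nc) response, with VAR_NAME and the [hour][lat][lon] layout standing in for the real CERES variable:

    # 24 hourly subset requests against one granule; curl -g disables URL globbing
    # so that the DAP constraint brackets are passed through literally.
    $URL=http://54.164.64.119:8080/opendap/data/ncml_s3/CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_406406.20170125.hdf.ncml
    $for h in $(seq 0 23); do curl -g -s -o slice_$h.nc "$URL.nc?VAR_NAME[$h][0:179][0:359]"; done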

1 Granule 24 Slices

Instance      Time in seconds (minutes)    Cost
t2.micro      2998.74 (49)                 $0.01 per hour
t2.2xlarge    399.21 (6)                   $0.47 per hour

Thus, the t2.micro is not suitable for workflows driven by AWS Lambda, which must finish within 15 minutes. In the cloud, you get what you pay for: for an 8x speed-up, you pay 47x more per hour. For 10 granules, the t2.2xlarge costs $0.47 and finishes in about 1 hour, while the t2.micro costs about $0.10 and finishes in about 9 hours. Thus, an extra $0.37 saves about 8 hours.

5 Granules 24 Slices

Since the 1/24 granule caused an error, we tested the 1/25 ~ 1/29 granules. The result indicates that Hyrax + s3fs is fairly reliable.

Instance      Time in seconds (minutes)
t2.micro      –
t2.2xlarge    1924.73 (32)

10 Granules 24 Slices

Instance      Time in seconds
t2.micro      –
t2.2xlarge    n/a


The t2.2xlarge run failed after processing 4 granules, from 1/20 to 1/23:

http://54.164.64.119:8080/opendap/data/ncml_s3/CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_406406.20170124.hdf.ncml
syntax error, unexpected $end, expecting ';'
context: Error {  code = 500; message = "HDF4 SDstart error for the file /usr/share/hyrax/data/hdf4/CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_406406.20170124.hdf. It is very possible that this file is not an HDF4 file.";}^
Traceback (most recent call last):
  File "test_hyrax.py", line 21, in <module>
    dataset = Dataset(url)
  File "netCDF4/_netCDF4.pyx", line 2135, in netCDF4._netCDF4.Dataset.__init__
  File "netCDF4/_netCDF4.pyx", line 1752, in netCDF4._netCDF4._ensure_nc_success
OSError: [Errno -70] NetCDF: DAP server error: b'http://54.164.64.119:8080/opendap/data/ncml_s3/CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_406406.20170124.hdf.ncml'

The above error is caused by file corruption.
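
One quick way to check whether the granule seen through the mount is still readable as HDF4 is the hdp utility from the HDF4 command-line tools (a sketch, assuming the tools are installed on the server):

    # Dump the SDS headers; a corrupted or truncated file fails here as well.
    $hdp dumpsds -h /usr/share/hyrax/data/hdf4/CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_406406.20170124.hdf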

The Effect of Private Network

Time is measured against a t2.micro Hyrax server. A faster network within the same region doesn't help if the server is the bottleneck.

Network       Time in seconds (minutes)    Note
Internet      2998.74 (49)                 Mac OS X client at The HDF Group
Private       3476.60 (58)                 t2.micro client

The Effect of Elastic File System

1 Granule 24 Slices

The test was done on the t2.micro instance.

EFS    Time in seconds (minutes)    Cost
No     2998.74 (49)                 $0.01 per hour
Yes    33.92 (0.56)                 Standard Storage: $0.30/GB per month; Throughput: $6 per MB/s-month
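
For reference, EFS is mounted over NFS rather than through s3fs; a minimal sketch of mounting it at the data directory (fs-12345678 is a placeholder file system ID, and the options used in the actual test may have differed):

    # Mount EFS at the Hyrax data directory in place of the s3fs mount.
    $sudo mount -t nfs4 -o nfsvers=4.1 fs-12345678.efs.us-east-1.amazonaws.com:/ /usr/share/hyrax/data/hdf4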

References

  1. https://github.com/s3fs-fuse/s3fs-fuse