
Adds Amazon AWS S3 support #268

Open Lucas-Mc opened this pull request 3 years ago • 1 comment

This change adds support for data stored in Amazon AWS S3 buckets. For the moment, anywhere there is a pn_dir parameter in a function, the user can instead provide the S3 URI from which to find the desired input file. The URI should always begin with 's3' in order to work correctly. An example input would be: 's3://my-aws-bucket/'.

For example, I can read record 100 (both .dat and .hea) stored on my-aws-bucket like this:

>>> import wfdb
>>> rec = wfdb.rdrecord('100', pn_dir='s3://my-aws-bucket')

As mentioned in #73, I think it would be nice to transition away from WFDB's dependence on files and take an approach similar to this statement:

1. You have the wfdb-python library that only takes contents as input
2. You have helper functions that download from PhysioNet and read the contents for you

Each helper function would cover a separate backend (PhysioNet, AWS S3, Hadoop HDFS, etc.); a rough sketch of what this split might look like is below. This would probably be a significant change (I think pn_dir should be changed to remote_dir at least), so something to look forward to in v4.0?
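
To make the idea concrete, here is a rough sketch of that split. Every function name here is hypothetical and not part of the current wfdb API; only the s3fs and urllib calls themselves are real:

# Hypothetical sketch of the proposed split: the core library would
# parse raw bytes it is given, and small per-backend helpers would
# fetch those bytes. None of these functions exist in wfdb today.

def fetch_from_s3(bucket_uri, file_name):
    """Fetch a file's raw bytes from an S3 bucket via s3fs."""
    import s3fs
    fs = s3fs.S3FileSystem()
    with fs.open(bucket_uri.rstrip('/') + '/' + file_name, 'rb') as f:
        return f.read()

def fetch_from_physionet(db_dir, file_name):
    """Fetch a file's raw bytes from physionet.org over HTTP."""
    import urllib.request
    url = 'https://physionet.org/files/%s/%s' % (db_dir, file_name)
    with urllib.request.urlopen(url) as f:
        return f.read()

# The core library would then expose something like a hypothetical
# wfdb.rdrecord_from_contents(header_bytes, dat_bytes, ...) that never
# touches the network or the filesystem.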

Fixes #268.

Lucas-Mc · Apr 29 '21 01:04

Nice to know this is possible, though I haven't tested it.

I do see a few problems with the implementation here that would need to be addressed:

  • Numerous functions check whether remote_dir.startswith('s3'). This conflicts with the existing namespace (for example, rdrecord("foo", pn_dir="s345db/1.0") ought to read the record http://physionet.org/files/s345db/1.0/foo). If reusing the pn_dir API is desirable, the check needs to be unambiguous (for example, remote_dir.startswith('s3://') instead of remote_dir.startswith('s3')). Both this point and the next are addressed in the sketch after this list.

  • _stream_dat ignores start_byte and byte_count, which means it will give wrong results when reading a subset of a signal file, or when reading a signal file with a prolog.

  • Rather than creating abstractions to simplify the logic, this adds complexity to both the high-level record functions and the low-level download functions.
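
A minimal sketch of how the first two points might be handled. The helper names here are hypothetical (they are not the names used in this PR); the s3fs calls themselves are real:

import s3fs

def _is_s3_uri(remote_dir):
    # Unambiguous scheme check: matches 's3://my-aws-bucket' but not a
    # PhysioNet database name that merely starts with 's3', such as
    # 's345db/1.0'.
    return remote_dir.startswith('s3://')

def _stream_s3(uri, start_byte=0, byte_count=None):
    # Ranged read: honor start_byte and byte_count so that partial
    # reads, and signal files with a prolog, return the right bytes.
    fs = s3fs.S3FileSystem()
    with fs.open(uri, 'rb') as f:
        f.seek(start_byte)
        if byte_count is None:
            return f.read()
        return f.read(byte_count)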

For a lot of reasons, including reducing the existing complexity, I'm trying to replace the download functions with an abstract file I/O layer (see pull #320.)

I know nothing about s3fs, but just looking at how you're using it here, I'm guessing that it already provides a BufferedIOBase interface or something similar, which means it ought to be simpler, and much more maintainable, to build this on top of pull #320 instead.
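
Assuming s3fs does hand back a file-like object, building on such a layer might look roughly like this. The dispatch function below is hypothetical, not the actual API of #320:

def _open_remote(uri, mode='rb'):
    # Hypothetical dispatch for an abstract file I/O layer: return a
    # file-like object for any supported URI scheme.
    if uri.startswith('s3://'):
        import s3fs
        return s3fs.S3FileSystem().open(uri, mode)
    import urllib.request
    # HTTP(S) fallback; note the response object is readable but not
    # seekable, unlike the s3fs file.
    return urllib.request.urlopen(uri)

The record readers could then consume whatever _open_remote returns without caring where the bytes live.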

I'd welcome thoughts from anyone about what the ideal API should look like. pn_dir is pretty clunky in my mind.

bemoody · Aug 20 '21 21:08