wfdb-python
Adds Amazon AWS S3 support #268
This change adds support for data stored in Amazon AWS S3 buckets. For the moment, anywhere a function takes a `pn_dir` parameter, the user can instead provide the S3 URI from which to find the desired input file. This should always begin with 's3' in order to work correctly. An example input would be: 's3://my-aws-bucket/'.
For example, I can read record 100 (both .dat and .hea) stored on my-aws-bucket like this:
>>> import wfdb
>>> rec = wfdb.rdrecord('100', pn_dir='s3://my-aws-bucket')
As mentioned in #73, I think it would be nice to transition away from WFDB's dependence on files and take an approach along these lines:
1. The wfdb-python library only takes contents as input.
2. Helper functions download from PhysioNet and read the contents for you.
Each helper function would cover a separate backend (PhysioNet, AWS S3, Hadoop HDFS, etc.). This would probably be a significant change (I think `pn_dir` should be changed to `remote_dir` at least), so it's something to look forward to in v4.0?
Fixes #268.
Nice to know this is possible, though I haven't tested it.
I do see a few problems with the implementation here that would need to be addressed:
- Numerous functions check whether `remote_dir.startswith('s3')`. This conflicts with the existing namespace (for example, `rdrecord("foo", pn_dir="s345db/1.0")` ought to read the record `http://physionet.org/files/s345db/1.0/foo`). If reusing the `pn_dir` API is desirable, it needs to be unambiguous (for example, `remote_dir.startswith('s3://')` instead of `remote_dir.startswith('s3')`).
- `_stream_dat` ignores `start_byte` and `byte_count`, which means it will give wrong results when reading a subset of a signal file, or when reading a signal file with a prolog.
- Rather than creating abstractions to simplify the logic, this adds complexity to both the high-level record functions and the low-level download functions.
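The ambiguity in the first point can be shown directly: a PhysioNet database whose slug happens to begin with "s3" would be misrouted by the bare prefix check, while the scheme-qualified check stays correct.

```python
# A legitimate PhysioNet path that happens to start with "s3".
pn_dir = "s345db/1.0"

# Bare prefix check: wrongly classifies the path as an S3 URI.
assert pn_dir.startswith("s3")

# Scheme-qualified check: correctly leaves it as a PhysioNet path,
# while a real S3 URI still matches.
assert not pn_dir.startswith("s3://")
assert "s3://my-aws-bucket/".startswith("s3://")
```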
For a lot of reasons, including reducing the existing complexity, I'm trying to replace the download functions with an abstract file I/O layer (see pull #320.)
I know nothing about s3fs, but just looking at how you're using it here, I'm guessing that it already provides a BufferedIOBase interface or something similar, which means it ought to be simpler, and much more maintainable, to do this by building on top of pull #320 instead.
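The appeal of that approach is that a ranged read only needs `seek`/`read`, so the same code path would serve local files and remote handles alike. A minimal sketch, using `io.BytesIO` as a stand-in for whatever file-like object the remote backend hands back (`read_range` is an illustrative name, not an existing wfdb function):

```python
import io

def read_range(fileobj, start_byte, byte_count):
    """Ranged read over any file-like object that supports seek/read.

    Works identically whether fileobj is a local file handle or a
    remote stream, so start_byte/byte_count are never silently
    dropped the way they are in the current _stream_dat.
    """
    fileobj.seek(start_byte)
    return fileobj.read(byte_count)

# BytesIO stands in for a remote file handle here.
buf = io.BytesIO(b"header....signal-bytes")
assert read_range(buf, 10, 6) == b"signal"
```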
I'd welcome thoughts from anyone about what the ideal API should look like. `pn_dir` is pretty clunky in my mind.