wfdb-python icon indicating copy to clipboard operation
wfdb-python copied to clipboard

Make wfdb.rdann, wfdb.rdrecord, wfdb.rdsamp also accept pathlib.Path objects as the input record_name

Open wenh06 opened this issue 2 years ago • 6 comments

Currently, the functions wfdb.rdann, wfdb.rdrecord, wfdb.rdsamp only accept str for the input record_name, as they have str specific operations. wfdb.rdheader currently accepts pathlib.Path objects as input record_name since the os.path module is able to handle pathlib.Path as well as str type input.

For example, on a linux machine

import pathlib, wfdb
sp = "/home/wenhao/Jupyter/wenhao/data/CPSC2021/training_I/data_0_1"
pp = pathlib.Path(sp)
wfdb.rdrecord(pp)

would raise error

AttributeError                            Traceback (most recent call last)
<ipython-input-6-754a0c10c458> in <module>
----> 1 wfdb.rdrecord(pp)

~/.local/lib/python3.6/site-packages/wfdb/io/record.py in rdrecord(record_name, sampfrom, sampto, channels, physical, pn_dir, m2s, smooth_frames, ignore_skew, return_res, force_channels, channel_names, warn_empty)
   3426         pn_dir = posixpath.join(dir_list[0], get_version(dir_list[0]), *dir_list[1:])
   3427 
-> 3428     if record_name.endswith('.edf'):
   3429         record = edf2mit(record_name, pn_dir=pn_dir, record_only=True)
   3430     elif record_name.endswith('.wav'):

AttributeError: 'PosixPath' object has no attribute 'endswith'

Sometimes one would like to work locally with pathlib, which was introduced in Python 3.4 (PEP 428), to better handle the file paths.

wenh06 avatar Mar 18 '22 08:03 wenh06

Thank you! This looks like a great improvement.

I'm not entirely familiar with pathlib so I'm a little unsure of some of the details.

One thing to be wary of is that "file names" or "record names" might be either filesystem paths or URL paths. Some existing functions permit record_name to include a subdirectory when streaming data from PhysioNet; other functions don't allow that, and silently ignore any directory name in record_name (which I consider a bug.)

Here, for example, it looks like you're passing a pathlib.Path object to download._stream_annotation. Is that a good idea or not? I guess that posixpath.join will automatically convert the argument back into a string, but will it correctly use slashes if running on a Windows platform? And does that mean that rdann would now accept backslashes in URLs, only on Windows?

If we do want the download functions to accept a pathlib.Path as an alternative to str then that should be documented.

I'm not necessarily opposed to either saying "pathlib.Path may be used only for local files" or saying "pathlib.Path may be used either for local files or for URLs", but I do think it's important to handle URL arguments in the same way on all hosts.

bemoody avatar Mar 21 '22 16:03 bemoody

Thank you! This looks like a great improvement.

I'm not entirely familiar with pathlib so I'm a little unsure of some of the details.

One thing to be wary of is that "file names" or "record names" might be either filesystem paths or URL paths. Some existing functions permit record_name to include a subdirectory when streaming data from PhysioNet; other functions don't allow that, and silently ignore any directory name in record_name (which I consider a bug.)

Here, for example, it looks like you're passing a pathlib.Path object to download._stream_annotation. Is that a good idea or not? I guess that posixpath.join will automatically convert the argument back into a string, but will it correctly use slashes if running on a Windows platform? And does that mean that rdann would now accept backslashes in URLs, only on Windows?

If we do want the download functions to accept a pathlib.Path as an alternative to str then that should be documented.

I'm not necessarily opposed to either saying "pathlib.Path may be used only for local files" or saying "pathlib.Path may be used either for local files or for URLs", but I do think it's important to handle URL arguments in the same way on all hosts.

You're correct. On a Windows system, if you pass a Path object (actually a WindowsPath object) for the file_name, then error might occur when this file_name includes a subdirectory for

url = posixpath.join(config.db_index_url, pn_dir, file_name)

in the function download._stream_annotation. Such error can be avoided by forcing converting file_name to a PosixPath using pathlib.PurePosixPath as

url = posixpath.join(config.db_index_url, pn_dir, pathlib.PurePosixPath(file_name))

A drawback of pathlib is that it cannot handle the URLs correctly (mainly the double slashes), for example

>>> pathlib.PurePosixPath('https://physionet.org/files/')
PurePosixPath('https:/physionet.org/files')
>>> db_index_url = 'https://physionet.org/files/'
>>> sp = "./temp/13918_AV_psd.mat"
>>> pp = pathlib.Path(sp)
>>> db_index_url / pathlib.PurePosixPath("mitdb" / pp)
PurePosixPath('https:/physionet.org/files/mitdb/temp/13918_AV_psd.mat')

The following image is an example tested on a Windows machine image

wenh06 avatar Mar 23 '22 16:03 wenh06

One thing to be wary of is that "file names" or "record names" might be either filesystem paths or URL paths. Some existing functions permit record_name to include a subdirectory when streaming data from PhysioNet; other functions don't allow that, and silently ignore any directory name in record_name (which I consider a bug.)

I also think this is a bug:

>>> wfdb.rdrecord("x_mitdb/x_111", pn_dir="mitdb", sampto=2000)
NetFileNotFoundError: 404 Error: Not Found for url: https://physionet.org/files/mitdb/1.0.0/x_111.hea
>>> wfdb.rdrecord("x_111", pn_dir="mitdb/1.0.0/x_mitdb/", sampto=2000)
<wfdb.io.record.Record at 0x7f6cd4ee7c50>

wenh06 avatar Mar 23 '22 17:03 wenh06

I think there's a lot of work to do to better deal with the paths. For example, I found currently wfdb detects whether a pn_dir contains the version number using

'/' not in db_dir

or

'.' not in pn_dir

I think it's better to specify a pattern for the database version and use re.search to detect it

DB_VERSION_PATTERN = re.compile("\d+.\d+\.\d+")

wenh06 avatar Mar 23 '22 17:03 wenh06

Thanks very much for identifying the bugs. We're going to think a bit more about how we want the API to be in the next major version, such that the behavior is more explicit. Especially when adding cloud data sources.

cx1111 avatar Apr 25 '22 16:04 cx1111

Thanks very much for identifying the bugs. We're going to think a bit more about how we want the API to be in the next major version, such that the behavior is more explicit. Especially when adding cloud data sources.

I agree with you. I tried continuing to correct (enhance) path operations in other places for several hours but finally gave up. I hope that path operations would be designed in a more uniform and modern way in the next major version.

wenh06 avatar May 29 '22 15:05 wenh06