Performance issue when reading FLAC-format data via HTTP.
I tested the performance of the wfdb Python library when reading waveforms via HTTP.
- mimic3wdb
import datetime
import wfdb
dtstart = datetime.datetime.now()
wfdb.rdsamp('3000003', pn_dir='mimic3wdb/1.0/30/3000003')
print(datetime.datetime.now() - dtstart)
# results 0:00:21.143365
- mitdb
dtstart = datetime.datetime.now()
wfdb.rdsamp('100', pn_dir='mitdb/1.0.0')
print(datetime.datetime.now() - dtstart)
# results 0:00:02.764091
These results look fine. However, when I tried to read mimic4wdb, which uses the FLAC format, performance dropped significantly.
- mimic4wdb
dtstart = datetime.datetime.now()
wfdb.rdsamp('81739927', pn_dir='mimic4wdb/0.1.0/waves/p100/p10014354/81739927')
print(datetime.datetime.now() - dtstart)
# results 0:07:26.220685
The issue went away when I cached whole files by setting buffering=-2 for the openurl function in _url.py: the same read then took 0:00:37.388115.
After digging a bit more, I found that the problem is caused by the read function being called repeatedly, frame by frame, in the _cdata_io function in soundfile.py. Each call to read triggers session.request in _url.py, so every frame costs a separate HTTP request. This causes a significant performance problem and also puts stress on the web server.
So it seems like a good idea to make buffering=-2 the default until this is fixed. Reducing the number of requests both improves performance and reduces the load on the web server.
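To illustrate the effect, here is a minimal, self-contained sketch. It does not use wfdb or soundfile. CountingRemoteFile, read_in_frames, FRAME_SIZE and PAYLOAD are hypothetical names made up for this example, and the frame size is arbitrary. It only assumes that every read() on the remote file costs one HTTP request, as described above, and compares reading frame by frame against fetching the whole file once and reading from memory.

```python
import io

FRAME_SIZE = 4096            # assumed size of each small read, in bytes
PAYLOAD = bytes(1_000_000)   # stand-in for a 1 MB remote FLAC file


class CountingRemoteFile:
    """Hypothetical remote file: every read() counts as one HTTP request."""

    def __init__(self, payload):
        self._payload = payload
        self._pos = 0
        self.requests = 0

    def read(self, size=-1):
        self.requests += 1  # one simulated HTTP round trip per call
        if size < 0:
            size = len(self._payload) - self._pos
        chunk = self._payload[self._pos:self._pos + size]
        self._pos += len(chunk)
        return chunk


def read_in_frames(f, frame_size):
    """Read a file-like object in small pieces, as a frame-based decoder would."""
    total = 0
    while True:
        chunk = f.read(frame_size)
        if not chunk:
            break
        total += len(chunk)
    return total


# Case 1: no caching, every small read goes to the "server".
unbuffered = CountingRemoteFile(PAYLOAD)
n_unbuffered = read_in_frames(unbuffered, FRAME_SIZE)

# Case 2: fetch the whole file once, then serve small reads from memory.
remote = CountingRemoteFile(PAYLOAD)
cached = io.BytesIO(remote.read())
n_cached = read_in_frames(cached, FRAME_SIZE)

print("requests without caching:", unbuffered.requests)
print("requests with whole-file caching:", remote.requests)
```

Both cases read the same number of bytes, but the cached case makes a single request regardless of how many small reads the decoder issues. The trade-off is that the whole file has to fit in memory.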