biotite
Support compressed download in `database.rcsb.fetch()`
The RCSB PDB also provides all files in gzipped format. Therefore, to improve download times in `database.rcsb.fetch()`, one could optionally download the gzipped files and unzip the HTTP response content via Python's `gzip` module before writing the structure file to disk.
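A minimal sketch of the idea, using only the standard library. The URL pattern and the names `GZ_URL`, `gunzip_content` and `fetch_gzipped` are assumptions for illustration, not part of Biotite's current API:

```python
import gzip
import urllib.request

# Assumed URL pattern for gzipped RCSB files; not taken from the Biotite source.
GZ_URL = "https://files.rcsb.org/download/{pdb_id}.{fmt}.gz"


def gunzip_content(content: bytes) -> bytes:
    """Decompress a gzipped HTTP response body held in memory."""
    return gzip.decompress(content)


def fetch_gzipped(pdb_id: str, path: str, fmt: str = "pdb") -> None:
    """Hypothetical helper: download the gzipped file and write it unzipped."""
    url = GZ_URL.format(pdb_id=pdb_id, fmt=fmt)
    with urllib.request.urlopen(url) as response:
        compressed = response.read()
    # Unzip in memory, then write the plain structure file to disk.
    with open(path, "wb") as file:
        file.write(gunzip_content(compressed))
```

Calling `fetch_gzipped("1l2y", "1l2y.pdb")` would then leave an ordinary, uncompressed PDB file on disk, so downstream code would not need to change.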
Hi!
I'm really keen to contribute to Biotite, so I ran a few tests on this. It seems that the speed-up from downloading gzipped files is fairly negligible once you account for the time required to unzip the file. The results were generated using `repeat()` from `timeit` with 10 runs of 100 repetitions each (1000 downloads in total) and are in the table below. You can find the test code here.
| download type | time (s) |
|---|---|
| pdb | 5.02787 |
| gzipped pdb | 5.00965 |
| difference | 0.01822 |
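For reference, a measurement of this kind could be set up roughly as follows. This is only a sketch and not the linked test code; `time_fetch` is a hypothetical name, and the callable passed in is assumed to perform one complete download (including unzipping, for the gzipped case):

```python
import timeit


def time_fetch(fetch_func, runs=10, repetitions=100):
    """Return one total time in seconds per run.

    Each run calls `fetch_func` `repetitions` times, so the default
    settings perform runs * repetitions = 1000 calls overall.
    """
    return timeit.repeat(fetch_func, repeat=runs, number=repetitions)
```

The per-run totals can then be compared between the plain and the gzipped download, for example by taking the minimum or the mean of each list.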
There might be a way to eke out more performance, but I'm not sure how you'd do it. If you still think this is worth adding to the library, I'm happy to finish off the implementation. Let me know what you think!
Cheers,
Ollie
Thanks for the benchmark. I created a modified version of your script (larger structure, omitted writing step) and found similar results: the differences are marginal, and it is not clear which one is faster.
Still, a compressed download probably makes sense in case bandwidth is the limiting factor. I just would not use it as the default. So if you would still like to implement this feature, feel free to do so :+1:.
Awesome, I'll start working on it 👍.