mdanalysis
mdanalysis copied to clipboard
Implementation of fetch_pdb()
Fixes #4907
Changes made in this Pull Request:
This is a still work in progress, but here's a implementation of @BradyAJohnston 's code wrapped into classes. I still need to write tests and docs for the entire thing.
- Added two classes:
DownloaderBaseand 'PDBDownloader' in order to implement downloading structure file from online sources such as the PDB databank. - Added
requestsas a dependency mda.fetch_pdb()is implemented as a wrapper to commonly used option in 'PDBDownloader'
PR Checklist
- [x] Issue raised/referenced?
- [ ] Tests updated/added?
- [ ] Documentation updated/added?
- [ ]
package/CHANGELOGfile updated?
Developers Certificate of Origin
I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.
📚 Documentation preview 📚: https://mdanalysis--4943.org.readthedocs.build/en/4943/
I'm not sure where to put this code in the codebase, so I create a new folder for it right now. I'm open to it being moved somewhere
Some stuff which I like to still add (besides tests and docs):
Verboseoption forPdbDownloader.download()(I think tqdm was a dependency last time I saw?)- Integration with Mdanalysis' logger
- Probably could wrap the cache logic into a separate function
I think others will have to confirm, but likely we'll want to have requests be an optional dependency to reduce core library dependencies (as the fetching of structures won't be something that lot of users will be doing).
Additional it's not finalised yet but if the mmcif reader in #2367 gets finalised then the default download shouldn't be .pdb (but can remain for now).
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 92.73%. Comparing base (5c7c480) to head (04275e9).
Additional details and impacted files
@@ Coverage Diff @@
## develop #4943 +/- ##
===========================================
+ Coverage 92.68% 92.73% +0.05%
===========================================
Files 180 180
Lines 22452 22474 +22
Branches 3186 3190 +4
===========================================
+ Hits 20809 20842 +33
- Misses 1169 1176 +7
+ Partials 474 456 -18
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
I'm ok with that. I can make the code raise an exception if requests is not in the environment.
Assuming that requests will be optional dependency, how would I exactly specify in the build files? Right now, I'm just hard coding it in, so that the github CLI tests can build successfully and run.
You've added it to one of the optional dependency categories which is all that should be required. For the actual files where it is used you'll need to have something setup like the usage of biopython: https://github.com/MDAnalysis/mdanalysis/blob/dcaa08790c84eb36d7b2a9ecfbf0e8367df8667c/package/MDAnalysis/analysis/align.py#L200-L207
I'm not an expert on the pipelines so someone else would have to pitch in more on that.
Thanks for the comment!
I happen to have another question!
Is it normal for some of the tests to not be consistent across each commit? From what I understand, each github CLI has to get and build each MDAnalysis from source, and this instance can potentially timeout from what I observe across each commit.
The macOS (of the latest commit) failed at 97% of test because it reached the max wall time of two hours.
Even then the latest Azure tests failed because of other tests in the source code which I didn't write (namely due to other tests)
From Azure_Tests Win-Python310-64bit-full (commit 651bf267076d2d7da6491608b1b5136915caf2e2)
FAIL MDAnalysisTests/coordinates/test_h5md.py::TestH5MDReaderWithRealTrajectory::test_open_filestream - Issue #2884
XFAIL MDAnalysisTests/coordinates/test_h5md.py::TestH5MDWriterWithRealTrajectory::test_write_with_drivers[core] - occasional PermissionError on windows
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_frame_collect_all_same - reason: memoryreader allows independent coordinates
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice0] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice1] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice2] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/core/test_topologyattrs.py::TestResids::test_set_atoms
XFAIL MDAnalysisTests/lib/test_util.py::test_which - util.which does not get right binary on Windows
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[N+]#N] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-N=[N+]=[N-]] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[O+]=C] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[N+]#[C-]] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/coordinates/test_dcd.py::TestDCDReader::test_frame_collect_all_same - reason: DCDReader allows independent coordinates.This behaviour is deprecated and will be changedin 3.
In principle, tests should pass everywhere.
The Azure tests time out in the test
_________________________ Test_Fetch_Pdb.test_timeout _________________________
which looks like something that you added. I haven't looked at your code but it might simply be the case that some stuff needs to be written differently for windows.
@jauy123 do you have time to pick up this PR again? Would be great to have the feature in 2.10!
I have time again. I was busy starting the end of spring break with comps, classes, and you know what.
@BradyAJohnston @orbeckst
Can I formally ask for a code review? I finished up with my code, and I'm currently unsure where to put it. I have placed all my code in package/MDAnalysis/web and wrote tests in testsuite/MDAnalysisTests/web/ for right now. I defined a downloader class (PDBDownloader) that I think should be placed in MDAnalysis/coordinates/, and a wrapper function (fetch_pdb)for that class which should be loaded in the main MDAnalysis namespace. I'm not sure if that should be the definitive spot for that though.
Even without code review, you can try to make the linters happy (click and see what they're complaining about – probably start by running black on the files that you touched).
Look at the Azure tests (such as https://dev.azure.com/mdanalysis/mdanalysis/_build/results?buildId=8197&view=logs&jobId=c20f733f-1203-5ae6-f137-2a50b85410ce&j=3c204132-2dbd-57af-ebfe-bee64916f75d&t=5bff47ff-0c7a-5995-3e15-a61472c95328 ): I see failures in your functionality https://dev.azure.com/mdanalysis/mdanalysis/_build/results?buildId=8197&view=logs&j=3c204132-2dbd-57af-ebfe-bee64916f75d&t=5bff47ff-0c7a-5995-3e15-a61472c95328&l=337 ; see if you can do something about that.
Response to above questions:
- Can you think of other applications of BaseDownloader (eg #3377)? / Can you summarize the capabilities and advantages of your code?
Having a Superclass (i.e BaseDownloader) that has all the I/O logic built in its methods, and having multiple child instances (i.e. PDBDownloader) which handle the downloading is the main purpose segmenting into multiple classes . It is meant to encapsulate and separate download and I/O logic from each other. This allows to create multiple classes that extend BaseDownloader by providing different download logic by initializing different classes. For example, if you want to create code to download Alphafold structures, one can create an "AlphaFoldDownloader" class which rely on the logic in BaseDownloader to handle the conversion into a Universe.
I think this segmenting the code into multiple classes would overall reduce the maintenance burden rather than create one function that has every single bit of logic in it. I think that is one of the advantage of OOP in that particular manner.
- Can you move code from PDBDownloader into BaseDownloader to make it more reusable?
As you mentioned in the prior post, the I/O part in
PDBDownloadercould and should be placed intoBaseDownloaderas that is inherently a functionality that all downloader should do.
@jauy123 add pooch to https://github.com/MDAnalysis/mdanalysis/blob/develop/.github/actions/setup-deps/action.yaml so that it is installed as part of the GH actions workflow.
For windows Azure, make sure that your tests are skipped if pooch is not installed (there's a skipif decorator IIRC).
The main code and tests for fetch_pdb() with a written docstring are done, and I would like to request a code review on what I have written. I just need to help with trying to make pooch an optional dependency and trying to get my tests to run on github automatic test runner. I also think I need to format the docstring so that it looks nicely with sphinx.
@IAlibay @BradyAJohnston @orbeckst @ljwoods2
Is there anything else that I need to do? I think I addressed all requested changes and this should be ready to merged.
@jauy123 can you please "Resolve conflicts" and get the PR into a mergeable state?
You have to move all your CHANGELOG info to the new 2.11 entry
- move your name after IAlibay
- move your Enhancement description to the 2.11 Enhancement section
- be careful to not destroy anything else in the 2.10.0 entry (if in doubt, look at the pristine version https://github.com/MDAnalysis/mdanalysis/blob/release-2.10.0/package/CHANGELOG )
@IAlibay @BradyAJohnston @orbeckst
I added the test, and github's machines are now returning 100% coverage. I also refilled out the changelog. I believe this PR should be okay to merge with the dev branch.
Ok, I think everything is order. I checked the html docs build on firefox, and it looks good from my end.
Thanks for the feedback!
Another thought for the future (and a possible fetchers module) is that
fetch()should perhaps returnpathlib.Pathinstances.
I would disagree with this idea simply because the standard library still has modules that still takes paths as string (i.e. os.path). Unless the only way to work with paths in python are pathlib Path objects, I prefer to return paths as a string as that has been historical python behavior, and leave users to convert to a pathlib object if they want to.
@BradyAJohnston Nice to meet you at the UGM!
Just want to remind you about this PR. It deviate quite a bit from the initial conception back in March, but I believe this is robust enough to be accepted within the main codebase.
@jauy123 I think @BradyAJohnston is currently here, maybe catch him ;-). Lesson for the future: if you have the opportunity to corner a reviewer of one of your PRs, have them review/approve right while they are sitting in front of you ;-). Potentially bribe them with a beer.
Regarding pathlib: Using Path objects is forward looking and anything that os.path can do, you can do with a Path, but in a more abstract (and thus more robust) manner. I'd argue the opposite of you: if you really need a string representation of a Path for legacy code, use str(pth).
Can't go to the botanical garden today xD. I got a car, but I walk everyday to campus. It would take too long for me to get back and drive there.
Lesson for the future: if you have the opportunity to corner a reviewer of one of your PRs, have them review/approve right while they are sitting in front of you ;-). Potentially bribe them with a beer.
I forgot! I just remember when I was walking home last night after dinner and brain farted it until five minutes ago. Am taking bribing people as alcohol as official sanctioned Ph.D advice in the future xD.
Regarding
pathlib: UsingPathobjects is forward looking and anything thatos.pathcan do, you can do with aPath, but in a more abstract (and thus more robust) manner. I'd argue the opposite of you: if you really need a string representation of aPathfor legacy code, usestr(pth).
I see the point, and I personally use pathlib.Path for all my scripts. But is there like an official PEP saying that pathlib.Path is the preferred way to deal with paths over strings? Or is it like just a community practice since pathlib has been in the standard library for so long?
My next question if anything in the standard library which only take strings, or that they all of them take Paths in python 3.10+, the minimum supported version of python that MDAnalysis supports? If pathlib.Path are basically first class objects on the same level of strings in python 3.10+ (like you mentioned in the os.path), then I see no reason to not return paths as pathlib.Path objects instead of strings.
the standard library still has modules that still takes paths as string (i.e. os.path)
Functions in os.path will take anything that is path-like, including a pathlib.Path
I didn't know that os.path was so flexible. I learned something new.
Small suggestion here: could we perhaps make the default format not pdb.gz but cif.gz, in line with #5142, its fixes for #5089 and the fact that:
as of April 2025 3.7% of the PDB archive is not available in legacy PDB format
Or at least make it default if gemmi is installed?