mdanalysis icon indicating copy to clipboard operation
mdanalysis copied to clipboard

Implementation of fetch_pdb()

Open jauy123 opened this issue 8 months ago • 11 comments

Fixes #4907

Changes made in this Pull Request:

This is a still work in progress, but here's a implementation of @BradyAJohnston 's code wrapped into classes. I still need to write tests and docs for the entire thing.

  • Added two classes: DownloaderBase and 'PDBDownloader' in order to implement downloading structure file from online sources such as the PDB databank.
  • Added requests as a dependency
  • mda.fetch_pdb() is implemented as a wrapper to commonly used option in 'PDBDownloader'

PR Checklist

  • [x] Issue raised/referenced?
  • [ ] Tests updated/added?
  • [ ] Documentation updated/added?
  • [ ] package/CHANGELOG file updated?

Developers Certificate of Origin

I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.


📚 Documentation preview 📚: https://mdanalysis--4943.org.readthedocs.build/en/4943/

jauy123 avatar Mar 03 '25 18:03 jauy123

I'm not sure where to put this code in the codebase, so I create a new folder for it right now. I'm open to it being moved somewhere

Some stuff which I like to still add (besides tests and docs):

  1. Verbose option for PdbDownloader.download() (I think tqdm was a dependency last time I saw?)
  2. Integration with Mdanalysis' logger
  3. Probably could wrap the cache logic into a separate function

jauy123 avatar Mar 03 '25 18:03 jauy123

I think others will have to confirm, but likely we'll want to have requests be an optional dependency to reduce core library dependencies (as the fetching of structures won't be something that lot of users will be doing).

Additional it's not finalised yet but if the mmcif reader in #2367 gets finalised then the default download shouldn't be .pdb (but can remain for now).

BradyAJohnston avatar Mar 04 '25 01:03 BradyAJohnston

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 92.73%. Comparing base (5c7c480) to head (04275e9).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4943      +/-   ##
===========================================
+ Coverage    92.68%   92.73%   +0.05%     
===========================================
  Files          180      180              
  Lines        22452    22474      +22     
  Branches      3186     3190       +4     
===========================================
+ Hits         20809    20842      +33     
- Misses        1169     1176       +7     
+ Partials       474      456      -18     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Mar 04 '25 20:03 codecov[bot]

I'm ok with that. I can make the code raise an exception if requests is not in the environment.

jauy123 avatar Mar 05 '25 02:03 jauy123

Assuming that requests will be optional dependency, how would I exactly specify in the build files? Right now, I'm just hard coding it in, so that the github CLI tests can build successfully and run.

jauy123 avatar Mar 10 '25 17:03 jauy123

You've added it to one of the optional dependency categories which is all that should be required. For the actual files where it is used you'll need to have something setup like the usage of biopython: https://github.com/MDAnalysis/mdanalysis/blob/dcaa08790c84eb36d7b2a9ecfbf0e8367df8667c/package/MDAnalysis/analysis/align.py#L200-L207

I'm not an expert on the pipelines so someone else would have to pitch in more on that.

BradyAJohnston avatar Mar 19 '25 12:03 BradyAJohnston

Thanks for the comment!

jauy123 avatar Mar 19 '25 23:03 jauy123

I happen to have another question!

Is it normal for some of the tests to not be consistent across each commit? From what I understand, each github CLI has to get and build each MDAnalysis from source, and this instance can potentially timeout from what I observe across each commit.

The macOS (of the latest commit) failed at 97% of test because it reached the max wall time of two hours.

Even then the latest Azure tests failed because of other tests in the source code which I didn't write (namely due to other tests)

From Azure_Tests Win-Python310-64bit-full (commit 651bf267076d2d7da6491608b1b5136915caf2e2)

FAIL MDAnalysisTests/coordinates/test_h5md.py::TestH5MDReaderWithRealTrajectory::test_open_filestream - Issue #2884
XFAIL MDAnalysisTests/coordinates/test_h5md.py::TestH5MDWriterWithRealTrajectory::test_write_with_drivers[core] - occasional PermissionError on windows
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_frame_collect_all_same - reason: memoryreader allows independent coordinates
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice0] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice1] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice2] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/core/test_topologyattrs.py::TestResids::test_set_atoms
XFAIL MDAnalysisTests/lib/test_util.py::test_which - util.which does not get right binary on Windows
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[N+]#N] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-N=[N+]=[N-]] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[O+]=C] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[N+]#[C-]] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/coordinates/test_dcd.py::TestDCDReader::test_frame_collect_all_same - reason: DCDReader allows independent coordinates.This behaviour is deprecated and will be changedin 3.

jauy123 avatar Mar 19 '25 23:03 jauy123

In principle, tests should pass everywhere.

The Azure tests time out in the test

_________________________ Test_Fetch_Pdb.test_timeout _________________________

which looks like something that you added. I haven't looked at your code but it might simply be the case that some stuff needs to be written differently for windows.

orbeckst avatar Mar 20 '25 18:03 orbeckst

@jauy123 do you have time to pick up this PR again? Would be great to have the feature in 2.10!

orbeckst avatar Jun 12 '25 16:06 orbeckst

I have time again. I was busy starting the end of spring break with comps, classes, and you know what.

jauy123 avatar Jun 12 '25 21:06 jauy123

@BradyAJohnston @orbeckst

Can I formally ask for a code review? I finished up with my code, and I'm currently unsure where to put it. I have placed all my code in package/MDAnalysis/web and wrote tests in testsuite/MDAnalysisTests/web/ for right now. I defined a downloader class (PDBDownloader) that I think should be placed in MDAnalysis/coordinates/, and a wrapper function (fetch_pdb)for that class which should be loaded in the main MDAnalysis namespace. I'm not sure if that should be the definitive spot for that though.

jauy123 avatar Jun 26 '25 20:06 jauy123

Even without code review, you can try to make the linters happy (click and see what they're complaining about – probably start by running black on the files that you touched).

Look at the Azure tests (such as https://dev.azure.com/mdanalysis/mdanalysis/_build/results?buildId=8197&view=logs&jobId=c20f733f-1203-5ae6-f137-2a50b85410ce&j=3c204132-2dbd-57af-ebfe-bee64916f75d&t=5bff47ff-0c7a-5995-3e15-a61472c95328 ): I see failures in your functionality https://dev.azure.com/mdanalysis/mdanalysis/_build/results?buildId=8197&view=logs&j=3c204132-2dbd-57af-ebfe-bee64916f75d&t=5bff47ff-0c7a-5995-3e15-a61472c95328&l=337 ; see if you can do something about that.

orbeckst avatar Jun 26 '25 20:06 orbeckst

Response to above questions:

  1. Can you think of other applications of BaseDownloader (eg #3377)? / Can you summarize the capabilities and advantages of your code?

Having a Superclass (i.e BaseDownloader) that has all the I/O logic built in its methods, and having multiple child instances (i.e. PDBDownloader) which handle the downloading is the main purpose segmenting into multiple classes . It is meant to encapsulate and separate download and I/O logic from each other. This allows to create multiple classes that extend BaseDownloader by providing different download logic by initializing different classes. For example, if you want to create code to download Alphafold structures, one can create an "AlphaFoldDownloader" class which rely on the logic in BaseDownloader to handle the conversion into a Universe.

I think this segmenting the code into multiple classes would overall reduce the maintenance burden rather than create one function that has every single bit of logic in it. I think that is one of the advantage of OOP in that particular manner.

  1. Can you move code from PDBDownloader into BaseDownloader to make it more reusable? As you mentioned in the prior post, the I/O part in PDBDownloader could and should be placed into BaseDownloader as that is inherently a functionality that all downloader should do.

jauy123 avatar Jul 07 '25 20:07 jauy123

@jauy123 add pooch to https://github.com/MDAnalysis/mdanalysis/blob/develop/.github/actions/setup-deps/action.yaml so that it is installed as part of the GH actions workflow.

orbeckst avatar Aug 22 '25 21:08 orbeckst

For windows Azure, make sure that your tests are skipped if pooch is not installed (there's a skipif decorator IIRC).

orbeckst avatar Aug 22 '25 21:08 orbeckst

The main code and tests for fetch_pdb() with a written docstring are done, and I would like to request a code review on what I have written. I just need to help with trying to make pooch an optional dependency and trying to get my tests to run on github automatic test runner. I also think I need to format the docstring so that it looks nicely with sphinx.

jauy123 avatar Aug 22 '25 22:08 jauy123

@IAlibay @BradyAJohnston @orbeckst @ljwoods2

Is there anything else that I need to do? I think I addressed all requested changes and this should be ready to merged.

jauy123 avatar Oct 21 '25 16:10 jauy123

@jauy123 can you please "Resolve conflicts" and get the PR into a mergeable state?

You have to move all your CHANGELOG info to the new 2.11 entry

  • move your name after IAlibay
  • move your Enhancement description to the 2.11 Enhancement section
  • be careful to not destroy anything else in the 2.10.0 entry (if in doubt, look at the pristine version https://github.com/MDAnalysis/mdanalysis/blob/release-2.10.0/package/CHANGELOG )

orbeckst avatar Oct 21 '25 16:10 orbeckst

@IAlibay @BradyAJohnston @orbeckst

I added the test, and github's machines are now returning 100% coverage. I also refilled out the changelog. I believe this PR should be okay to merge with the dev branch.

jauy123 avatar Oct 21 '25 21:10 jauy123

Ok, I think everything is order. I checked the html docs build on firefox, and it looks good from my end.

jauy123 avatar Oct 29 '25 15:10 jauy123

Thanks for the feedback!

Another thought for the future (and a possible fetchers module) is that fetch() should perhaps return pathlib.Path instances.

I would disagree with this idea simply because the standard library still has modules that still takes paths as string (i.e. os.path). Unless the only way to work with paths in python are pathlib Path objects, I prefer to return paths as a string as that has been historical python behavior, and leave users to convert to a pathlib object if they want to.

jauy123 avatar Nov 01 '25 18:11 jauy123

@BradyAJohnston Nice to meet you at the UGM!

Just want to remind you about this PR. It deviate quite a bit from the initial conception back in March, but I believe this is robust enough to be accepted within the main codebase.

jauy123 avatar Nov 12 '25 20:11 jauy123

@jauy123 I think @BradyAJohnston is currently here, maybe catch him ;-). Lesson for the future: if you have the opportunity to corner a reviewer of one of your PRs, have them review/approve right while they are sitting in front of you ;-). Potentially bribe them with a beer.

orbeckst avatar Nov 12 '25 21:11 orbeckst

Regarding pathlib: Using Path objects is forward looking and anything that os.path can do, you can do with a Path, but in a more abstract (and thus more robust) manner. I'd argue the opposite of you: if you really need a string representation of a Path for legacy code, use str(pth).

orbeckst avatar Nov 12 '25 21:11 orbeckst

Can't go to the botanical garden today xD. I got a car, but I walk everyday to campus. It would take too long for me to get back and drive there.

Lesson for the future: if you have the opportunity to corner a reviewer of one of your PRs, have them review/approve right while they are sitting in front of you ;-). Potentially bribe them with a beer.

I forgot! I just remember when I was walking home last night after dinner and brain farted it until five minutes ago. Am taking bribing people as alcohol as official sanctioned Ph.D advice in the future xD.

jauy123 avatar Nov 12 '25 21:11 jauy123

Regarding pathlib: Using Path objects is forward looking and anything that os.path can do, you can do with a Path, but in a more abstract (and thus more robust) manner. I'd argue the opposite of you: if you really need a string representation of a Path for legacy code, use str(pth).

I see the point, and I personally use pathlib.Path for all my scripts. But is there like an official PEP saying that pathlib.Path is the preferred way to deal with paths over strings? Or is it like just a community practice since pathlib has been in the standard library for so long?

My next question if anything in the standard library which only take strings, or that they all of them take Paths in python 3.10+, the minimum supported version of python that MDAnalysis supports? If pathlib.Path are basically first class objects on the same level of strings in python 3.10+ (like you mentioned in the os.path), then I see no reason to not return paths as pathlib.Path objects instead of strings.

jauy123 avatar Nov 12 '25 22:11 jauy123

the standard library still has modules that still takes paths as string (i.e. os.path)

Functions in os.path will take anything that is path-like, including a pathlib.Path

p-j-smith avatar Nov 12 '25 22:11 p-j-smith

I didn't know that os.path was so flexible. I learned something new.

orbeckst avatar Nov 12 '25 23:11 orbeckst

Small suggestion here: could we perhaps make the default format not pdb.gz but cif.gz, in line with #5142, its fixes for #5089 and the fact that:

as of April 2025 3.7% of the PDB archive is not available in legacy PDB format

source

Or at least make it default if gemmi is installed?

marinegor avatar Nov 13 '25 23:11 marinegor