mdanalysis
mdanalysis copied to clipboard
Issue 3659 fixed loading for MOL2 and PDB edge cases
Fixes #3659
Changes made in this Pull Request:
- New squash function
squash_by_attributesthat uses unique combinations of attributes to determine residues MOL2ParserandPDBParsernow use this function to avoid issues when loading non-unique resids and non-contiguous residues (#3659)
PR Checklist
- [x] Tests?
- [x] Docs?
- [x] CHANGELOG updated?
- [x] Issue raised/referenced?
Hello @zwsmith200! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:
Comment last updated at 2023-11-07 19:25:28 UTC
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Comparison is base (
4b8b924) 93.37% compared to head (6d965b0) 93.37%. Report is 8 commits behind head on develop.
Additional details and impacted files
@@ Coverage Diff @@
## develop #3662 +/- ##
==========================================
Coverage 93.37% 93.37%
==========================================
Files 170 184 +14
Lines 22295 23465 +1170
Branches 4075 4084 +9
==========================================
+ Hits 20818 21911 +1093
- Misses 962 1036 +74
- Partials 515 518 +3
| Files | Coverage Δ | |
|---|---|---|
| package/MDAnalysis/topology/MOL2Parser.py | 98.27% <100.00%> (+0.03%) |
:arrow_up: |
| package/MDAnalysis/topology/PDBParser.py | 99.50% <100.00%> (+0.01%) |
:arrow_up: |
| package/MDAnalysis/topology/base.py | 96.61% <100.00%> (+0.95%) |
:arrow_up: |
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Welcome to contributing to MDAnalysis! Thanks for the PR.
Sorry, I don't have time to review, I just wanted to ask if this PR is going to fix #3659 ? If so, please enter the issue number in the Fixes # line at the top of your issue report so that PRs and issues get linked.
So I should ask here - the fact that both squash_by and change_squash exist just next to each other makes me wonder if there is a file format specific reason as to why these behaviours are different :/ Is there anything in the PR history that explains why these exist?
Nevermind, I better understand the issue now.
The scope has expanded slightly since the initial issue. Originally I was concerned that there were different behaviours when reading PDB and MOL2, namely the MOL2 format deleted the resname with a repeated resid while PDB did not. I then went to re-use change_squash from PDB and realized this will double the number of residues if they are not contiguous (for example openbabel adds hydrogens to the end of PDBs).
In light of those findings, I wrote a squash function that is robust to both repeated resids with different attributes and non-contiguous residues. I have applied this change to the MOL2 and PDB parsers but I suspect experts in other formats may want to take a look at whether similar unit tests are needed.
I need to double check this, hopefully in the next few days
@zwsmith200 is this still something you are looking to contribute?
Yes, I can finish cleaning this up. I got it working for my now-completed project then forgot about the PR.
Linter Bot Results:
Hi @zwsmith200! Thanks for making this PR. We linted your code and found the following:
Some issues were found with the formatting of your code.
| Code Location | Outcome |
|---|---|
| main package | ⚠️ Possible failure |
| testsuite | ⚠️ Possible failure |
Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/6789351225/job/18456338341
Please note: The black linter is purely informational, you can safely ignore these outcomes if there are no flake8 failures!
Hello @zwsmith200! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
- In the file
package/MDAnalysis/topology/PDBParser.py:Line 409:80: E501 line too long (80 > 79 characters) Line 418:80: E501 line too long (81 > 79 characters)
- In the file
testsuite/MDAnalysisTests/topology/test_mol2.py:Line 325:80: E501 line too long (86 > 79 characters)
- In the file
testsuite/MDAnalysisTests/topology/test_pdb.py:Line 388:80: E501 line too long (87 > 79 characters)
Comment last updated at 2023-11-06 01:06:50 UTC
I realize darker used max length of 88 and we use 79. It may be useful to add -l 79 here so that copying the command matches our requirements.
@richardjgowers @IAlibay It looks like I have addressed everything I needed to, let me know if I missed something.