mdanalysis icon indicating copy to clipboard operation
mdanalysis copied to clipboard

Issue 3659 fixed loading for MOL2 and PDB edge cases

Open zwsmith200 opened this issue 3 years ago • 11 comments
trafficstars

Fixes #3659

Changes made in this Pull Request:

  • New squash function squash_by_attributes that uses unique combinations of attributes to determine residues
  • MOL2Parser and PDBParser now use this function to avoid issues when loading non-unique resids and non-contiguous residues (#3659)

PR Checklist

  • [x] Tests?
  • [x] Docs?
  • [x] CHANGELOG updated?
  • [x] Issue raised/referenced?

zwsmith200 avatar May 11 '22 18:05 zwsmith200

Hello @zwsmith200! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2023-11-07 19:25:28 UTC

pep8speaks avatar May 11 '22 18:05 pep8speaks

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (4b8b924) 93.37% compared to head (6d965b0) 93.37%. Report is 8 commits behind head on develop.

Additional details and impacted files
@@            Coverage Diff             @@
##           develop    #3662     +/-   ##
==========================================
  Coverage    93.37%   93.37%             
==========================================
  Files          170      184     +14     
  Lines        22295    23465   +1170     
  Branches      4075     4084      +9     
==========================================
+ Hits         20818    21911   +1093     
- Misses         962     1036     +74     
- Partials       515      518      +3     
Files Coverage Δ
package/MDAnalysis/topology/MOL2Parser.py 98.27% <100.00%> (+0.03%) :arrow_up:
package/MDAnalysis/topology/PDBParser.py 99.50% <100.00%> (+0.01%) :arrow_up:
package/MDAnalysis/topology/base.py 96.61% <100.00%> (+0.95%) :arrow_up:

... and 20 files with indirect coverage changes

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar May 11 '22 18:05 codecov[bot]

Welcome to contributing to MDAnalysis! Thanks for the PR.

Sorry, I don't have time to review, I just wanted to ask if this PR is going to fix #3659 ? If so, please enter the issue number in the Fixes # line at the top of your issue report so that PRs and issues get linked.

orbeckst avatar May 12 '22 10:05 orbeckst

So I should ask here - the fact that both squash_by and change_squash exist just next to each other makes me wonder if there is a file format specific reason as to why these behaviours are different :/ Is there anything in the PR history that explains why these exist?

Nevermind, I better understand the issue now.

IAlibay avatar May 12 '22 12:05 IAlibay

The scope has expanded slightly since the initial issue. Originally I was concerned that there were different behaviours when reading PDB and MOL2, namely the MOL2 format deleted the resname with a repeated resid while PDB did not. I then went to re-use change_squash from PDB and realized this will double the number of residues if they are not contiguous (for example openbabel adds hydrogens to the end of PDBs).

In light of those findings, I wrote a squash function that is robust to both repeated resids with different attributes and non-contiguous residues. I have applied this change to the MOL2 and PDB parsers but I suspect experts in other formats may want to take a look at whether similar unit tests are needed.

zwsmith200 avatar May 12 '22 15:05 zwsmith200

I need to double check this, hopefully in the next few days

richardjgowers avatar May 12 '22 15:05 richardjgowers

@zwsmith200 is this still something you are looking to contribute?

IAlibay avatar Aug 14 '23 19:08 IAlibay

Yes, I can finish cleaning this up. I got it working for my now-completed project then forgot about the PR.

zwsmith200 avatar Aug 14 '23 19:08 zwsmith200

Linter Bot Results:

Hi @zwsmith200! Thanks for making this PR. We linted your code and found the following:

Some issues were found with the formatting of your code.

Code Location Outcome
main package ⚠️ Possible failure
testsuite ⚠️ Possible failure

Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/6789351225/job/18456338341


Please note: The black linter is purely informational, you can safely ignore these outcomes if there are no flake8 failures!

github-actions[bot] avatar Nov 02 '23 00:11 github-actions[bot]

Hello @zwsmith200! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 409:80: E501 line too long (80 > 79 characters) Line 418:80: E501 line too long (81 > 79 characters)

Line 325:80: E501 line too long (86 > 79 characters)

Line 388:80: E501 line too long (87 > 79 characters)

Comment last updated at 2023-11-06 01:06:50 UTC

I realize darker used max length of 88 and we use 79. It may be useful to add -l 79 here so that copying the command matches our requirements.

zwsmith200 avatar Nov 06 '23 02:11 zwsmith200

@richardjgowers @IAlibay It looks like I have addressed everything I needed to, let me know if I missed something.

zwsmith200 avatar Nov 07 '23 19:11 zwsmith200