Feature/dssp
Fixes #1612
Changes made in this Pull Request:
- introduces the `MDAnalysis.analysis.dssp.DSSP` class for secondary structure analysis, using code implemented in the `pydssp` package for secondary structure annotation
- adds tests from the pydssp package
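To illustrate what the annotation output looks like downstream, here is a small hypothetical helper (not part of the PR) that summarizes a string in pydssp's simplified three-state alphabet ('-', 'H', 'E'):

```python
from collections import Counter

def ss_fractions(dssp_string: str) -> dict:
    """Fraction of residues in each simplified DSSP state.

    pydssp's three-state alphabet: '-' (loop), 'H' (helix), 'E' (strand).
    """
    counts = Counter(dssp_string)
    n = len(dssp_string)
    return {state: counts.get(state, 0) / n for state in "-HE"}

# a toy 8-residue annotation: 2 loop, 4 helix, 2 strand
print(ss_fractions("-HHHHEE-"))  # {'-': 0.25, 'H': 0.5, 'E': 0.25}
```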
PR Checklist
- [x] Tests?
- [x] Docs?
- [x] CHANGELOG updated?
- [x] Issue raised/referenced?
Developers certificate of origin
- [x] I certify that this contribution is covered by the LGPLv2.1+ license as defined in our LICENSE and adheres to the Developer Certificate of Origin.
:books: Documentation preview :books:: https://mdanalysis--4304.org.readthedocs.build/en/4304/
Hello @marinegor! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
- In the file `package/MDAnalysis/analysis/dssp/dssp.py`:
Line 2:80: E501 line too long (90 > 79 characters)
Line 14:80: E501 line too long (87 > 79 characters)
Line 16:80: E501 line too long (115 > 79 characters)
Line 27:80: E501 line too long (83 > 79 characters)
Line 46:80: E501 line too long (80 > 79 characters)
Line 62:80: E501 line too long (80 > 79 characters)
Line 71:80: E501 line too long (84 > 79 characters)
Line 78:80: E501 line too long (83 > 79 characters)
Line 80:80: E501 line too long (84 > 79 characters)
Line 82:80: E501 line too long (80 > 79 characters)
Line 115:1: W293 blank line contains whitespace
Line 116:59: W291 trailing whitespace
Line 117:80: E501 line too long (85 > 79 characters)
Line 118:80: E501 line too long (81 > 79 characters)
Line 118:82: W291 trailing whitespace
Line 119:80: E501 line too long (84 > 79 characters)
Line 120:80: E501 line too long (83 > 79 characters)
Line 123:1: W293 blank line contains whitespace
Line 125:80: E501 line too long (86 > 79 characters)
Line 125:87: W291 trailing whitespace
Line 127:80: E501 line too long (84 > 79 characters)
Line 127:85: W291 trailing whitespace
Line 128:80: E501 line too long (80 > 79 characters)
Line 128:81: W291 trailing whitespace
Line 132:1: W293 blank line contains whitespace
Line 135:1: W293 blank line contains whitespace
Line 136:80: E501 line too long (85 > 79 characters)
Line 137:80: E501 line too long (81 > 79 characters)
Line 199:75: W291 trailing whitespace
Line 228:80: E501 line too long (80 > 79 characters)
Line 232:78: W291 trailing whitespace
Line 233:80: W291 trailing whitespace
Line 234:80: E501 line too long (83 > 79 characters)
Line 266:72: W291 trailing whitespace
Line 273:80: E501 line too long (83 > 79 characters)
Line 299:80: E501 line too long (86 > 79 characters)
Line 305:80: E501 line too long (82 > 79 characters)
Line 312:80: E501 line too long (81 > 79 characters)
Line 312:82: W291 trailing whitespace
Line 313:80: E501 line too long (81 > 79 characters)
Line 322:80: E501 line too long (91 > 79 characters)
Line 323:80: E501 line too long (87 > 79 characters)
Line 332:80: E501 line too long (91 > 79 characters)
Line 361:80: E501 line too long (81 > 79 characters)
Line 392:80: E501 line too long (87 > 79 characters)
- In the file `package/MDAnalysis/analysis/dssp/pydssp_numpy.py`:
Line 68:80: E501 line too long (85 > 79 characters)
Line 157:80: E501 line too long (84 > 79 characters)
Line 179:80: E501 line too long (83 > 79 characters)
Line 206:80: E501 line too long (80 > 79 characters)
Line 207:80: E501 line too long (81 > 79 characters)
- In the file `testsuite/MDAnalysisTests/analysis/test_dssp.py`:
Line 10:80: E501 line too long (81 > 79 characters)
Line 11:80: E501 line too long (85 > 79 characters)
Line 12:80: E501 line too long (82 > 79 characters)
Line 30:80: E501 line too long (80 > 79 characters)
Line 42:80: E501 line too long (80 > 79 characters)
Line 52:80: E501 line too long (80 > 79 characters)
Line 55:80: E501 line too long (82 > 79 characters)
Comment last updated at 2024-06-13 05:15:32 UTC
Linter Bot Results:
Hi @marinegor! Thanks for making this PR. We linted your code and found the following:
Some issues were found with the formatting of your code.
| Code Location | Outcome |
|---|---|
| main package | ⚠️ Possible failure |
| testsuite | ⚠️ Possible failure |
Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/9494112403/job/26163914658
Please note: The black linter is purely informational, you can safely ignore these outcomes if there are no flake8 failures!
There's a problem I don't know the workaround for yet: prolines.
In the original pydssp, the following happens:
- atoms get parsed from the PDB regardless of residue type
- hydrogens get built automatically
- based on that, secondary structure is assigned
However, in my implementation all prolines (and N-terminal residue) are assigned '-' (=unstructured loop).
This is likely less correct than original implementation, but I'm not sure how to fix that. My suggestion would be to:
- if `guess_hydrogens=False`, use positions of existing hydrogens and fill in fake proline hydrogens with `pydssp`
- if `guess_hydrogens=True`, default to `pydssp` behaviour with fully automatic hydrogens
The prolines problem is now dealt with, exactly as I described above: if `guess_hydrogens=False`, it takes hydrogens from the atoms that have them, and guesses positions only for the rest (prolines and the N-terminus). Otherwise, it guesses all the positions.
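For reference, the hydrogen-guessing step can be sketched in plain numpy. The function name and toy geometry below are mine; the geometric rule (H placed along the bisector of the C(i-1)->N and CA->N directions, about 1.01 Å from N) follows the approach pydssp takes:

```python
import numpy as np

N_H_BOND = 1.01  # amide N-H bond length in Angstrom, as used by pydssp

def guess_amide_hydrogen(c_prev, n, ca):
    """Place a residue's backbone amide H from heavy-atom positions.

    c_prev: carbonyl C of the previous residue; n, ca: this residue's N, CA.
    H lies along the normalized bisector of the C(i-1)->N and CA->N vectors.
    """
    def unit(v):
        return v / np.linalg.norm(v)

    direction = unit(unit(n - c_prev) + unit(n - ca))
    return n + N_H_BOND * direction

# toy non-collinear geometry: the bisector points along +y, so H sits at
# (0, 1.01, 0)
c_prev = np.array([-1.0, -1.0, 0.0])
n = np.array([0.0, 0.0, 0.0])
ca = np.array([1.0, -1.0, 0.0])
print(guess_amide_hydrogen(c_prev, n, ca))  # [0.   1.01 0.  ]
```

With `guess_hydrogens=False`, a rule like this would only be applied where no experimental hydrogen exists (prolines, N-terminus).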
I surprisingly couldn't fix the `sdist_check_and_build` job, even though I added `einops` (a numpy wrapper with no runtime dependencies) to the `install_requires` in `setup.py`. If anyone could help, I'd be happy to proceed with the PR :)
I surprisingly couldn't fix the `sdist_check_and_build` job, even though I added `einops` (a numpy wrapper with no runtime dependencies) to the `install_requires` in `setup.py`. If anyone could help, I'd be happy to proceed with the PR :)
I'll have a look at packaging later this week but I would ask that einops not be a core dependency - imports should be optional for it (for now wrap with try/except, I'll try to get moving with my optional imports PR soon).
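The try/except pattern suggested here could look something like the sketch below (module-level flag name and error message are illustrative, not from the PR):

```python
# Optional-import pattern: the module imports cleanly whether or not
# einops is installed, and fails loudly only when the feature is used.
try:
    from einops import repeat, rearrange
    HAS_EINOPS = True
except ImportError:
    HAS_EINOPS = False
    repeat = rearrange = None

def require_einops():
    """Call at the entry point of any code path that needs einops."""
    if not HAS_EINOPS:
        raise ImportError(
            "DSSP analysis requires the optional dependency 'einops'; "
            "install it with 'pip install einops'."
        )
```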
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 89.49%. Comparing base (`45fb664`) to head (`d3d2cf2`). Report is 4 commits behind head on develop.
Additional details and impacted files
@@ Coverage Diff @@
## develop #4304 +/- ##
===========================================
- Coverage 93.64% 89.49% -4.16%
===========================================
Files 168 183 +15
Lines 21254 22294 +1040
Branches 3918 3427 -491
===========================================
+ Hits 19904 19951 +47
- Misses 892 1893 +1001
+ Partials 458 450 -8
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
I have a fundamental question -- there are no tests in pydssp for the core functions like `hbond_map` etc., so they are technically not covered by tests --> codecov isn't happy with the PR.
What should I do about that? Here I kinda relied on pyDSSP to implement things right, and checked the secondary structure assignments directly, without testing the core numpy functions themselves. Should I still test those (which would slow down the PR tremendously, since I've no idea how to do that)?
@marinegor I've not had time to look at this PR properly yet, but are you vendoring a bunch of code from elsewhere? If so then yes tests would be needed - otherwise there's no way to know if we're getting regressions.
That being said - if we are vendoring code, the question would be why and can we avoid that? If something upstream changes / is fixed, how would we know the fix happened?
@IAlibay
but are you vendoring a bunch of code from elsewhere?
Yes -- I took MIT licensed pydssp (https://github.com/ShintaroMinami/PyDSSP) and inserted its functions into MDAnalysis. The reason I didn't include it as a dependency is that the original package is small but includes both pytorch and numpy implementations, which would require pytorch as dependency here.
if we are vendoring code, the question would be why
Because I don't know how to write a secondary structure implementation myself with proper tests (for everything but the secondary structure itself).
If something upstream changes / is fixed, how would we know the fix happened?
There is no way indeed.
One option would probably be to fork pydssp, remove pytorch as a dependency, and include the forked repo as a dependency here? Not sure if it helps much though.
I see, I wouldn't necessarily be advocating for maintaining a fork of some other repo - that invites a lot more maintenance burden.
- Is there any way to convince upstream to make the dependency move themselves?
- Could this not be an MDAKit instead? If we're worried about upstream dependencies then that's really what kits are for.
pinging @MDAnalysis/coredevs for opinions
Is there any way to convince upstream to make the dependency move themselves?
I raised an issue on their repo in september, doesn't seem to be noted yet.
Could this not be an MDAKit instead? If we're worried about upstream dependencies then that's really what kits are for.
I'm not sure if we want to make protein secondary structure a separate MDAKit, it seems quite useful in the core analysis.
I'm not sure if @IAlibay 's ping worked, so raising this again.
Thanks @marinegor, pinging @MDAnalysis/coredevs again - please weigh in on this.
@marinegor this is a tricky one.
These are my personal opinions.
I would say that we probably shouldn't vendor a bunch of code from elsewhere. I would suggest a combination of what @IAlibay suggests and what you suggest.
- Try and convince upstream to make the dependency optional (you can contact them via an issue, email, or PR); likely an uphill fight though, so I wouldn't waste too much time on it.
- As PyDSSP can be an optional dependency, IMO it's no stress if it has some heavy dependencies itself (query @IAlibay); the user / we accept this as an issue only when using this module, if you know what I mean?
- I am fine with it in core or as a kit, I think SSP is a big missing analysis a lot of people would use from core.
Sorry for slow reply.
I should add that I can work to get pydssp on conda-forge if required.
We are talking about https://github.com/ShintaroMinami/PyDSSP/blob/master/pydssp/pydssp_numpy.py, right? 108 lines of code. That's less than pyqcprot, which we fully vendored (a long time ago).
In principle I agree with the idea that we don't want to be in the business of maintaining other code... except when it's really important for us. DSSP is missing and it should be a core analysis. That does not preclude kits that may want to do fancy things or support all of the original DSSP classifiers, but we should ship a minimal base version that works out of the box.
In my view we can either make pydssp a core dep (seems like a lot of work, and given the lack of action on the PyDSSP repo we would probably effectively be maintaining the package) or we vendor pydssp_numpy.py and run the tests that @marinegor already wrote.
I'd be in favor of the second solution but I would not include the pydssp_numpy.py code into dssp.py as done at the moment. Instead I would create a subpackage analysis/dssp (similar to analysis/hydrogenbonds) and drop dssp.py and pydssp_numpy.py into that subpackage and keep them as separate as possible. I'd write dssp.py to use an installed pydssp if available and otherwise fall back to the vendored pydssp_numpy.py.
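The "installed pydssp if available, vendored fallback otherwise" idea could be sketched like this (the vendored module path follows the subpackage layout suggested above and is an assumption, not settled API):

```python
import importlib

def load_dssp_backend(vendored="MDAnalysis.analysis.dssp.pydssp_numpy"):
    """Return the pydssp implementation module to use.

    Prefers an upstream 'pydssp' installation if the user has one;
    otherwise falls back to the vendored numpy-only copy. The default
    vendored path is illustrative.
    """
    try:
        return importlib.import_module("pydssp")
    except ImportError:
        return importlib.import_module(vendored)
```

dssp.py would then call the backend's functions through the returned module, keeping the vendored file as close to upstream as possible.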
@orbeckst I like the sound of your second solution, I think covers all bases while maintaining lots of the great work done in this PR.
@orbeckst just to check - by suggesting that we vendor the code, are you saying that we should be making einops a core dependency of MDAnalysis? Having vendored code with optional deps is possible, but maybe breaks the point of trying to avoid an optional dep with another optional dep?
If we don't want to use einops then we could keep @marinegor 's workaround. I am not saying that we must keep pydssp_numpy.py bit-identical to the original, just as little changed as possible.
If we were to use einops more in other parts (which we could do, I think it's cool) then I don't think it would be a burden: when I looked at its pyproject.toml briefly it seemed to have no deps and treat every backend as optional (which may be an interesting trick for us to learn). However, until that's the case, keeping it optional seems totally fine to me.
thanks everyone for your input -- point by point below
@hmacdope
Try and convince upstream to make optional dependency (you can contact them on an issue or email or PR) likely an uphill fight though so I wouldn't waste too much time.
From what I'm seeing, pydssp is simply a weekend project that, even if we convince them to make changes, won't be supported (all the main commits were made within 2 days, followed by a single fix 3 months later, about 2 years ago).
@orbeckst
In principle I agree with the idea that we don't want to be in the business of maintaining other code... except when it's really important for us. DSSP is missing and it should be a core analysis. That does not preclude kits that may want to do fancy things or support all of the original DSSP classifiers, but we should ship a minimal base version that works out of the box.
I support this -- the existing ~100 lines of code that I took from there don't do anything fancy (e.g. don't distinguish 3-10 vs pi helices, or different beta-sheets), but they are a good basis for implementing those distinctions. That could be done in the future even within MDAnalysis, but it's unlikely that the author of the original package will do it.
I'd be in favor of the second solution but I would not include the pydssp_numpy.py code into dssp.py as done at the moment. Instead I would create a subpackage analysis/dssp (similar to analysis/hydrogenbonds) and drop dssp.py and pydssp_numpy.py into that subpackage and keep them as separate as possible. I'd write dssp.py to use an installed pydssp if available and otherwise fall back to the vendored pydssp_numpy.py.
That seems like a viable option to me.
If we don't want to use einops then we could keep @marinegor 's workaround. I am not saying that we must keep pydssp_numpy.py bit-identical to the original, just as little changed as possible.
I don't think there are any workarounds that allow running dssp without einops -- I haven't changed the code, so it's still dependent on their core functions. But I do agree that einops is cool, and having it as a dependency isn't a burden, simply by its design. Also, it's indeed a library, and will likely stay supported and tested, in contrast to pydssp, so having it as a dependency is certainly a better idea @IAlibay
To sum up, I'll try to:
- [x] move `analysis/dssp.py` into a separate submodule
- [x] re-write it to use an installed `pydssp` if possible, or fall back to the built-in functions vendored from there if not
- [x] try to get rid of `einops` if possible (I'll be writing some tests for the helper functions anyway, and will probably come up with something in the meantime)
If we don't want to use einops then we could keep @marinegor 's workaround. I am not saying that we must keep pydssp_numpy.py bit-identical to the original, just as little changed as possible.
If we were to use einops more in other parts (which we could do, I think it's cool) then I don't think it would be a burden: when I looked at its pyproject.toml briefly it seemed to have no deps and treat every backend as optional (which may be an interesting trick for us to learn). However, until that's the case, keeping it optional seems totally fine to me.
@orbeckst the main issue here is how well maintained that package is - even pure python packages go out of maintenance, see py3.12 and how breaking it was. Given that their CI doesn't even have an entry for py3.12 I'd be rather careful here.
the main issue here is how well maintained that package is - even pure python packages go out of maintenance, see py3.12 and how breaking it was. Given that their CI doesn't even have an entry for py3.12 I'd be rather careful here.
Fair point, although pure Python/noarch is probably still the one that is easiest to catch up. (Maybe make a PR for their CI to include 3.12.)
I don't think there are any workarounds that allow running dssp without einops -- I haven't changed the code, so it's still dependent on their core functions.
Thanks for the clarification (EDIT: Originally I thought that the dssp code could work without einops, but I hadn't looked carefully enough): The current DSSP implementation requires einops for `repeat` and `rearrange`. From a quick look at https://einops.rocks/#api , under numpy they correspond to
- `repeat` -> `np.tile(x, ...)`
- `rearrange` -> `x.reshape(x.shape[0], -1)` (depending on what you wanted to rearrange)
although changing py_dssp.py to fall back to numpy versions if einops doesn't exist will likely be pretty ugly. Then the question would be if it's even worth keeping einops around — I don't know if performance is an issue here, but readability is.
It sounds more sane to me to keep py_dssp.py around as written. Opinions?
Ultimately this is coming down to how we would want to handle the einops dependency. The safer approach is as an optional dependency (than as a core dep). I think that's ok for something in analysis. If we get feedback from users who are annoyed/confused if dssp doesn't work out of the box then we can rethink.
@IAlibay — if we make einops optional (in pyproject.toml and the pypi package), would we still make it a normal dependency in the conda package? How are we currently doing this for other optional dependencies?
just to check - by suggesting that we vendor the code, are you saying that we should be making einops a core dependency of MDAnalysis? Having vendored code with optional deps is possible, but maybe breaks the point of trying to avoid an optional dep with another optional dep?
Basically yes (EDIT: or no?): Vendor py_dssp.py (which is from a package that is not continuously maintained) and make it optionally dependent on einops (which looks pretty well maintained). (EDIT: so not make einops a core dependency, along the lines of keeping dependencies slim.)
EDIT: If someone writes a cython version for dssp then we can eventually slot that in to replace py_dssp.py but in order to provide some version of dssp this looks good enough to me.
@orbeckst
although changing py_dssp.py to fall back to numpy versions if einops doesn't exist will likely be pretty ugly. Then the question would be if it's even worth keeping einops around — I don't know if performance is an issue here, but readability is.
I don't think there is much of a performance issue -- it seems like under the hood einops just comes up with a pure numpy function based on tile/einsum/etc.; only the basic set of functions differs slightly from framework to framework.
In principle, it could be possible to get the intermediate representation -- I asked a question on their discussions about it. We'll see how it goes.
UPD: wow that was fast (8 minutes) -- there's no obvious way to "decompile" numpy backend, but you can do that with jax backend, and kinda guess how the numpy function would look like. This, combined with the fact that most of rearrange/repeat uses are associated with batching/re-batching, gives me hope that I can re-write that cleaner and without dependencies.
It sounds more sane to me to keep py_dssp.py around as written. Opinions?
To be honest, most of the einops operations are actually related to the fact that the code is batched, i.e. allows computation for multiple structures at a time, with either numpy or pytorch. If we remove that, the code will likely benefit both in readability and dependability -- einops won't be an issue anymore.
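As a sketch of what an unbatched version could look like (function name and layout are mine, not the PR's; the constants are the standard Kabsch-Sander electrostatic model that DSSP is built on), plain broadcasting replaces the batched einops juggling:

```python
import numpy as np

Q1Q2, F = 0.084, 332.0  # Kabsch-Sander electrostatic constants (kcal/mol)

def hbond_energy_map(n, h, c, o):
    """Pairwise Kabsch-Sander H-bond energies for ONE structure (no batch).

    n, h, c, o: (n_res, 3) arrays of backbone N, H, C, O positions.
    Entry [i, j] is the energy of donor i (N-H) with acceptor j (C=O):
    E = q1*q2 * (1/r_ON + 1/r_CH - 1/r_OH - 1/r_CN) * f
    """
    def dist(a, b):
        # (n,3), (m,3) -> (n,m) pairwise distances via broadcasting
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

    return Q1Q2 * F * (
        1 / dist(n, o) + 1 / dist(h, c) - 1 / dist(h, o) - 1 / dist(n, c)
    )
```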
In my personal opinion, einops is popular enough (7.6k stars, 20k dependents) to just have as a dependency.
Fair point, although pure Python/noarch is probably still the one that is easiest to catch up. (Maybe make a PR for their CI to include 3.12.)
This would be a good idea; there haven't been releases since October '23 and I don't see any mentions of Py3.12. We should at the very least check that the sdist builds on py3.12.
Hi @MDAnalysis/coredevs, re-writing with numpy turned out not to be that big a deal -- see the updated version.
I rewrote everything so that it uses pydssp if it's installed, otherwise falls back to my implementation, and added pydssp to pyproject.toml.
Now I have one issue with the code: if someone decides to do `pip install MDAnalysis[analysis]`, they will automatically pull pydssp, and hence torch, which is almost 1 GB. I really don't like this, since torch isn't used in this implementation at all.
I would probably suggest not having pydssp as a dependency in pyproject.toml, but leave the code as is -- i.e. if the user already has pydssp, it'll be used.
I would probably suggest not having pydssp as a dependency in pyproject.toml, but leave the code as is -- i.e. if the user already has pydssp, it'll be used.
You could make a special sub-section, something along the lines of extras or whatever fancy thing you want - then you can have it in pyproject.toml.
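Such an extras section might look like the fragment below (the group name and version bound are made up for illustration):

```toml
# pyproject.toml: a hypothetical extras group, so pydssp is only pulled
# in on explicit request, e.g. `pip install MDAnalysis[dssp]`
[project.optional-dependencies]
dssp = ["pydssp"]
```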
I would probably suggest not having pydssp as a dependency in pyproject.toml, but leave the code as is -- i.e. if the user already has pydssp, it'll be used.
I agree that we shouldn't pull pydssp under normal circumstances.
You can do what @IAlibay suggests.
Perhaps also add a section to the docs stating explicitly that you can pip install pydssp to use the native pydssp numpy version. For me it would be sufficient to have this in the docs and then I wouldn't need a pip extra section, unless we're doing one that installs every optional dependency anywhere in MDA.... maybe that would be useful, just so that it is possible to do a "batteries included" install.