Feature/dssp
Fixes #1612
Changes made in this Pull Request:
- introduces the `MDAnalysis.analysis.dssp.DSSP` class for secondary structure analysis, using code implemented in the `pydssp` package for secondary structure annotation
- adds tests from the pydssp package
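To illustrate what the annotation output looks like downstream, here is a small hypothetical helper (not part of the PR) that summarizes a string in pydssp's simplified three-state alphabet ('-', 'H', 'E'):

```python
from collections import Counter

def ss_fractions(dssp_string: str) -> dict:
    """Fraction of residues in each simplified DSSP state.

    pydssp's three-state alphabet: '-' (loop), 'H' (helix), 'E' (strand).
    """
    counts = Counter(dssp_string)
    n = len(dssp_string)
    return {state: counts.get(state, 0) / n for state in "-HE"}

# a toy 8-residue annotation: 2 loop, 4 helix, 2 strand
print(ss_fractions("-HHHHEE-"))  # {'-': 0.25, 'H': 0.5, 'E': 0.25}
```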
PR Checklist
- [x] Tests?
- [x] Docs?
- [x] CHANGELOG updated?
- [x] Issue raised/referenced?
Developers certificate of origin
- [x] I certify that this contribution is covered by the LGPLv2.1+ license as defined in our LICENSE and adheres to the Developer Certificate of Origin.
:books: Documentation preview :books:: https://mdanalysis--4304.org.readthedocs.build/en/4304/
Hello @marinegor! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
- In the file `package/MDAnalysis/analysis/dssp/dssp.py`:
Line 2:80: E501 line too long (90 > 79 characters)
Line 14:80: E501 line too long (87 > 79 characters)
Line 16:80: E501 line too long (115 > 79 characters)
Line 27:80: E501 line too long (83 > 79 characters)
Line 46:80: E501 line too long (80 > 79 characters)
Line 62:80: E501 line too long (80 > 79 characters)
Line 71:80: E501 line too long (84 > 79 characters)
Line 78:80: E501 line too long (83 > 79 characters)
Line 80:80: E501 line too long (84 > 79 characters)
Line 82:80: E501 line too long (80 > 79 characters)
Line 115:1: W293 blank line contains whitespace
Line 116:59: W291 trailing whitespace
Line 117:80: E501 line too long (85 > 79 characters)
Line 118:80: E501 line too long (81 > 79 characters)
Line 118:82: W291 trailing whitespace
Line 119:80: E501 line too long (84 > 79 characters)
Line 120:80: E501 line too long (83 > 79 characters)
Line 123:1: W293 blank line contains whitespace
Line 125:80: E501 line too long (86 > 79 characters)
Line 125:87: W291 trailing whitespace
Line 127:80: E501 line too long (84 > 79 characters)
Line 127:85: W291 trailing whitespace
Line 128:80: E501 line too long (80 > 79 characters)
Line 128:81: W291 trailing whitespace
Line 132:1: W293 blank line contains whitespace
Line 135:1: W293 blank line contains whitespace
Line 136:80: E501 line too long (85 > 79 characters)
Line 137:80: E501 line too long (81 > 79 characters)
Line 199:75: W291 trailing whitespace
Line 228:80: E501 line too long (80 > 79 characters)
Line 232:78: W291 trailing whitespace
Line 233:80: W291 trailing whitespace
Line 234:80: E501 line too long (83 > 79 characters)
Line 266:72: W291 trailing whitespace
Line 273:80: E501 line too long (83 > 79 characters)
Line 299:80: E501 line too long (86 > 79 characters)
Line 305:80: E501 line too long (82 > 79 characters)
Line 312:80: E501 line too long (81 > 79 characters)
Line 312:82: W291 trailing whitespace
Line 313:80: E501 line too long (81 > 79 characters)
Line 322:80: E501 line too long (91 > 79 characters)
Line 323:80: E501 line too long (87 > 79 characters)
Line 332:80: E501 line too long (91 > 79 characters)
Line 361:80: E501 line too long (81 > 79 characters)
Line 392:80: E501 line too long (87 > 79 characters)
- In the file `package/MDAnalysis/analysis/dssp/pydssp_numpy.py`:
Line 68:80: E501 line too long (85 > 79 characters)
Line 157:80: E501 line too long (84 > 79 characters)
Line 179:80: E501 line too long (83 > 79 characters)
Line 206:80: E501 line too long (80 > 79 characters)
Line 207:80: E501 line too long (81 > 79 characters)
- In the file `testsuite/MDAnalysisTests/analysis/test_dssp.py`:
Line 10:80: E501 line too long (81 > 79 characters)
Line 11:80: E501 line too long (85 > 79 characters)
Line 12:80: E501 line too long (82 > 79 characters)
Line 30:80: E501 line too long (80 > 79 characters)
Line 42:80: E501 line too long (80 > 79 characters)
Line 52:80: E501 line too long (80 > 79 characters)
Line 55:80: E501 line too long (82 > 79 characters)
Comment last updated at 2024-06-13 05:15:32 UTC
Linter Bot Results:
Hi @marinegor! Thanks for making this PR. We linted your code and found the following:
Some issues were found with the formatting of your code.
| Code Location | Outcome |
|---|---|
| main package | ⚠️ Possible failure |
| testsuite | ⚠️ Possible failure |
Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/9494112403/job/26163914658
Please note: The black linter is purely informational, you can safely ignore these outcomes if there are no flake8 failures!
There's a problem I don't know the workaround for yet: prolines.
In the original pydssp, the following happens:
- atoms get parsed from the PDB regardless of residue type
- hydrogens get built automatically
- based on that, secondary structure is assigned
However, in my implementation all prolines (and N-terminal residue) are assigned '-' (=unstructured loop).
This is likely less correct than original implementation, but I'm not sure how to fix that. My suggestion would be to:
- if `guess_hydrogens=False`, use positions of existing hydrogens and fill in fake proline hydrogens with `pydssp`
- if `guess_hydrogens=True`, default to `pydssp` behaviour with fully automatic hydrogens
The prolines problem is now dealt with, exactly as I described above: if `guess_hydrogens=False`, it takes hydrogens from the atoms that have them, and guesses positions only for the rest (prolines and the N-terminus). Otherwise, it guesses all the positions.
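For reference, the hydrogen-guessing step can be sketched in plain numpy. The function name and toy geometry below are mine; the geometric rule (H placed along the bisector of the C(i-1)->N and CA->N directions, about 1.01 Å from N) follows the approach pydssp takes:

```python
import numpy as np

N_H_BOND = 1.01  # amide N-H bond length in Angstrom, as used by pydssp

def guess_amide_hydrogen(c_prev, n, ca):
    """Place a residue's backbone amide H from heavy-atom positions.

    c_prev: carbonyl C of the previous residue; n, ca: this residue's N, CA.
    H lies along the normalized bisector of the C(i-1)->N and CA->N vectors.
    """
    def unit(v):
        return v / np.linalg.norm(v)

    direction = unit(unit(n - c_prev) + unit(n - ca))
    return n + N_H_BOND * direction

# toy non-collinear geometry: the bisector points along +y, so H sits at
# (0, 1.01, 0)
c_prev = np.array([-1.0, -1.0, 0.0])
n = np.array([0.0, 0.0, 0.0])
ca = np.array([1.0, -1.0, 0.0])
print(guess_amide_hydrogen(c_prev, n, ca))  # [0.   1.01 0.  ]
```

With `guess_hydrogens=False`, a rule like this would only be applied where no experimental hydrogen exists (prolines, N-terminus).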
I surprisingly couldn't fix the `sdist_check_and_build` job, even though I added `einops` (a numpy wrapper with no runtime dependencies) to the `install_requires` in `setup.py`. If anyone could help, I'd be happy to proceed with the PR :)
I surprisingly couldn't fix the `sdist_check_and_build` job, even though I added `einops` (a numpy wrapper with no runtime dependencies) to the `install_requires` in `setup.py`. If anyone could help, I'd be happy to proceed with the PR :)
I'll have a look at packaging later this week but I would ask that einops not be a core dependency - imports should be optional for it (for now wrap with try/except, I'll try to get moving with my optional imports PR soon).
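The try/except pattern suggested here could look something like the sketch below (module-level flag name and error message are illustrative, not from the PR):

```python
# Optional-import pattern: the module imports cleanly whether or not
# einops is installed, and fails loudly only when the feature is used.
try:
    from einops import repeat, rearrange
    HAS_EINOPS = True
except ImportError:
    HAS_EINOPS = False
    repeat = rearrange = None

def require_einops():
    """Call at the entry point of any code path that needs einops."""
    if not HAS_EINOPS:
        raise ImportError(
            "DSSP analysis requires the optional dependency 'einops'; "
            "install it with 'pip install einops'."
        )
```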
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 89.49%. Comparing base (`45fb664`) to head (`d3d2cf2`). Report is 4 commits behind head on develop.
Additional details and impacted files
@@ Coverage Diff @@
## develop #4304 +/- ##
===========================================
- Coverage 93.64% 89.49% -4.16%
===========================================
Files 168 183 +15
Lines 21254 22294 +1040
Branches 3918 3427 -491
===========================================
+ Hits 19904 19951 +47
- Misses 892 1893 +1001
+ Partials 458 450 -8
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
I have a fundamental question -- there are no tests in pydssp for the core functions like `hbond_map` etc., so they are technically not covered by tests --> codecov isn't happy with the PR.
What should I do about that? Here I kinda relied on pyDSSP to implement things right, and checked the secondary structure assignments directly, without testing the core numpy functions themselves. Should I still test those (which would slow down the PR tremendously, since I've no idea how to do that)?
@marinegor I've not had time to look at this PR properly yet, but are you vendoring a bunch of code from elsewhere? If so then yes tests would be needed - otherwise there's no way to know if we're getting regressions.
That being said - if we are vendoring code, the question would be why and can we avoid that? If something upstream changes / is fixed, how would we know the fix happened?
@IAlibay
but are you vendoring a bunch of code from elsewhere?
Yes -- I took MIT licensed pydssp (https://github.com/ShintaroMinami/PyDSSP) and inserted its functions into MDAnalysis. The reason I didn't include it as a dependency is that the original package is small but includes both pytorch and numpy implementations, which would require pytorch as dependency here.
if we are vendoring code, the question would be why
Because I don't know how to write a secondary structure implementation myself with proper tests (for everything but the secondary structure itself).
If something upstream changes / is fixed, how would we know the fix happened?
There is no way indeed.
One option would probably be to fork pydssp, remove pytorch as a dependency, and include the forked repo as a dependency here? Not sure if it helps much though.
I see, I wouldn't necessarily be advocating for maintaining a fork of some other repo - that invites a lot more maintenance burden.
- Is there any way to convince upstream to make the dependency move themselves?
- Could this not be an MDAKit instead? If we're worried about upstream dependencies then that's really what kits are for.
pinging @MDAnalysis/coredevs for opinions
Is there any way to convince upstream to make the dependency move themselves?
I raised an issue on their repo in september, doesn't seem to be noted yet.
Could this not be an MDAKit instead? If we're worried about upstream dependencies then that's really what kits are for.
I'm not sure if we want to make protein secondary structure a separate MDAKit, it seems quite useful in the core analysis.
I'm not sure if @IAlibay 's ping worked, so raising this again.
Thanks @marinegor, pinging @MDAnalysis/coredevs again - please weigh in on this.
@marinegor this is a tricky one.
These are my personal opinions.
I would say that we probably shouldn't vendor a bunch of code from elsewhere. I would suggest a combination of what @IAlibay suggests and what you suggest.
- Try and convince upstream to make the dependency optional (you can contact them via an issue, email, or PR); likely an uphill fight though, so I wouldn't waste too much time on it.
- As PyDSSP can be an optional dependency, IMO it's no stress if it has some heavy dependencies itself (query @IAlibay); the user / we accept this as an issue only when using this module, if you know what I mean?
- I am fine with it in core or as a kit, I think SSP is a big missing analysis a lot of people would use from core.
Sorry for slow reply.
I should add that I can work to get pydssp on conda-forge if required.
We are talking about https://github.com/ShintaroMinami/PyDSSP/blob/master/pydssp/pydssp_numpy.py, right? 108 lines of code. That's less than pyqcprot, which we fully vendored (a long time ago).
In principle I agree with the idea that we don't want to be in the business of maintaining other code... except when it's really important for us. DSSP is missing and it should be a core analysis. That does not preclude kits that may want to do fancy things or support all of the original DSSP classifiers, but we should ship a minimal base version that works out of the box.
In my view we can either make pydssp a core dep (seems like a lot of work, and given the lack of action on the PyDSSP repo we would probably effectively be maintaining the package) or we vendor pydssp_numpy.py and run the tests that @marinegor already wrote.
I'd be in favor of the second solution but I would not include the pydssp_numpy.py code into dssp.py as done at the moment. Instead I would create a subpackage analysis/dssp (similar to analysis/hydrogenbonds) and drop dssp.py and pydssp_numpy.py into that subpackage and keep them as separate as possible. I'd write dssp.py to use an installed pydssp if available and otherwise fall back to the vendored pydssp_numpy.py.
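The "installed pydssp if available, vendored fallback otherwise" idea could be sketched like this (the vendored module path follows the subpackage layout suggested above and is an assumption, not settled API):

```python
import importlib

def load_dssp_backend(vendored="MDAnalysis.analysis.dssp.pydssp_numpy"):
    """Return the pydssp implementation module to use.

    Prefers an upstream 'pydssp' installation if the user has one;
    otherwise falls back to the vendored numpy-only copy. The default
    vendored path is illustrative.
    """
    try:
        return importlib.import_module("pydssp")
    except ImportError:
        return importlib.import_module(vendored)
```

dssp.py would then call the backend's functions through the returned module, keeping the vendored file as close to upstream as possible.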
@orbeckst I like the sound of your second solution, I think covers all bases while maintaining lots of the great work done in this PR.
@orbeckst just to check - by suggesting that we vendor the code, are you saying that we should be making einops a core dependency of MDAnalysis? Having vendored code with optional deps is possible, but maybe breaks the point of trying to avoid an optional dep with another optional dep?
If we don't want to use einops then we could keep @marinegor 's workaround. I am not saying that we must keep pydssp_numpy.py bit-identical to the original, just as little changed as possible.
If we were to use einops more in other parts (which we could do, I think it's cool) then I don't think it would be a burden: when I looked at its pyproject.toml briefly it seemed to have no deps and treat every backend as optional (which may be an interesting trick for us to learn). However, until that's the case, keeping it optional seems totally fine to me.
thanks everyone for your input -- point by point below
@hmacdope
Try and convince upstream to make optional dependency (you can contact them on an issue or email or PR) likely an uphill fight though so I wouldn't waste too much time.
From what I'm seeing, pydssp is simply a weekend project that, even if we convince them to make changes, won't be supported (all the main commits were made within 2 days, followed by a single fix 3 months later, about 2 years ago).
@orbeckst
In principle I agree with the idea that we don't want to be in the business of maintaining other code... except when it's really important for us. DSSP is missing and it should be a core analysis. That does not preclude kits that may want to do fancy things or support all of the original DSSP classifiers, but we should ship a minimal base version that works out of the box.
I support this -- the existing ~100 lines of code that I took from there don't do anything fancy (e.g. don't distinguish 3-10 vs pi helices, or different beta-sheets), but they are a good basis for implementing those distinctions. That could be done in the future even within MDAnalysis, but it's unlikely that the author of the original package will do it.
I'd be in favor of the second solution but I would not include the pydssp_numpy.py code into dssp.py as done at the moment. Instead I would create a subpackage analysis/dssp (similar to analysis/hydrogenbonds) and drop dssp.py and pydssp_numpy.py into that subpackage and keep them as separate as possible. I'd write dssp.py to use an installed pydssp if available and otherwise fall back to the vendored pydssp_numpy.py.
That seems like a viable option to me.
If we don't want to use einops then we could keep @marinegor 's workaround. I am not saying that we must keep pydssp_numpy.py bit-identical to the original, just as little changed as possible.
I don't think there are any workarounds that allow running dssp without einops -- I haven't changed the code, so it's still dependent on their core functions. But I do agree that einops is cool, and having it as a dependency isn't a burden, simply by its design. Also, it's indeed a library, and will likely stay supported and tested, in contrast to pydssp, so having it as a dependency is certainly a better idea @IAlibay
To sum up, I'll try to:
- [x] move `analysis/dssp.py` into a separate submodule
- [x] re-write it to use an installed `pydssp` if possible, or fall back to the built-in functions vendored from there if not
- [x] try to get rid of `einops` if possible (I'll be writing some tests for the helper functions anyway, and will probably come up with something in the meantime)
If we don't want to use einops then we could keep @marinegor 's workaround. I am not saying that we must keep pydssp_numpy.py bit-identical to the original, just as little changed as possible.
If we were to use einops more in other parts (which we could do, I think it's cool) then I don't think it would be a burden: when I looked at its pyproject.toml briefly it seemed to have no deps and treat every backend as optional (which may be an interesting trick for us to learn). However, until that's the case, keeping it optional seems totally fine to me.
@orbeckst the main issue here is how well maintained that package is - even pure python packages go out of maintenance, see py3.12 and how breaking it was. Given that their CI doesn't even have an entry for py3.12 I'd be rather careful here.
the main issue here is how well maintained that package is - even pure python packages go out of maintenance, see py3.12 and how breaking it was. Given that their CI doesn't even have an entry for py3.12 I'd be rather careful here.
Fair point, although pure Python/noarch is probably still the one that is easiest to catch up. (Maybe make a PR for their CI to include 3.12.)
I don't think there are any workarounds that allow running dssp without einops -- I haven't changed the code, so it's still dependent on their core functions.
Thanks for the clarification (EDIT: Originally I thought that the dssp code could work without einops, but I hadn't looked carefully enough): The current DSSP implementation requires einops for `repeat` and `rearrange`. From a quick look at https://einops.rocks/#api , under numpy they correspond to
- `repeat` -> `np.tile(x, ...)`
- `rearrange` -> `x.reshape(x.shape[0], -1)` (depending on what you wanted to rearrange)
although changing py_dssp.py to fall back to numpy versions if einops doesn't exist will likely be pretty ugly. Then the question would be if it's even worth keeping einops around — I don't know if performance is an issue here, but readability is.
It sounds more sane to me to keep py_dssp.py around as written. Opinions?
Ultimately this is coming down to how we would want to handle the einops dependency. The safer approach is as an optional dependency (than as a core dep). I think that's ok for something in analysis. If we get feedback from users who are annoyed/confused if dssp doesn't work out of the box then we can rethink.
@IAlibay — if we make einops optional (in pyproject.toml and the pypi package), would we still make it a normal dependency in the conda package? How are we currently doing this for other optional dependencies?
just to check - by suggesting that we vendor the code, are you saying that we should be making einops a core dependency of MDAnalysis? Having vendored code with optional deps is possible, but maybe breaks the point of trying to avoid an optional dep with another optional dep?
Basically yes (EDIT: or no?): Vendor py_dssp.py (which is from a package that is not continuously maintained) and make it optionally dependent on einops (which looks pretty well maintained). (EDIT: so not make einops a core dependency, along the lines of keeping dependencies slim.)
EDIT: If someone writes a cython version for dssp then we can eventually slot that in to replace py_dssp.py but in order to provide some version of dssp this looks good enough to me.
@orbeckst
although changing py_dssp.py to fall back to numpy versions if einops doesn't exist will likely be pretty ugly. Then the question would be if it's even worth keeping einops around — I don't know if performance is an issue here, but readability is.
I don't think there is much of a performance issue -- it seems like under the hood einops just comes up with a pure numpy function based on tile/einsum/etc.; only the basic set of functions differs slightly from framework to framework.
In principle, it could be possible to get the intermediate representation -- I asked a question on their discussions about it. We'll see how it goes.
UPD: wow that was fast (8 minutes) -- there's no obvious way to "decompile" numpy backend, but you can do that with jax backend, and kinda guess how the numpy function would look like. This, combined with the fact that most of rearrange/repeat uses are associated with batching/re-batching, gives me hope that I can re-write that cleaner and without dependencies.
It sounds more sane to me to keep py_dssp.py around as written. Opinions?
To be honest, most of the einops operations are actually related to the fact that the code is batched, i.e. allows computation for multiple structures at a time, with either numpy or pytorch. If we remove that, the code will likely benefit both in readability and dependability -- einops won't be an issue anymore.
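As a sketch of what an unbatched version could look like (function name and layout are mine, not the PR's; the constants are the standard Kabsch-Sander electrostatic model that DSSP is built on), plain broadcasting replaces the batched einops juggling:

```python
import numpy as np

Q1Q2, F = 0.084, 332.0  # Kabsch-Sander electrostatic constants (kcal/mol)

def hbond_energy_map(n, h, c, o):
    """Pairwise Kabsch-Sander H-bond energies for ONE structure (no batch).

    n, h, c, o: (n_res, 3) arrays of backbone N, H, C, O positions.
    Entry [i, j] is the energy of donor i (N-H) with acceptor j (C=O):
    E = q1*q2 * (1/r_ON + 1/r_CH - 1/r_OH - 1/r_CN) * f
    """
    def dist(a, b):
        # (n,3), (m,3) -> (n,m) pairwise distances via broadcasting
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

    return Q1Q2 * F * (
        1 / dist(n, o) + 1 / dist(h, c) - 1 / dist(h, o) - 1 / dist(n, c)
    )
```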
In my personal opinion, einops is popular enough (7.6k stars, 20k dependents) to just have as a dependency.
Fair point, although pure Python/noarch is probably still the one that is easiest to catch up. (Maybe make a PR for their CI to include 3.12.)
This would be a good idea; there haven't been releases since October '23 and I don't see any mentions of Py3.12. We should at the very least check that the sdist builds on py3.12.
Hi @MDAnalysis/coredevs, re-writing with numpy turned out not to be that big a deal -- see the updated version.
I rewrote everything so that it uses pydssp if it's installed, otherwise falls back to my implementation, and added pydssp to pyproject.toml.
Now I have one issue with the code: if someone decides to do `pip install MDAnalysis[analysis]`, they will automatically pull pydssp, and hence torch, which is almost 1 GB. I really don't like this, since torch isn't used in this implementation at all.
I would probably suggest not having pydssp as a dependency in pyproject.toml, but leave the code as is -- i.e. if the user already has pydssp, it'll be used.
I would probably suggest not having pydssp as a dependency in pyproject.toml, but leave the code as is -- i.e. if the user already has pydssp, it'll be used.
You could make a special sub-section, something along the lines of extras or whatever fancy thing you want - then you can have it in pyproject.toml.
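Such an extras section might look like the fragment below (the group name and version bound are made up for illustration):

```toml
# pyproject.toml: a hypothetical extras group, so pydssp is only pulled
# in on explicit request, e.g. `pip install MDAnalysis[dssp]`
[project.optional-dependencies]
dssp = ["pydssp"]
```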
I would probably suggest not having pydssp as a dependency in pyproject.toml, but leave the code as is -- i.e. if the user already has pydssp, it'll be used.
I agree that we shouldn't pull pydssp under normal circumstances.
You can do what @IAlibay suggests.
Perhaps also add a section to the docs stating explicitly that you can pip install pydssp to use the native pydssp numpy version. For me it would be sufficient to have this in the docs and then I wouldn't need a pip extra section, unless we're doing one that installs every optional dependency anywhere in MDA.... maybe that would be useful, just so that it is possible to do a "batteries included" install.