mdanalysis
mdanalysis copied to clipboard
modernize HBA to use AG as objects internally instead of selection strings
Is your feature request related to a problem?
modernize HBA (HydrogenBondAnalysis) to use AG (AtomGroups) as objects internally instead of selection strings to avoid the need for HBA analysis to discern the relevant atoms.
Describe the solution you'd like
AG as objects internally instead of selection strings
Additional context
See PR Hbond update #3848 (in particular https://github.com/MDAnalysis/mdanalysis/pull/3848#discussion_r1014548556)
Guidance for new contributors
This is a reasonably complicated issue to solve for a beginner. We recommend you first learn about the problem. Ask questions here. The more specific your question, then better we can help.
As the first steps you should:
- Familiarize yourself with AtomGroups and UpdatingAtomGroups. Understand the difference between the two.
- Understand the difference between using selection strings to describe groups of atoms and using AtomGroups for the same purpose.
- Read the code for HydrogenBondAnalysis in package/MDAnalysis/analysis/hydrogenbonds/hbond_analysis.py. Be able to identify the pieces of code that use selection strings.
EDITS
- made issue clearer for new contributors (@orbeckst )
Can we offer this issue to GSOC applicants? Opinion @hmacdope ?
Why not, would be a good fix.
Context: came up with https://github.com/MDAnalysis/mdanalysis/pull/3848#discussion_r1014548556
Hi so i read the HydrogenBondAnalysis code and find this piece of code using selection string
import MDAnalysis from MDAnalysis.analysis.hydrogenbonds.hbond_analysis import HydrogenBondAnalysis as HBA u = MDAnalysis.Universe(psf, trajectory) hbonds = HBA(universe=u) hbonds.hydrogens_sel = hbonds.guess_hydrogens("protein") hbonds.acceptors_sel = hbonds.guess_acceptors("protein") hbonds.run()
To convert the selected string into AtomGroup Object,we can change this code like this import MDAnalysis from MDAnalysis.analysis.hydrogenbonds.hbond_analysis import HydrogenBondAnalysis as HBA u = MDAnalysis.Universe(psf, trajectory) hbonds = HBA(universe=u) hydrogens = u.select_atoms("protein") acceptors = u.select_atoms("protein*") hbonds.hydrogens_sel = hydrogens hbonds.acceptors_sel = acceptors hbonds.run()
So here i use select_atoms() function to convert selection string to atom group object.
So i m going to give pull request by converting selection string to atomgroup object in hbond_analysis.py,So if i am approaching correct. please merge my pull request so i have contribution for GSOC
@utkarsh147-del the select_atoms
function selects the specified atoms, while the hbonds.guess_hydrogens
and hbonds.guess_acceptors
functions guess which hydrogen atoms and which acceptor atoms should be used in the hydrogen bond analysis. In your suggested change, you are completely changing the meaning of the code, since you are using select_atoms("protein")
which selects all protein atoms. Have you tried to run the original code and your suggested change, and see what happens?
Additionally, you are suggesting to change code in the docstring. However, this issue concerns the changes to the code (not the documentation, which will nonetheless need to reflect the changes in the code) needed to use atom groups internally instead of selection strings. The fact that you identified hbonds.guess_hydrogens
and hbonds.guess_acceptors
as returning a selection string instead of an AtomGroup object should give you a hint on where to look further.
In addition to the first steps described above, please familiarise yourself with the HydrogenBondAnalysis
class and its use (you can have a look at the user guide), to make sure you understand what it does, before trying to identify pieces of code that use selection strings that will need to be changed to use AtomGroups.
Finally, please use Code and Syntax Highlighting when writing code on GitHub, which makes it far more readable.
Do not hesitate to ask questions and clarifications, before jumping into opening a PR.
@utkarsh147-del the
select_atoms
function selects the specified atoms, while thehbonds.guess_hydrogens
andhbonds.guess_acceptors
functions guess which hydrogen atoms and which acceptor atoms should be used in the hydrogen bond analysis. In your suggested change, you are completely changing the meaning of the code, since you are usingselect_atoms("protein")
which selects all protein atoms. Have you tried to run the original code and your suggested change, and see what happens?Additionally, you are suggesting to change code in the docstring. However, this issue concerns the changes to the code (not the documentation, which will nonetheless need to reflect the changes in the code) needed to use atom groups internally instead of selection strings. The fact that you identified
hbonds.guess_hydrogens
andhbonds.guess_acceptors
as returning a selection string instead of an AtomGroup object should give you a hint on where to look further.In addition to the first steps described above, please familiarise yourself with the
HydrogenBondAnalysis
class and its use (you can have a look at the user guide), to make sure you understand what it does, before trying to identify pieces of code that use selection strings that will need to be changed to use AtomGroups.Finally, please use Code and Syntax Highlighting when writing code on GitHub, which makes it far more readable.
Do not hesitate to ask questions and clarifications, before jumping into opening a PR.
Hi,I understand what you want to convey to me except one thing what do you mean by docstring here,I try to make changes in code which the hbond.py link given>Do we need to make this change in that file or some another python fie.Please clarify me this
@utkarsh147-del by docstring I mean the Python docstring of the hbond_analysis.py
module. This is what you are suggesting to change in your comment. Therefore, the commit on your fork of MDAnalysis only changes such doctoring, which is what constitutes the documentation of the hbond_analysis.py
module.
hbond_analysis.py
is indeed the file that needs to be changed, but the required changes consist in using AtomGroups internally (within the classes and methods of this module) instead of selection strings, not just changing the doctoring/documentation.
I hope this clarifies things.
Hi @hmacdope @orbeckst, I would like to try this issue. Is it possible for me to work on this one?
@SophiaRuan I don't see a linked PR so if you submit a PR that references this issue, we'll review it.
Hi @orbeckst @hmacdope, I have been working on this issue recently. Do we need to change these parameters (donors_sel
, hydrogens_sel
, acceptors_sel
, between
) which have selection strings as input to Atomgroups as input and change the code accordingly so that it will take ag without any bugs?
No, for right now, input selection strings are ok. If you follow https://github.com/MDAnalysis/mdanalysis/pull/3848#discussion_r1014548556 then you see it's about internal use of selection strings.
I have been working on this issue recently
I don't see a linked PR. When you create a PR, make sure that you include issue number #3933 in the Fixes #3933
field of the template PR comment.
Can we offer this issue to GSOC applicants? Opinion @hmacdope ?
No, for right now, input selection strings are ok. If you follow #3848 (comment) then you see it's about internal use of selection strings.
I have been working on this issue recently
I don't see a linked PR. When you create a PR, make sure that you include issue number #3933 in the
Fixes #3933
field of the template PR comment.
For my solution I made the HydrogenBondAnalysis class also accept a list of atom indices and positions as an alternative to selecting atoms via string. I don't know if this solves the issue
If you have a pull request with code up then make sure that the first line of your comment reads
Fixes #3933
i.e., fill in the template that is shown to you when you create the PR.
This will link your PR to this issue. And unless we see this link, we will probably not see or review a PR.
@yickysan
For my solution I made the HydrogenBondAnalysis class also accept a list of atom indices and positions as an alternative to selecting atoms via string. I don't know if this solves the issue
That's not required and in fact changes the user interface. We do not change the user interface lightly because it can create incompatibilities. It's better to focus a PR on the smallest amount of changes necessary to solve the issue. If you feel that the UI should be changed, raise a specific issue for that case and then it can be discussed and addressed separately.
In order to keep work (and reviewing) manageable, we very much try to keep PRs as small and as self-contained as possible.
@yickysan
For my solution I made the HydrogenBondAnalysis class also accept a list of atom indices and positions as an alternative to selecting atoms via string. I don't know if this solves the issue
That's not required and in fact changes the user interface. We do not change the user interface lightly because it can create incompatibilities. It's better to focus a PR on the smallest amount of changes necessary to solve the issue. If you feel that the UI should be changed, raise a specific issue for that case and then it can be discussed and addressed separately.
In order to keep work (and reviewing) manageable, we very much try to keep PRs as small and as self-contained as possible.
Thank you for reaching back. The new changes I've made, if there isn't a specific select string, the methods should use ag.atoms
instead of `ag.select_atoms("all"). This will only be used when we have soemthing like "resname .... and name ..."
@yickysan , if you show your code in a PR we can comment on it; discussions with code present are more productive
I would like to work on this issue for my GSoC application @hmacdope @orbeckst. I am thankfully familiar with both selections strings and AtomGroups from being an MDAnalysis user. I will start tackling the internal use of selection strings and converting them over to AtomGroups. I'll update this issue thread with any questions I have as they rise up.
Sounds good,
@orbeckst @IAlibay we will need to deprecate the old API as well yes?
I am going to jot down my rough plan here before I get started on digging around in the code, to make sure I have an appropriate plan. If I disregard guess_acceptors
, guess_donors
, and guess_hydrogens
, I can't see an awful lot of selection string usage, but from what I can find here are my ideas.
- Change return types of
guess_acceptors
,guess_donors
, andguess_hydrogens
to AtomGroups instead of selection strings (doctoring examples will need to be updated to reflect this), as well as their internal code that uses selection strings more than once. As part of this (or for a later issue), should we create some kind of setter methods for changing these properties (rather than allowing users to just access and change the stored selection strings, as per some of the docstring examples)? - Update
_prepare
, as will no longer need to use a selection string to get the acceptors AtomGroup - Update
_get_dh_pairs
to use acceptors and hydrogens AtomGroups defined in_prepare
, instead of selection strings. Only donor's will still need a selection string as they must be matched to the hydrogen atom groups, so bonding can still be seen by "zipping" atom groups
Please let me know if I am missing anything crucial, otherwise I will get started tomorrow 😊
Sounds good,
@orbeckst @IAlibay we will need to deprecate the old API as well yes?
Yeah we should keep it around until 3.0, it's a highly used method and folks won't be too happy if the API changes on them.
I would also have a question regarding this issue.
Since as @lunamorrow the main cases where string selection is used are guess_acceptors
, guess_donors
and guess_hydrogens
, if we return them as AtomGroups it would allow us to use them later on internally as AtomGroups in _prepare
.
As far as i understood you would want to retain the user input, so that he could string selection as an input for HBA and don't change the input of the user to AtomGroups, which means _prepare
gets the information either through guess_acceptors
as an AtomGroup or from the user as a selection string, which could lead to a problem in the case of the selection, where the user would access the guess_acceptors
to get the acceptors and then add manually additional cases for example:
import MDAnalysis as mda
u = mda.Universe(PSF,DCD)
hbonds = HydrogenBondAnalysis(universe=u)
protein_acceptors_sel = hbonds.guess_acceptors("protein")
water_acceptors_sel = "resname TIP3 and name OH2"
hbonds.acceptors_sel = f"({protein_acceptors_sel}) or ({water_acceptors_sel} and around 10 not resname TIP3)"
which would deliver you hbonds.acceptors_sel
as a combination of the AtomGroup from guess_acceptors
and a string from water_acceptors_sel
.
It should be possible to put a condition that would recognize such cases and split up then the input of the user into parts and them treat them separately i think.
I am going to jot down my rough plan here before I get started on digging around in the code, to make sure I have an appropriate plan. If I disregard
guess_acceptors
,guess_donors
, andguess_hydrogens
, I can't see an awful lot of selection string usage, but from what I can find here are my ideas.
- Change return types of
guess_acceptors
,guess_donors
, andguess_hydrogens
to AtomGroups instead of selection strings (doctoring examples will need to be updated to reflect this), as well as their internal code that uses selection strings more than once.
Yep makes sense to me.
As part of this (or for a later issue), should we create some kind of setter methods for changing these properties (rather than allowing users to just access and change the stored selection strings, as per some of the docstring examples)?
The stored selection strings will be replaced with stored atom groups, but the user should still be able to set them IMO (to an atom group)
- Update
_prepare
, as will no longer need to use a selection string to get the acceptors AtomGroup Yep- Update
_get_dh_pairs
to use acceptors and hydrogens AtomGroups defined in_prepare
, instead of selection strings. Only donor's will still need a selection string as they must be matched to the hydrogen atom groups, so bonding can still be seen by "zipping" atom groups
Should be able to check the indices on two atom groups, but lets see how we go.
Please let me know if I am missing anything crucial, otherwise I will get started tomorrow 😊
You will need to do 2x PRs, 1 to deprecate current API, adding a deprecation warning, and the other to update the functionality. The update to AGs will then need to be held until the next major release cycle. This all still counts for GSOC etc even if PR not merged.
I would also have a question regarding this issue.
Since as @lunamorrow the main cases where string selection is used are
guess_acceptors
,guess_donors
andguess_hydrogens
, if we return them as AtomGroups it would allow us to use them later on internally as AtomGroups in_prepare
.
Correct
As far as i understood you would want to retain the user input, so that he could string selection as an input for HBA and don't change the input of the user to AtomGroups, which means
_prepare
gets the information either throughguess_acceptors
as an AtomGroup or from the user as a selection string, which could lead to a problem in the case of the selection, where the user would access theguess_acceptors
to get the acceptors and then add manually additional cases for example:
import MDAnalysis as mda
u = mda.Universe(PSF,DCD)
hbonds = HydrogenBondAnalysis(universe=u)
protein_acceptors_sel = hbonds.guess_acceptors("protein")
water_acceptors_sel = "resname TIP3 and name OH2"
hbonds.acceptors_sel = f"({protein_acceptors_sel}) or ({water_acceptors_sel} and around 10 not resname TIP3)"
which would deliver you
hbonds.acceptors_sel
as a combination of the AtomGroup fromguess_acceptors
and a string fromwater_acceptors_sel
.It should be possible to put a condition that would recognize such cases and split up then the input of the user into parts and them treat them separately i think.
Yes, either we will need to insist on all AtomGroups
for all selections or have functionality that uses and isinstance
check to handle the overload of the two. In general we are trying to move towards AtomGroups
as the functionality that we pass around, so that would be my preference, but we can explore options.
@hmacdope Thanks for the reply.
I was tinkering with the guess_acceptors
to see if i can convert them to AtomGroups and also then adjust the user input, so that it wouldn't lead to problems regarding the user input.
For the cases where the user is only using either a selection string or an AtomGroup as input it is fine and possible to differentiate between them with isinstance
, although i did use type
for it, but the difficulty is if they are combined. I managed to get them through specific conditions, but it is too condition heavy i would say and definitely not the optimal way.
I only did it for the acceptors and since me and @lunamorrow are both interested in addressing this issue and both applied for GSOC if it would be fine that i purpose a partial enhancement with the change of acceptors to AG and she does the other parts of the code? although i would say that this issue will anyway not be merged till the deadline, since it could quite delicate with the adjustment of the user input.
@talagayev, I appreciate that GSOC starter issues are in short supply, however given that the work you propose is within the scope of that already proposed by @lunamorrow, I would say they have first chance at making the changes (unless you two can come to an agreement, but I would advise against this as two split PRs will be more compilicated).
For other issues to tackle I would suggest https://github.com/MDAnalysis/mdanalysis/issues/3937 or https://github.com/MDAnalysis/mdanalysis/issues/3811
@hmacdope i agree, since the code would be in the same direction as proposed by @lunamorrow and splitting up the PRs could be tricky, it would be only fair that she works on this issue, while i would focus on the issues that you suggested. I would probably start with #3811.
I've just finished a big day of uni, so I'll get started on these changes tomorrow now that @talagayev and I have a project each. I will make sure my PR addresses both my previous suggestions and the ones made by Valerij. To clarify/confirm, am I good to change selections across to AtomGroups (as the selection string API will be depreciated)? Therefore, we need to make sure the API reflects users being able to update selections by essentially adding a new AtomGroup (rather than passing a selection string) and combining them with no duplications. Yes?
Thanks for the clarification and direction @hmacdope
The stored selection strings will be replaced with stored atom groups, but the user should still be able to set them IMO (to an atom group)
Ok this makes sense, as from my usage of MDAnalysis thus far it's quite easy and user-friendly to grab and manipulate stored variables.
Should be able to check the indices on two atom groups, but lets see how we go.
Oh yes true! I'll give that a go to really eliminate usage of selection strings :)
You will need to do 2x PRs, 1 to deprecate current API, adding a deprecation warning, and the other to update the functionality. The update to AGs will then need to be held until the next major release cycle. This all still counts for GSOC etc even if PR not merged.
I may need some help with the fiddly aspects of this, as I am very much figuring out how to do things, like making PRs, as I go. Are there any resources that you can point me towards, so that I can ensure I do this correctly from the get go?
@lunamorrow we the feature-branch
model detailed here: https://www.atlassian.com/git/tutorials/comparing-workflows/feature-branch-workflow
Basically,
- Make a fork of MDAnalysis
- Make a branch off
develop
in your fork (git checkout develop && git checkout -b my_cool_branch
) - Add and commit your changes (
git add hbonds.py
, `git commit -m 'changed hbonds') - Push your changes to your fork (
git push origin
) - Repeat as nessecary
- Make a pull request on the MDAnalysis
pull-requests
web interface.
@hmacdope Great, I have setup a fork on my GitHub repository already and created a new branch for the changes. I will work on the actual code change PR first and then push that, before I start on the PR for the API updates. I expect I will need some guidance with the API PR to ensure I match the format used by MDAnalysis for API depreciation warnings, but I will reach out when I get to this stage.
Just checking, will I need to make a PR with extra test cases in MDAnalysisTests? I expect we may need more for user interactions with guess_acceptors
, guess_donors
, and guess_hydrogens
, but I am not sure if this would already be covered by existing test cases?
Some more questions as I go:
Is there a reason that self.donors_sel
is not set in _prepare
the same way that self.hydrogens_sel
and self.acceptors_sel
are? self.donors_sel
is never set internally, so it will always be None unless it is passed in by the user during creation of a HydrogenBondAnalysis
Object. Is that what we want, or can I enable it to be set in _prepare
if it is not already defined during initialisation? I have noticed this same check if not (hasattr(self.u._topology, 'bonds') and len(self.u._topology.bonds.values) != 0):
is done in both guess_donors
and _guess_dh_pairs
, so to me it makes sense for me to condense this and simply set self.donors_sel
in _prepare
so the extra checks are avoided.