mdanalysis icon indicating copy to clipboard operation
mdanalysis copied to clipboard

RDKit Descriptors and fingerprints wrapper

Open cbouy opened this issue 4 years ago • 14 comments

Part of the fixes for #2468 Depends on #2775

Changes made in this Pull Request:

  • RDKit descriptors can be calculated from the RDKitDescriptors class, as a subclass of AnalysisBase
  • Fingerprints are available through a function

Quick example image

I'm not sure if a dict for the results is ideal, as you can see passing several lambda functions will be problematic so maybe a list of (function_name, value) tuples is better. Or we just don't care about lambdas ?

PR Checklist

  • [ ] Tests?
  • [ ] Docs?
  • [ ] CHANGELOG updated?
  • [x] Issue raised/referenced?

Ping @fiona-naughton @IAlibay @richardjgowers

cbouy avatar Aug 17 '20 18:08 cbouy

Hello @cbouy! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 25:80: E501 line too long (83 > 79 characters) Line 29:12: W291 trailing whitespace Line 30:57: W291 trailing whitespace Line 60:80: E501 line too long (80 > 79 characters) Line 131:80: E501 line too long (92 > 79 characters) Line 133:80: E501 line too long (88 > 79 characters) Line 134:80: E501 line too long (99 > 79 characters) Line 135:80: E501 line too long (112 > 79 characters) Line 138:80: E501 line too long (81 > 79 characters) Line 214:70: W291 trailing whitespace Line 223:68: W291 trailing whitespace Line 225:1: W293 blank line contains whitespace

Line 123:1: E302 expected 2 blank lines, found 1 Line 141:80: E501 line too long (83 > 79 characters) Line 144:80: E501 line too long (93 > 79 characters) Line 148:80: E501 line too long (87 > 79 characters) Line 171:80: E501 line too long (81 > 79 characters) Line 182:80: E501 line too long (80 > 79 characters) Line 189:80: E501 line too long (87 > 79 characters) Line 190:80: E501 line too long (92 > 79 characters)

Comment last updated at 2021-04-28 18:40:44 UTC

pep8speaks avatar Aug 17 '20 18:08 pep8speaks

Codecov Report

Merging #2912 (37755d7) into develop (a23b1e9) will increase coverage by 0.02%. The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2912      +/-   ##
===========================================
+ Coverage    92.83%   92.85%   +0.02%     
===========================================
  Files          170      171       +1     
  Lines        22809    22882      +73     
  Branches      3242     3260      +18     
===========================================
+ Hits         21174    21247      +73     
  Misses        1587     1587              
  Partials        48       48              
Impacted Files Coverage Δ
package/MDAnalysis/analysis/__init__.py 100.00% <ø> (ø)
package/MDAnalysis/analysis/RDKit.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update a23b1e9...37755d7. Read the comment docs.

codecov[bot] avatar Aug 17 '20 19:08 codecov[bot]

I changed the results attribute to store an array of dict for each frame, and I added a static method to list available descriptors

cbouy avatar Aug 18 '20 16:08 cbouy

Regarding the output given to users, I'm still scratching my head over 2 things:

For fingerprints, apart from hashed fp which have a predefined number of bits in length, and the MACCSKeys which is 167 bits long, it's usually a bad idea to convert them to an array as it will likely crash or hang forever because of memory consumption. RDKit typically stores them as a sparse vector which only knows which bit is activated and how many times. I'm wondering if a more appropriate output format could be a dict with only the bits that were set ?

For descriptors, maybe I can simply output an array of descriptor values instead of a dict or list of tuples with descriptors names ? The calculation will raise an error if a descriptor is not found and the descriptors are calculated in the order given by the user (if I change self._functions to a list instead of a dict). This way there's no risk with using lambdas or accidentally naming your function the same way as an RDKit descriptor.

cbouy avatar Aug 20 '20 13:08 cbouy

Regarding the output given to users, I'm still scratching my head over 2 things:

For fingerprints, apart from hashed fp which have a predefined number of bits in length, and the MACCSKeys which is 167 bits long, it's usually a bad idea to convert them to an array as it will likely crash or hang forever because of memory consumption. RDKit typically stores them as a sparse vector which only knows which bit is activated and how many times. I'm wondering if a more appropriate output format could be a dict with only the bits that were set ?

For descriptors, maybe I can simply output an array of descriptor values instead of a dict or list of tuples with descriptors names ? The calculation will raise an error if a descriptor is not found and the descriptors are calculated in the order given by the user (if I change self._functions to a list instead of a dict). This way there's no risk with using lambdas or accidentally naming your function the same way as an RDKit descriptor.

Maybe for simplicity's sake, we could just return fingerprints and descriptors in the exact same way RDKIT does (i.e. just return the object) for each frame (as a list)?

IAlibay avatar Aug 20 '20 13:08 IAlibay

I think the point of the wrapper is to make things easy for people who are not necessarily familiar with RDKit, and the fingerprint object can be a bit intimidating and depends on the type of fingerprint requested (some will output an ExplicitBitVect object, some different subtypes of SparseIntVect which behaves differently) so that's why I was proposing an optional general purpose output (dict or array), but if users want the RDKit object they can still have it. Also I don't think any of the fingerprints I list here use 3D information so there's only one fp per atomgroup in the output (that's why it's in a function and not in an AnalysisBase object).

For descriptors, yeah it's basically just the value for each frame so I guess I'll simply make a 2D array out of that.

cbouy avatar Aug 20 '20 14:08 cbouy

I think the point of the wrapper is to make things easy for people who are not necessarily familiar with RDKit, and the fingerprint object can be a bit intimidating and depends on the type of fingerprint requested (some will output an ExplicitBitVect object, some different subtypes of SparseIntVect which behaves differently) so that's why I was proposing an optional general purpose output (dict or array), but if users want the RDKit object they can still have it.

I'm not sure, although a general purpose would be more "user friendly", at the end of the day I feel like users should be sufficiently versed in RDKit if they are looking to be using these options. To me, offering an object conversion layer not only leads to more work, but eventually more confusion for users. I'll ping @fiona-naughton and @richardjgowers though, who might have different views here.

Also I don't think any of the fingerprints I list here use 3D information so there's only one fp per atomgroup in the output (that's why it's in a function and not in an AnalysisBase object).

Yep, that makes sense.

IAlibay avatar Aug 22 '20 15:08 IAlibay

@cbouy if you can update this against develop I'll put this next up on my review list.

IAlibay avatar Apr 22 '21 20:04 IAlibay

@IAlibay just updated, looks like the error is related to #2958 "as usual" 😅

cbouy avatar Apr 29 '21 08:04 cbouy

@IAlibay this is still listed for 2.0 — is this realistic and essential?

@cbouy how much needs to be done here to complete it?

orbeckst avatar May 30 '21 02:05 orbeckst

@IAlibay this is still listed for 2.0 — is this realistic and essential?

Essential - probably not. I added in all the RDKits because it would be really great to get them out there. I'll let @cbouy speak on how realistic this is (for this and the other opened RDKit PRs).

If it helps, I'd be happy with a relatively rapid 2.1.0 release that adds new RDKit features.

IAlibay avatar Jun 01 '21 11:06 IAlibay

I think this is the RDKit PR with the lowest priority and I don't mind if it's not in 2.0 In terms of remaining work it's mostly an API question for the rdkit fingerprint wrapper: do we bother keeping options to convert the rdkit fingerprints to more "traditional" formats (dict and numpy array) or do we only return the rdkit object ? There are different formats for the rdkit object depending on the fingerprint (explicit bit, sparse int, and others i think) that's why I implemented the conversion to dict and array in the first place to make it simpler to use.

cbouy avatar Jun 01 '21 12:06 cbouy

I think this is the RDKit PR with the lowest priority and I don't mind if it's not in 2.0 In terms of remaining work it's mostly an API question for the rdkit fingerprint wrapper: do we bother keeping options to convert the rdkit fingerprints to more "traditional" formats (dict and numpy array) or do we only return the rdkit object ? There are different formats for the rdkit object depending on the fingerprint (explicit bit, sparse int, and others i think) that's why I implemented the conversion to dict and array in the first place to make it simpler to use.

@cbouy for the PRs that are opened, which ones (if any) are critical for 2.0?

IAlibay avatar Jun 01 '21 12:06 IAlibay

Critical PRs would be #3044 and #3324 #3325 would be nice to have in 2.0 as well #2912 (this one) can come later and finally #2900 which is very optional (and possibly abandoned?)

cbouy avatar Jun 01 '21 12:06 cbouy