mdanalysis
mdanalysis copied to clipboard
RDKit Descriptors and fingerprints wrapper
Part of the fixes for #2468 Depends on #2775
Changes made in this Pull Request:
- RDKit descriptors can be calculated from the
RDKitDescriptors
class, as a subclass of AnalysisBase - Fingerprints are available through a function
Quick example
I'm not sure if a dict for the results is ideal, as you can see passing several lambda functions will be problematic so maybe a list of (function_name, value)
tuples is better. Or we just don't care about lambdas ?
PR Checklist
- [ ] Tests?
- [ ] Docs?
- [ ] CHANGELOG updated?
- [x] Issue raised/referenced?
Ping @fiona-naughton @IAlibay @richardjgowers
Hello @cbouy! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
- In the file
package/MDAnalysis/analysis/RDKit.py
:
Line 25:80: E501 line too long (83 > 79 characters) Line 29:12: W291 trailing whitespace Line 30:57: W291 trailing whitespace Line 60:80: E501 line too long (80 > 79 characters) Line 131:80: E501 line too long (92 > 79 characters) Line 133:80: E501 line too long (88 > 79 characters) Line 134:80: E501 line too long (99 > 79 characters) Line 135:80: E501 line too long (112 > 79 characters) Line 138:80: E501 line too long (81 > 79 characters) Line 214:70: W291 trailing whitespace Line 223:68: W291 trailing whitespace Line 225:1: W293 blank line contains whitespace
- In the file
testsuite/MDAnalysisTests/analysis/test_rdkit.py
:
Line 123:1: E302 expected 2 blank lines, found 1 Line 141:80: E501 line too long (83 > 79 characters) Line 144:80: E501 line too long (93 > 79 characters) Line 148:80: E501 line too long (87 > 79 characters) Line 171:80: E501 line too long (81 > 79 characters) Line 182:80: E501 line too long (80 > 79 characters) Line 189:80: E501 line too long (87 > 79 characters) Line 190:80: E501 line too long (92 > 79 characters)
Comment last updated at 2021-04-28 18:40:44 UTC
Codecov Report
Merging #2912 (37755d7) into develop (a23b1e9) will increase coverage by
0.02%
. The diff coverage is100.00%
.
@@ Coverage Diff @@
## develop #2912 +/- ##
===========================================
+ Coverage 92.83% 92.85% +0.02%
===========================================
Files 170 171 +1
Lines 22809 22882 +73
Branches 3242 3260 +18
===========================================
+ Hits 21174 21247 +73
Misses 1587 1587
Partials 48 48
Impacted Files | Coverage Δ | |
---|---|---|
package/MDAnalysis/analysis/__init__.py | 100.00% <ø> (ø) |
|
package/MDAnalysis/analysis/RDKit.py | 100.00% <100.00%> (ø) |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update a23b1e9...37755d7. Read the comment docs.
I changed the results attribute to store an array of dict for each frame, and I added a static method to list available descriptors
Regarding the output given to users, I'm still scratching my head over 2 things:
For fingerprints, apart from hashed fp which have a predefined number of bits in length, and the MACCSKeys which is 167 bits long, it's usually a bad idea to convert them to an array as it will likely crash or hang forever because of memory consumption. RDKit typically stores them as a sparse vector which only knows which bit is activated and how many times. I'm wondering if a more appropriate output format could be a dict with only the bits that were set ?
For descriptors, maybe I can simply output an array of descriptor values instead of a dict or list of tuples with descriptors names ? The calculation will raise an error if a descriptor is not found and the descriptors are calculated in the order given by the user (if I change self._functions
to a list instead of a dict). This way there's no risk with using lambdas or accidentally naming your function the same way as an RDKit descriptor.
Regarding the output given to users, I'm still scratching my head over 2 things:
For fingerprints, apart from hashed fp which have a predefined number of bits in length, and the MACCSKeys which is 167 bits long, it's usually a bad idea to convert them to an array as it will likely crash or hang forever because of memory consumption. RDKit typically stores them as a sparse vector which only knows which bit is activated and how many times. I'm wondering if a more appropriate output format could be a dict with only the bits that were set ?
For descriptors, maybe I can simply output an array of descriptor values instead of a dict or list of tuples with descriptors names ? The calculation will raise an error if a descriptor is not found and the descriptors are calculated in the order given by the user (if I change
self._functions
to a list instead of a dict). This way there's no risk with using lambdas or accidentally naming your function the same way as an RDKit descriptor.
Maybe for simplicity's sake, we could just return fingerprints and descriptors in the exact same way RDKIT does (i.e. just return the object) for each frame (as a list)?
I think the point of the wrapper is to make things easy for people who are not necessarily familiar with RDKit, and the fingerprint object can be a bit intimidating and depends on the type of fingerprint requested (some will output an ExplicitBitVect object, some different subtypes of SparseIntVect which behaves differently) so that's why I was proposing an optional general purpose output (dict or array), but if users want the RDKit object they can still have it. Also I don't think any of the fingerprints I list here use 3D information so there's only one fp per atomgroup in the output (that's why it's in a function and not in an AnalysisBase object).
For descriptors, yeah it's basically just the value for each frame so I guess I'll simply make a 2D array out of that.
I think the point of the wrapper is to make things easy for people who are not necessarily familiar with RDKit, and the fingerprint object can be a bit intimidating and depends on the type of fingerprint requested (some will output an ExplicitBitVect object, some different subtypes of SparseIntVect which behaves differently) so that's why I was proposing an optional general purpose output (dict or array), but if users want the RDKit object they can still have it.
I'm not sure, although a general purpose would be more "user friendly", at the end of the day I feel like users should be sufficiently versed in RDKit if they are looking to be using these options. To me, offering an object conversion layer not only leads to more work, but eventually more confusion for users. I'll ping @fiona-naughton and @richardjgowers though, who might have different views here.
Also I don't think any of the fingerprints I list here use 3D information so there's only one fp per atomgroup in the output (that's why it's in a function and not in an AnalysisBase object).
Yep, that makes sense.
@cbouy if you can update this against develop I'll put this next up on my review list.
@IAlibay just updated, looks like the error is related to #2958 "as usual" 😅
@IAlibay this is still listed for 2.0 — is this realistic and essential?
@cbouy how much needs to be done here to complete it?
@IAlibay this is still listed for 2.0 — is this realistic and essential?
Essential - probably not. I added in all the RDKits because it would be really great to get them out there. I'll let @cbouy speak on how realistic this is (for this and the other opened RDKit PRs).
If it helps, I'd be happy with a relatively rapid 2.1.0 release that adds new RDKit features.
I think this is the RDKit PR with the lowest priority and I don't mind if it's not in 2.0 In terms of remaining work it's mostly an API question for the rdkit fingerprint wrapper: do we bother keeping options to convert the rdkit fingerprints to more "traditional" formats (dict and numpy array) or do we only return the rdkit object ? There are different formats for the rdkit object depending on the fingerprint (explicit bit, sparse int, and others i think) that's why I implemented the conversion to dict and array in the first place to make it simpler to use.
I think this is the RDKit PR with the lowest priority and I don't mind if it's not in 2.0 In terms of remaining work it's mostly an API question for the rdkit fingerprint wrapper: do we bother keeping options to convert the rdkit fingerprints to more "traditional" formats (dict and numpy array) or do we only return the rdkit object ? There are different formats for the rdkit object depending on the fingerprint (explicit bit, sparse int, and others i think) that's why I implemented the conversion to dict and array in the first place to make it simpler to use.
@cbouy for the PRs that are opened, which ones (if any) are critical for 2.0?
Critical PRs would be #3044 and #3324 #3325 would be nice to have in 2.0 as well #2912 (this one) can come later and finally #2900 which is very optional (and possibly abandoned?)