crem
crem copied to clipboard
Parameters relevant for speed in mutate_mol
Hi! Thanks for maintaining this repo! I was wondering which parameters were relevant for speeding up mutate_mol
? I've found better speeds by increasing n_cores
but if I wanted to speed it up even more would decreasing radius
help too?
Hi! It depends what is the context of "speed up".
- I tried to make it fast but there is definitely space for improvement. However, each next improvement of speed require more efforts :)
- The only way how to truly increase the overall speed without code modifications is to increase the number of cores.
- But the speed can be increased by decreasing the number of replacements. If this is acceptable you may:
a) set
max_replacements
to 100 or 1000 (it will choose random replacements from all available) b) setmin_freq
to 5, 10 or more (to use only more frequently occurred replacements) c) increase theradius
(but this will make more conservative replacements) d) play with parameters controlling the size of replaced and replacing fragments (decreasing fragment size you will make smaller steps in chemical space) e) use a smaller fragment database I think that (a) is the most reasonable from this list followed by (b). (c)-(e) choices depend on a project.
Thanks! Feel free to close this issue. But on a related note what work do you think needs to be done to speed it up more? Would it be possible to leverage GPUs? Feel free to email me, the team I am working with would love to help!
I have to profile code to remember all details and pitfalls. I'll post those results here to enable a more objective discussion.
What I remember is that I implemented a brute force replacement procedure which makes multiple replacements leading to the same compound. Therefore, duplicates are generated and filtered internally. This is definitely waste of time. But I do not remember the proportion of duplicates and I do not know how to easily avoid this. For example, we have a compound X-CH2-CH3 and make replacement CH3>>NH2 and CH2-CH3>>CH2-NH2. They both will give X-CH2-NH2.
I'm not sure that GPU will help much, because the code extensively uses RDKit and there is no GPU support in RDKit (https://github.com/rdkit/rdkit/issues/3537) and separating GPU-friendly code seems too difficult.
From time to time I thought about substantial refactoring and change of the architecture and algorithms. But this will require a lot of efforts and may result in a new implementation :)
I put here results of tests and some conclusions. I currently do not consider speed up as a priority issue because we use relatively slow scorings and structure generation is not a limitation step. But for fast scorings this can be already an issue and if this can be improved by low cost we can do that.
The amount of duplicated generated structures is up to 65% per molecule that is too much that I expected.
The major function consuming 90% of runtime is __frag_replace
. There are four most time consuming blocks (sorry, I just noticed that I run code on local crem version, so line numbers should be reduced on 13):
-
set_protected_atoms
(line 212) - 25% of time -
RunReactants
(line 224) - apply reaction SMARTS - 10% -
SanitizeMol
(line 227) - it never returned errors but I think it is required to recalculate internal states of Mol object to avoid errors in future uses (do not remember exactly) - 10% -
Chem.MolToSmiles(Chem.RemoveHs(p), isomericSmiles=True)
(line 232) - convert to smiles to identify duplicates - 40%
Conversion to smiles takes most of the time. I expect that RemoveHs is also expensive because create a new Mol object. This step can be replaced with inchi to identify duplicates (could be faster and no need to address the issue of explicit hydrogens). But as a main output of all functions I use smiles and their generation is inevitable or we need to change the output of all functions to return not smiles but Mol objects and transfer responsibility to manage explicit hydrogens on a user. Changing the interface seems not a good idea but if this will speed up things greatly we can do that. Use SMILES as a major output probably was not a great idea.
set_protected_atoms
looks quite complex and may be simplified to speed up.
remove sanitization step is questionable, because again this transfers responsibility to a user to manage generated Mol objects and solve issues with them.
Concluding remarks:
- avoid generation of duplicates can substantially reduce runtime. I have no solution currently.
- I would like to avoid changes in API to not break compatibility but if this will greatly speed up things I'm ready to do that.
- I would not like to shift responsibility to a user to manage internal states of generated molecules and catch other errors. Just dislike to write repeatedly the same code and remember of these potential issues.
Alternative "solution".
I also considered to completely refuse from generation of compounds through RDKit reactions and make replacements manually because ids of removing atoms are known and we need to take the remaining part and link new atoms. This will completely remove the first two points from the list (this may speed up things but not necessary), but sanitization and generation of smiles will probably remain. This will also require to change other parts of code (at least __fragment_mol
to get ids of attachment points of the context and a whole remaining part).
Any suggestions/ideas are welcome.
Investigating issue with duplicates I found and fixed the bug of generation of duplicated fragments before their mutation. All fragments with 1 or 2 attachment points were duplicated, thus twice number of replacements were performed and they resulted in the same molecules. This fix gave x2 speed up. Now the number of duplicates in generated molecules is from 0 to 32% on test samples.
Thanks for working on this!
If I was only interested in the rdkit mol objects, would I able to get even more speed by not doing the Chem.MolToSmiles(...)
conversion on line 221? Currently, I'm only using mutate_mol
with the parameter return_mol
activated.
Yes, this should speed up on 30-40%, but there will be about 20-25% (in average) of duplicated molecules in the output. If you are ok with them you may remove generation of smiles and change the further code accordingly.
I'm still thinking how to improve the situation, but I do not see a way to avoid substantial changes. The worst thing is that it is difficult to estimate whether these changes will speed up whole generation or not.
Would it be faster to use fingerprints to find duplicates instead of SMILE strings?
I do not know you may try to measure performance. But fingerprints have collisions and thus some molecules will be falsely identified as duplicates. So some compounds will be erroneously discardred. The number of such false duplicates depends on fingerprint type and structures. I never explored this issue.