biosimspace icon indicating copy to clipboard operation
biosimspace copied to clipboard

Faster MCS Calculation

Open kexul opened this issue 2 years ago • 7 comments

The current matchAtoms() uses the entire molecule to get MCS, a simple idea to accelerate it is to get MCS for heavy atoms and then match up hydrogens hanging off the MCS. See this implementation for example: https://github.com/OpenFreeEnergy/Lomap/blob/84e1bb0d20ceeed835166fe7218dd21f04248030/lomap/mcs.py#L1170

I didn't quantitatively test the speedup(no benchmark found) but it performed very well for large molecules in my project ( the current matchAtoms() often times out).

kexul avatar Feb 14 '23 07:02 kexul

Hi @kexul - have you tried matchAtoms on molecules without hydrogens, and then using the resulting MCS to build a prematch to pass as keyword argument for another call to matchAtoms using the complete molecules ? I think that could work. Note that this approach may not give you the same MCS that you get calling matchAtoms() on the full molecules. If you can share examples of molecules that are slow to matchI may be able to advise on ways to speed things up.

jmichel80 avatar Feb 14 '23 07:02 jmichel80

Hi @kexul - have you tried matchAtoms on molecules without hydrogens, and then using the resulting MCS to build a prematch to pass as keyword argument for another call to matchAtoms using the complete molecules ?

Hi @jmichel80 - Yes! I've used this approach for a long time. But it often leads to prematch not found in rdkit search results and then Sire is used to find the full MCS (add hydrogen to MCS). I'm not sure what algorithm Sire is used, but it's painfully slow.

If you can share examples of molecules that are slow to match may be able to advise on ways to speed things up.

I'll try to provide some examples but it may take some time. 😢

kexul avatar Feb 14 '23 08:02 kexul

Thanks, @kexul, I'll look into this. Given that we use the same tools behind the scenes it should be easy enough to implement (or test) the approach that you suggest. Before we switched to using RDKit for MCS (we used to just use the Sire functionality) we had custom code to do similar to what you suggest. The issue we found was, as @jmichel80 says, that the MCS might differ to the full molecule search. In order to reliably get the mappings that we wanted we ended up doing both approaches and taking the larger mapping, which is obviously very wasteful.

lohedges avatar Feb 14 '23 09:02 lohedges

@jmichel80: The prematch approach won't work since there's no way to use it as a seed for the MCS. Unfortunately we just use it retrospectively, i.e. checking that the MCS generated by RDKit does contain the prematch that the user wants. (If there are multiple MCS with the same size, then we return the one that matches the prematch, if it exists.) It could well be that the algorithm passed through a state containing the pre-match during its search, but we wont know this, since we only get the final result. If nothing matches the pre-match, then we fall back on Sire, which does check the pre-match at each stage of the MCS algorithm, i.e. as it grows in size.

lohedges avatar Feb 14 '23 09:02 lohedges

@kexul by how much do you have to increase the timeout argument to reach completion for your examples? It may be preferable to have a more robust and time consuming MCS search rather than a faster one that is more error prone. I depends on the computing cost of the operations downstream of the MCS search.

jmichel80 avatar Feb 14 '23 09:02 jmichel80

Sorry, it takes much longer than I think to find a proper example. I'll need some time to dig my old files... @jmichel80 In my previous experiments, I tried increasing the timeout to 5 min but matchAtoms() still fails to find a reasonable mapping.

kexul avatar Feb 14 '23 12:02 kexul

I see the RDKit MCS code has an InitialSeed option. Perhaps we could try matching with heavy atoms only, then using the MCS as a seed for a full atom search?

lohedges avatar Mar 03 '23 15:03 lohedges