ProLIF Parallelization

This is not an issue, more of a question/suggestion. Is it possible to parallelize the run method of the Fingerprint class? I looked through the MDAnalysis documentation, and it is not so straightforward because of how MDAnalysis.core.universe.Universe.trajectory is designed. But I also read about PMDA (Parallel MDAnalysis). So shouldn't it be possible to incorporate parallelization into ProLIF? I think this feature would really improve the package and its usability.

Mar 15 '22 09:03 ale94mleon

Hi @ale94mleon,

It's possible to parallelize the run method of ProLIF, and it's something I plan on including in the code at some point. In the meantime, here's a script to do that:

import multiprocessing as mp
from tqdm.auto import tqdm
import prolif as plf
import MDAnalysis as mda

# set up the mda.Universe (u) and the lig and prot selections
# ...

# parameters for the parallel run
N_PROCESSES = 8
frames = list(range(u.trajectory.n_frames)) 
interactions = ['HBDonor', 'HBAcceptor', 'PiStacking', 'Anionic', 'Cationic', 'CationPi', 'PiCation']

# run in parallel 
def job(frame):
    fp = plf.Fingerprint(interactions)
    fp.run(u.trajectory[frame:frame+1], lig, prot, progress=False)
    return fp.ifp[0]

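# trigger MDAnalysis caching before the pool is created: worker processes
# are started when the pool is constructed, so with the fork start method
# they only inherit a cache that was filled beforehand
lig.convert_to.rdkit()
prot.convert_to.rdkit()

with mp.Pool(N_PROCESSES) as pool:
    results = []
    for ifp in tqdm(pool.imap_unordered(job, frames),
                    total=len(frames)):
        results.append(ifp)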

df = plf.to_dataframe(results, interactions)

This runs on all frames of your trajectory. If you only want a subset of the trajectory, change frames = list(range(u.trajectory.n_frames)) to what you need. It runs 8 processes in parallel; adjust N_PROCESSES according to your machine.
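One thing to keep in mind with this approach: imap_unordered yields results in completion order, not in frame order. Below is a minimal sketch of picking a subset of frames and restoring frame order afterwards. It assumes the job is changed to return its frame index alongside its result. The select_frames and restore_order helpers are hypothetical and not part of ProLIF or MDAnalysis, and the strings stand in for real fingerprint results.

```python
def select_frames(n_frames, start=0, stop=None, step=1):
    """Return the frame indices to analyse (hypothetical helper)."""
    stop = n_frames if stop is None else min(stop, n_frames)
    return list(range(start, stop, step))


def restore_order(tagged_results):
    """Sort (frame, result) pairs by frame index and drop the frame tag."""
    return [result for _, result in sorted(tagged_results, key=lambda pair: pair[0])]


# every 10th frame among the first 50 frames of a 1000-frame trajectory
frames = select_frames(1000, stop=50, step=10)

# stand-in for what the pool might yield if job returned (frame, result):
# the pairs arrive in completion order, not in frame order
tagged = [(20, "ifp20"), (0, "ifp0"), (40, "ifp40"), (10, "ifp10"), (30, "ifp30")]
ordered = restore_order(tagged)
```

Sorting once at the end keeps the load-balancing benefit of imap_unordered. If frame order is all that matters, pool.imap (which preserves input order) is a simpler alternative.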

Mar 15 '22 11:03 cbouy

Cool! This looks very nice. Thanks @cbouy !!

Mar 15 '22 12:03 ale94mleon

Something I noticed when trying to create ProLIF molecules is that the user-assigned RDKit mol property 'map index' was missing if I used mp.Pool. I imagine the same applies to any other user-assigned properties. I believe this happens because the molecule objects are pickled when multiprocessing is used. I fixed this by running: Chem.SetDefaultPickleProperties(Chem.PropertyPickleOptions.AllProps). Thought I'd point this out in case it was something you weren't aware of!

Jun 27 '22 13:06 noahharrison64

Just to come back to this: it seems the solution I posted above has its issues. If I access the map index property on a mol that went through multiprocessing (with the default pickle properties set to AllProps via Chem.SetDefaultPickleProperties), the map index is available, but it doesn't correspond to the correct atom numbering in the input file. If I do the same without multiprocessing, the atom numbering is correct.

Jul 08 '22 11:07 noahharrison64

That doesn't sound right! Thanks for reporting it, I'll try to have a look soon.

Jul 10 '22 21:07 cbouy