Parallelization
This is not an issue, more of a question/suggestion. Is it possible to parallelize the run method of the Fingerprint class? I looked through the MDAnalysis documentation, and it is not straightforward because of how MDAnalysis.core.universe.Universe.trajectory is designed. But I also read about PMDA. So shouldn't it be possible to incorporate parallelization into ProLIF? I think this feature would really improve the package and its usability.
Hi @ale94mleon,
It's possible to parallelize the run method of ProLIF, and it's something I plan on including in the code at some point. In the meantime, here's a script to do that:
import multiprocessing as mp
from tqdm.auto import tqdm
import prolif as plf
import MDAnalysis as mda

# setup the mda.Universe, lig and prot selections
# ...

# parameters for the parallel run
N_PROCESSES = 8
frames = list(range(u.trajectory.n_frames))
interactions = ['HBDonor', 'HBAcceptor', 'PiStacking', 'Anionic', 'Cationic', 'CationPi', 'PiCation']

# run in parallel
def job(frame):
    fp = plf.Fingerprint(interactions)
    fp.run(u.trajectory[frame:frame+1], lig, prot, progress=False)
    return fp.ifp[0]

with mp.Pool(N_PROCESSES) as pool:
    results = []
    # trigger MDAnalysis caching
    lig.convert_to.rdkit()
    prot.convert_to.rdkit()
    for ifp in tqdm(pool.imap_unordered(job, frames),
                    total=len(frames)):
        results.append(ifp)

df = plf.to_dataframe(results, interactions)
This will run on all frames of your trajectory. If you only want a subset of the trajectory, change frames = list(range(u.trajectory.n_frames)) to what you need.
It will run 8 different processes in parallel, adjust that number according to your machine.
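As a rough illustration of the two knobs above (which frames to process and how many processes to use), here is a minimal standard-library sketch. It does not use ProLIF or MDAnalysis: select_frames, job and run_parallel are made-up names, and job only stands in for the per-frame fingerprint computation. Since imap_unordered yields results in whatever order the workers finish, the sketch returns the frame number alongside each result so the original order can be restored if needed.

```python
import multiprocessing as mp


def select_frames(n_frames, start=0, stop=None, step=1):
    """Return the frame indices to analyse (hypothetical helper)."""
    return list(range(n_frames))[start:stop:step]


def job(frame):
    """Placeholder for the per-frame work; returns (frame, result)."""
    return frame, frame * frame


def run_parallel(frames, n_processes=2):
    """Process frames in a pool and return results in the order of `frames`."""
    with mp.Pool(n_processes) as pool:
        # imap_unordered gives no ordering guarantee, so index by frame
        by_frame = dict(pool.imap_unordered(job, frames))
    return [by_frame[frame] for frame in frames]


if __name__ == "__main__":
    # e.g. every 10th frame between 10 and 50 of a 100-frame trajectory
    frames = select_frames(100, start=10, stop=50, step=10)
    print(run_parallel(frames, n_processes=2))
```

Keeping the frame number with each result is one simple way to stay independent of the completion order; sorting the final table by frame would work as well.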
Cool! This looks very nice. Thanks @cbouy !!
Something I noticed when trying to create ProLIF molecules is that the user-assigned RDKit mol property 'map index' was missing if I used mp.Pool. I imagine this is the case for other user-assigned properties, if they exist. I believe this issue arises from the pickling of the molecule objects when multiprocessing is used. I fixed this by running Chem.SetDefaultPickleProperties(Chem.PropertyPickleOptions.AllProps). Thought I'd just point this out in case this was something you weren't aware of!
Just to come back to this: it seems the solution I posted above has its issues. If I access the map index property on a mol that went through multiprocessing (with the default pickle properties set to AllProps), the map index is available, but it doesn't correspond to the correct atom numbering in the input file. If I do the same without multiprocessing, the atom numbering is correct.
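One way to narrow this down might be to test the pickling step on its own, without multiprocessing or ProLIF. The sketch below is only an assumption about how such a check could look: the property name demo_index and the helpers tag_atoms and mismatches_after_pickle are made up for illustration and are not ProLIF's own map index. It stores each atom's index in an integer property, round-trips the molecule through pickle with AllProps enabled, and lists the atoms whose property is missing or no longer matches their index.

```python
import pickle

from rdkit import Chem

# Keep all properties when RDKit molecules are pickled
Chem.SetDefaultPickleProperties(Chem.PropertyPickleOptions.AllProps)


def tag_atoms(mol, prop="demo_index"):
    """Store each atom's index in a (hypothetical) integer property."""
    for atom in mol.GetAtoms():
        atom.SetIntProp(prop, atom.GetIdx())
    return mol


def mismatches_after_pickle(mol, prop="demo_index"):
    """Round-trip through pickle and return the indices of atoms whose
    property is missing or differs from the atom's index."""
    clone = pickle.loads(pickle.dumps(mol))
    return [
        atom.GetIdx()
        for atom in clone.GetAtoms()
        if not atom.HasProp(prop) or atom.GetIntProp(prop) != atom.GetIdx()
    ]


mol = tag_atoms(Chem.MolFromSmiles("c1ccccc1O"))
print(mismatches_after_pickle(mol))
```

An empty list here would suggest that pickling alone keeps both the property and the atom order, so the mismatch would more likely come from another step in the parallel workflow.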
That doesn't sound right! Thanks for reporting it, I'll try to have a look soon.