pmda
Bad Performance of Parallelization with On-the-fly Transformation
Expected behaviour
Analysis of a Universe with an on-the-fly transformation scales reasonably well.
Actual behaviour
The scaling performance is very poor, even with only two cores.
Code
```python
import MDAnalysis as mda
from MDAnalysis import transformations as trans
from pmda.rms.rmsd import RMSD as parallel_rmsd

u = mda.Universe(files['PDB'], files['LONG_TRAJ'])  # 9000 frames
fit_trans = trans.fit_rot_trans(u.atoms, u.atoms)
u.trajectory.add_transformations(fit_trans)

for nj in [1, 2, 4, 8, 16, 32, 64]:
    rmsd = parallel_rmsd(u.atoms, u.atoms)
    rmsd.run(n_blocks=nj, n_jobs=nj)  # timeit
```
Reason
Some Transformations include a `numpy.dot` call, and `numpy.dot` is itself multi-threaded (it is dispatched to a multi-threaded BLAS). With several worker processes each spawning its own BLAS thread pool, the cores are oversubscribed.
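As a rough illustration (not pmda's actual code, and the array sizes are made up), a fitting transformation ultimately applies a rotation matrix to the whole coordinate array, i.e. a large matrix product that NumPy hands to the BLAS thread pool:

```python
import numpy as np

rng = np.random.default_rng(0)
positions = rng.random((100_000, 3))  # hypothetical coordinate array
R = np.eye(3)                         # hypothetical rotation matrix

# NumPy delegates this product to a multi-threaded BLAS; when every one of
# n_jobs worker processes does this with its own thread pool, the total
# number of threads exceeds the number of cores.
rotated = positions @ R.T
```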
Possible solution
- define `OMP_NUM_THREADS=1` (and the other `*_NUM_THREADS` variables) for `numpy` (https://docs.dask.org/en/latest/array-best-practices.html#avoid-oversubscribing-threads), which is surprisingly faster even for serial (single-core) performance.
- use `cupy` (https://cupy.dev/) to leverage GPU power (only replacing the `numpy.dot` operation of the `Transformation`).
Benchmarking result
- Benchmarking system:
  - AMD EPYC 7551 32-Core Processor
  - RTX 2080 Ti
  - CephFS file system
Current versions:
- MDAnalysis (run `python -c "import MDAnalysis as mda; print(mda.__version__)"`): 2.0.0-dev
- pmda (run `python -c "import pmda; print(pmda.__version__)"`):
- dask (run `python -c "import dask; print(dask.__version__)"`):