pmda
Bad Performance of Parallelization with On-the-fly Transformation
Expected behaviour
Analysis of a Universe with an on-the-fly transformation scales reasonably well.
Actual behaviour
The scaling performance is very poor, even with only two cores.
Code
```python
import MDAnalysis as mda
from MDAnalysis import transformations as trans
from pmda.rms.rmsd import RMSD as parallel_rmsd

u = mda.Universe(files['PDB'], files['LONG_TRAJ'])  # 9000 frames
fit_trans = trans.fit_rot_trans(u.atoms, u.atoms)
u.trajectory.add_transformations(fit_trans)

for nj in [1, 2, 4, 8, 16, 32, 64]:
    rmsd = parallel_rmsd(u.atoms, u.atoms)
    rmsd.run(n_blocks=nj, n_jobs=nj)  # timeit
```
Reason
Some Transformations include a `numpy.dot` call, and `numpy.dot` is itself multi-threaded (it is dispatched to a multi-threaded BLAS). With several worker processes each spawning its own BLAS thread pool, the cores are oversubscribed.
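As a rough illustration (not pmda's actual code, and the array sizes are made up), a fitting transformation ultimately applies a rotation matrix to the whole coordinate array, i.e. a large matrix product that NumPy hands to the BLAS thread pool:

```python
import numpy as np

rng = np.random.default_rng(0)
positions = rng.random((100_000, 3))  # hypothetical coordinate array
R = np.eye(3)                         # hypothetical rotation matrix

# NumPy delegates this product to a multi-threaded BLAS; when every one of
# n_jobs worker processes does this with its own thread pool, the total
# number of threads exceeds the number of cores.
rotated = positions @ R.T
```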
Possible solution
- define `OMP_NUM_THREADS=1` (and the other `*_NUM_THREADS` variables) for `numpy` (https://docs.dask.org/en/latest/array-best-practices.html#avoid-oversubscribing-threads), which is surprisingly faster even for serial (single-core) performance.
- use `cupy` (https://cupy.dev/) to leverage GPU power (only replacing the `numpy.dot` operation of the `Transformation`).
Benchmarking result
- Benchmarking system:
  - AMD EPYC 7551 32-Core Processor
  - RTX 2080 Ti
  - CephFS file system
Current versions:
- MDAnalysis (run `python -c "import MDAnalysis as mda; print(mda.__version__)"`): 2.0.0-dev
- pmda (run `python -c "import pmda; print(pmda.__version__)"`):
- dask (run `python -c "import dask; print(dask.__version__)"`):