mdanalysis icon indicating copy to clipboard operation
mdanalysis copied to clipboard

MDAnalysis 6 times slower than pytraj to read GROMACS's trr file

Open gmonet opened this issue 2 years ago • 6 comments

Actual behavior

Hello to all, I found that MDAnalysis was significantly slower to read trr files than pytraj. With the test files included in MDAnalsyis, MDAnalysis is 6x slower than pytraj. With my own files (14573 atoms), it is up to 15x slower. Honestly I prefer to use the MDAnalysis API but this poor performance makes me switch to pytraj. Best.

Code to reproduce the behavior

import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PSF, DCD,  GRO, PDB, TPR, XTC, TRR,  PRMncdf, NCDF
import pytraj

u = mda.Universe(PDB, TRR)
u.trajectory
# <TRRReader .../anaconda3/lib/python3.8/site-packages/MDAnalysisTests/data/adk_oplsaa.trr with 10 frames of 47681 atoms>
%timeit u.trajectory[0].positions
# 6.41 ms ± 31.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
traj = pytraj.TrajectoryIterator(TRR, PDB)
%timeit traj[0].coordinates
# 1.09 ms ± 5.34 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Current version of MDAnalysis

  • Which version are you using? (run python -c "import MDAnalysis as mda; print(mda.__version__)") MDAnalysis : 2.2.0 pytraj : 2.0.5
  • Which version of Python (python -V)? Python 3.8.8
  • Which operating system? Ubuntu 20.04.4 LTS

gmonet avatar Jun 23 '22 08:06 gmonet

Do you measure the time on the first read or on subsequent ones? On first read, MDAnalysis builds an index of the frames so it can do random access later. That index is cached so subsequent reads should be faster.

jbarnoud avatar Jun 23 '22 08:06 jbarnoud

Even after the index building, MDAnalysis is still slow for reading trr files :

import numpy as np
import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PSF, DCD,  GRO, PDB, TPR, XTC, TRR,  PRMncdf, NCDF
import pytraj
u = mda.Universe(PDB, TRR)
traj = pytraj.TrajectoryIterator(TRR, PDB)
# MDAnalysis performance for reading trr
%%time
for i in range(100):
    step = np.random.randint(0,9)
    u.trajectory[step].positions

CPU times: user 645 ms, sys: 4.07 ms, total: 650 ms Wall time: 649 ms

# pytraj performance for reading trr
%%time
for i in range(100):
    step = np.random.randint(0,9)
    traj[step].coordinates

CPU times: user 108 ms, sys: 20.1 ms, total: 129 ms Wall time: 127 ms


Actually, if I convert the trr file to Amber netcdf file, I get a massive boost of performance:

# Convert trr to nc file with cpptraj.
import os
from pathlib import Path
NCDF = str(Path(TRR).with_suffix('.nc'))
_ = os.popen(f'cpptraj -p {PDB} -y {TRR} -x {NCDF}')
u = mda.Universe(GRO, NCDF)
traj = pytraj.TrajectoryIterator(NCDF, PDB)
# MDAnalysis performance for reading nc
%%time
for i in range(100):
    step = np.random.randint(0,9)
    u.trajectory[step].positions

CPU times: user 13.4 ms, sys: 109 µs, total: 13.5 ms Wall time: 12.7 ms

# pytraj performance for reading nc
%%time
for i in range(100):
    step = np.random.randint(0,9)
    traj[step].coordinates

CPU times: user 44.2 ms, sys: 28 ms, total: 72.2 ms Wall time: 70.4 ms

MDAnalysis is even faster than pytraj for reading Amber netcdf files.

gmonet avatar Jun 23 '22 09:06 gmonet

I did a quick test on XTC file performance and they are comparable. It seems that only TRR files are slow.

import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PSF, DCD,  GRO, PDB, TPR, XTC, TRR,  PRMncdf, NCDF
import pytraj

u = mda.Universe(PDB, XTC)
%timeit u.trajectory[0].positions
1.96 ms ± 1.19 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

traj = pytraj.TrajectoryIterator(XTC, PDB)
%timeit traj[0].coordinates
2.12 ms ± 8.08 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

yuxuanzhuang avatar Jun 23 '22 09:06 yuxuanzhuang

I see, the problem is indeed with the trr format. Unfortunately, I am forced to work on the trr format because it is the only format in which GROMACS records speeds and forces.

gmonet avatar Jun 23 '22 09:06 gmonet

I see, the problem is indeed with the trr format. Unfortunately, I am forced to work on the trr format because it is the only format in which GROMACS records speeds and forces.

A temporary solution could be loading the trajectory into the memory (if you memory can hold it ofc)

u = mda.Universe(PDB, TRR, in_memory=True)
%timeit u.trajectory[0].positions
3.44 µs ± 27.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

yuxuanzhuang avatar Jun 23 '22 09:06 yuxuanzhuang

I'm affraid I can't afford to load the entire trajectory (about 10Go) into the memory (I analyse multiple trajectory in parallel) :/

gmonet avatar Jun 23 '22 10:06 gmonet