mdanalysis
mdanalysis copied to clipboard
the `InterRDF_s` class is significantly slower than running multiple `InterRDF` instances sequentially
Expected behavior
Recently, I have been conducting electrolyte simulation-related work using Gromacs 2025.1. Since I frequently use Python, I did not use gmx rdf. For multiple RDF calculations, I use InterRDF_s to accelerate.
Actual behavior
When analyzing multiple RDF pairs on a GROMACS 2025.1 trajectory, the InterRDF_s class is significantly slower than running multiple InterRDF instances sequentially, despite its design for batch processing.
I conducted an RDF analysis on the same group of molecules. When verbose=True is set, InterRDF_s shows a speed of 3.32 it/s while InterRDF shows a speed of 141.48 it/s. With the help of AI, I used cProfile for function-level analysis. The result showed that numpy.histogram was called too frequently, which affected the speed.
Code to reproduce the behavior
import MDAnalysis as mda
from MDAnalysis.analysis import rdf
u = mda.Universe("s4_pr.tpr", "s4_pr.trr")
li_sel = u.select_atoms("resname LI")
an_sel = u.select_atoms("resname AN")
rdf_calc = rdf.InterRDF(li_sel, an_sel,verbose=True)
rdf_calc.run(stop=50)
rdf_calcs = rdf.InterRDF_s(u, [[li_sel,an_sel]],verbose=True)
rdf_calcs.run(stop=50)
Current version of MDAnalysis
- Which version are you using? (run
python -c "import MDAnalysis as mda; print(mda.__version__)") 2.9.0 - Which version of Python (
python -V)? python 3.11.11(numpy==2.2.0) - Which operating system? ubuntu22
I made some temporary modifications to the code and the processing speed became reasonable. The for loop in mdanalysis/package/MDAnalysis/analysis/rdf.py might not be the most reasonable design.
def _single_frame(self):
for i, (ag1, ag2) in enumerate(self.ags):
pairs, dist = distances.capped_distance(
ag1.positions,
ag2.positions,
self._maxrange,
box=self._ts.dimensions,
)
# RAW
# for j, (idx1, idx2) in enumerate(pairs):
# count, _ = np.histogram(dist[j], **self.rdf_settings)
# self.results.count[i][idx1, idx2, :] += count
# modifications
count, _ = np.histogram(dist, **self.rdf_settings)
self.results.count[i][0, 0, :] += count
if self.norm == "rdf":
self.volume_cum += self._ts.volume
@gitzhangch thanks for looking into the code. Suggestions for performance improvements are always very welcome!
Could you create a PR with your changes? Then we can see if the tests pass and can discuss your proposed code in more detail.
@orbeckst I have opened a draft pull request with my proposed changes. Since I'm not fully familiar with the overall design yet, I keep my modifications conservative.