mdanalysis icon indicating copy to clipboard operation
mdanalysis copied to clipboard

Add option for `gemmi` backend for PDB reading

Open marinegor opened this issue 2 months ago • 19 comments

Is your feature request related to a problem?

It's more of a suggestion.

Describe the solution you'd like

Pull #4712 is about to introduce gemmi-based backend for mmCIF reading. It wasn't intended but, MMCIFReader/Parser can also read pdb files:

import MDAnalysis as mda
mda.Universe('testsuite/MDAnalysisTests/data/4E43.pdb', format='mmcif')
# <Universe with 1877 atoms>

since gemmi, being de-facto standard for RCSB stuff, can read both into the same gemmi.Model interface. Perhaps it's worth including an option to read pdb with gemmi backend then? I'm not sure how to technically do that though -- for instance, we could add format = 'pdb_gemmi' to MMCIFParser, and document this behavior in PDBParser.

Describe alternatives you've considered

Another option would be to just add a paragraph in documentation about this in MMCIFParser and/or PDBParser without changing any code -- I'd be probably fine with that, but just don't want this information to stay undocumented.

@MDAnalysis/coredevs do you have an opinion on that?

marinegor avatar Nov 06 '25 09:11 marinegor

Can you do a performance comparison between reading with our PDBParser/Reader vs gemmi?

If gemmi is comparable or faster then we just have to do a feature comparison to see if we should make gemmi the default (and possibly retire our version).

orbeckst avatar Nov 14 '25 01:11 orbeckst

For format: perhaps expand its syntax along the lines of including an optional library for parsing like gemmi|pdb or chemfiles|xtc .

 format="<library>|<fileformat>"

We could introduce "*" for the file format and then chemfiles|* means "use chemfiles for anything it can read". Similarly, gemmi|* then says, "use gemmi for CIF and PDB files".


I originally wanted to use : as separator but this will collide with URIs such as the ones we use for streaming such as "localhost://8888", which could be written explicitly with the imdclient as the library as "imdclient|localhost://8888".

Or does "[imdclient]localhost://8888" or "[gemmi]*" look better??? Has the disadvantage that [ and ] may occur directly in URIs as reserved chars (not %-encoded) (RFC 3986, Section 2.

orbeckst avatar Nov 14 '25 01:11 orbeckst

How does gemmi deal with "PDB trajectories", i.e., multi-MODEL files, often with hundreds of frames? This is still the lowest common denominator for a MD trajectory, I thik...

orbeckst avatar Nov 14 '25 01:11 orbeckst

@orbeckst

How does gemmi deal with "PDB trajectories", i.e., multi-MODEL files, often with hundreds of frames?

I think it does, since it handles NMR depositions as well, which are multi-model. I haven't implemented that in my PR, but it can be added in the future, and this moment can also be a moment to retire a legacy PDB reader.

Or does "[imdclient]localhost://8888" or "[gemmi]*" look better???

I would not be in favor of neither [] nor | based syntax, and rather explicitly specify reader = ... or format = ... for the Reader. You can have a look at e.g. pandas syntax for read_parquet, where they explicitly specify engine=....

marinegor avatar Nov 14 '25 10:11 marinegor

Can you do a performance comparison between reading with our PDBParser/Reader vs gemmi?

our PDBParser/Reader is faster, if you just read a file:

In [13]: %timeit u = mda.Universe('/Users/egormarin/Downloads/pdb-mmcif-bench/1FFK.cif')
467 ms ± 9.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [14]: %timeit u = mda.Universe('/Users/egormarin/Downloads/pdb-mmcif-bench/1FFK.pdb')
140 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [15]: %timeit u = mda.Universe('/Users/egormarin/Downloads/pdb-mmcif-bench/1FFK.cif.gz')
480 ms ± 850 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [16]: %timeit u = mda.Universe('/Users/egormarin/Downloads/pdb-mmcif-bench/1FFK.pdb.gz')
168 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

But I feel like there's some constant overhead that I don't know about yet. Perhaps we should add feature for performance optimization of it, what's your opinion?

marinegor avatar Nov 14 '25 10:11 marinegor

As part of the EOSS4 performance enhancements @richardjgowers looked into fast text file readers. The omnireader in C++ with Cython bindings was a very promising proof of principle (potential for 5-10x speed-up) but we didn't have time to fully package it and use it for the core library. I'll share the final report with you on discord.

orbeckst avatar Nov 14 '25 15:11 orbeckst

would not be in favor of neither [] nor | based syntax, and rather explicitly specify reader = ... or format = ... for the Reader. You can have a look at e.g. pandas syntax for read_parquet, where they explicitly specify engine=....

That's fair.

Using an explicit engine=... has the advantage that it's programmatically clearer. I feel that the word "engine" is not very semantically clear although I'll just use it for this discussion. In the context of a Universe we'd then have to use engine for coordinates (format) and topology_engine (for topology_format).

I am pretty sure that explicitly specifying a ReaderClass or a ParserClass is too complicated for most folks as you have to know from where to import it in the first place.

orbeckst avatar Nov 14 '25 15:11 orbeckst

Based on your performance comparison, I would not want to default to gemmi for PDB reading, as long as using the PDBReader produces correct results.

orbeckst avatar Nov 14 '25 15:11 orbeckst

our PDBParser/Reader is faster, if you just read a file

Which results are which for this? Currently I am just seeing comparisons between PDB / CIF which would have different speeds, and the comparisons with the .gz versions will have the additional overhead of decompressing before reading as well?

BradyAJohnston avatar Nov 14 '25 16:11 BradyAJohnston

@BradyAJohnston haha you're right, I didn't actually do the comparison😁

here it is:

In [1]: import MDAnalysis as mda

In [2]: %timeit u = mda.Universe('/Users/egormarin/Downloads/pdb-mmcif-bench/1FFK.pdb.gz')
166 ms ± 2.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %timeit u = mda.Universe('/Users/egormarin/Downloads/pdb-mmcif-bench/1FFK.pdb.gz', format='pdb_gemmi')
423 ms ± 9.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

it's still slower though.

I'll look into the reasons for it, and after it's speed is on par, we might switch to it fully, but for now I suggest just having this option explicitly for opt-in.

As for this @orbeckst

I am pretty sure that explicitly specifying a ReaderClass or a ParserClass is too complicated for most folks as you have to know from where to import it in the first place.

I think we had this pattern earlier with backend=... for AnalysisBase -- it's type is Literal['dask', 'multiprocessing'] | BackendBase, which covers both "I want to use built-in" and "I want to customize the behavior".

marinegor avatar Nov 14 '25 17:11 marinegor

If it's a string then it's not a problem; I was thinking about passing a class (which you can do).

orbeckst avatar Nov 14 '25 18:11 orbeckst

IIRC the current PDB parser has "custom cases", that were added to support random behaviour (abuses of the PDB standard) that some MD engines / external tools like to use.

How does gemmi differ in terms of behaviour?

My main thing here is that if the behaviours change, I would vote for not try to shoehorn in a different backend, but instead just having a different class that we call differently.

We might eventually decide that our current PDB parser is no good (indeed I'm tired of all the shims we have to put into it), and instead we default to gemmi. But it would be better for us to properly split things with different behaviour, rather than have just an optional backend and in some cases it behaves in one way and in another it behave differently.

Edit: to summarise my views on the above - backends are fine when you have the SAME behaviour, but not when you have differing behaviour.

IAlibay avatar Nov 14 '25 20:11 IAlibay

I'd also be in favour of replacing our current parser/reader either with gemmi or chemfiles. Here are some more timings of our PDBParser, chemfiles, and gemmi:

import MDAnalysis as mda
from MDAnalysisTests.datafiles import PDB
import chemfiles
import gemmi

%timeit mda.Universe(PDB)
420 ms ± 6.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit mda.Universe(PDB, format='cif')  # uses the parser and reader in #4712 
1.18 s ± 23.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Compare reading coordinates only (as there's no chemfiles topology parser):

%timeit mda.coordinates.PDB.PDBReader(PDB)
272 ms ± 2.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit mda.coordinates.MMCIF.MMCIFReader(PDB)
434 ms ± 3.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit mda.coordinates.chemfiles.ChemfilesReader(PDB)
85.2 ms ± 997 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using chemfiles and gemmi directly

%timeit chemfiles.Trajectory(PDB).read()
43.1 ms ± 424 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit gemmi.read_structure(PDB)
16.3 ms ± 196 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Should we consider replacing PDBReader with ChemfilesReader, if the behaviour is the same?

It would also be good to understand why MMCIFReader is slower than ChemfilesReader, as (I think) chemfiles uses gemmi under the hood for reading PDBs.

Edit: there are other formats chemfiles could be the default reader for https://github.com/MDAnalysis/mdanalysis/issues/2731

p-j-smith avatar Nov 14 '25 23:11 p-j-smith

@p-j-smith thanks, I didn't know about existence of chemfiles!

I think that one of the reasons current reader is slow is because it's reading the file twice -- once per topology and coordinate reads. Based on my small py-spy benchmark, both of these reads constitute for roughly 50% of the whole runtime (on mac m4).

I feel like I should be able to do some optimization related to that (e.g. in the coordinates reader read both coordinates and topology) -- do you know any examples of that in the existing codebase? This would also allow for better reading of streamed trajectories, where you really don't want to read things twice (cc @ljwoods2 here)

@IAlibay

How does gemmi differ in terms of behaviour?

I actually have no idea, haven't tested it. But I totally can imagine that for some cases it'd be different, since they're technically not standard-compliant. And the standard is frozen by now, so they never will be, leaving our current reader and its modifications the only way to read them, if gemmi can't but you really want to.

marinegor avatar Nov 16 '25 11:11 marinegor

@marinegor For streaming at least, there's no example of caching a topology read and leveraging that information for the trajectory or vice versa, and I'm not sure we could make use of this if possible anyways since IMD doesn't include topology information.

One potential solution could be allow a gemmi.Structure as a valid input to the MMCIFReader- and then some universe creation logic that detects the case where the same CIF file is both the topology and single frame trajectory and passing the gemmi.Structure through to the reader instead of re-parsing it.. a bit janky though

Also, it looks like internally chemfiles passes a CIF file contents string directly to gemmi rather than a filepath by invoking gemmi.cif.read_string on the C++ side to get a gemmi.Document and then converting a Document -> Block -> Structure - https://github.com/chemfiles/chemfiles/blob/3c849eb3cf0ff72076d2baf4b97b5b8378e5f4bf/src/formats/CIF.cpp#L73

But for PDB legacy I don't think it uses gemmi at all, they seem to have their own parser @p-j-smith https://github.com/chemfiles/chemfiles/blob/3c849eb3cf0ff72076d2baf4b97b5b8378e5f4bf/src/formats/PDB.cpp

My PR similarly passes a CIF/PDB string directly to gemmi rather than a path, and this seems to lead to some speedup (https://github.com/marinegor/mdanalysis/pull/4):

%timeit -r 20 mda.Universe(PDB)
753 ms ± 191 ms per loop (mean ± std. dev. of 20 runs, 1 loop each)

%timeit -r 20 mda.Universe(PDB, format='cif')
863 ms ± 236 ms per loop (mean ± std. dev. of 20 runs, 1 loop each)

%timeit -r 20 mda.coordinates.PDB.PDBReader(PDB)
469 ms ± 157 ms per loop (mean ± std. dev. of 20 runs, 1 loop each)

%timeit -r 20 mda.coordinates.MMCIF.MMCIFReader(PDB)
135 ms ± 27.2 ms per loop (mean ± std. dev. of 20 runs, 1 loop each)

%timeit -r 20 mda.coordinates.chemfiles.ChemfilesReader(PDB)
191 ms ± 22 ms per loop (mean ± std. dev. of 20 runs, 1 loop each)

%timeit -r 20 mda.topology.PDBParser.PDBParser(PDB).parse()
391 ms ± 158 ms per loop (mean ± std. dev. of 20 runs, 1 loop each)

%timeit -r 20 mda.topology.MMCIFParser.MMCIFParser(PDB).parse()
711 ms ± 159 ms per loop (mean ± std. dev. of 20 runs, 1 loop each)

It does seem like topology parsing is where the gemmi parsers are losing, so maybe there's some profiling/optimization that can be done there

ljwoods2 avatar Nov 16 '25 17:11 ljwoods2

It would also be good to understand why MMCIFReader is slower than ChemfilesReader, as (I think) chemfiles uses gemmi under the hood for reading PDBs.

No, chemfiles has a custom implementation for PDB and mmCIF readers. It does use GEMMI to read crystallographic CIF files, in the code found by @ljwoods2.

Luthaf avatar Nov 17 '25 09:11 Luthaf

@ljwoods2 thanks for the PR, I've added it on top!

regarding two separate readings, technically current implementation might yield incorrect results if the file on disk changes between reading topology and trajectory, right? I'm completely fine with it though, just wanted to re-iterate😁

and then some universe creation logic that detects the case where the same CIF file is both the topology and single frame trajectory

I wouldn't actually mind that, it seems like a correct idea (at some point), also because of validity point.

It does seem like topology parsing is where the gemmi parsers are losing, so maybe there's some profiling/optimization that can be done there

Yes, indeed. So until time parity is reached, I'd suggest not making this reader default.

Sidenote: I'm getting some weird results on my machine. Namely, when using legacy reader, they're not additive for reading topology + coordinates vs reading Universe:

mdanalysis ❯ uv run profile.py
Measured what='coordinates' reader='pdb' gzipped=True n_cycles=20
mean=168 ms	std= 6
Measured what='topology' reader='pdb' gzipped=True n_cycles=20
mean=105 ms	std= 2
Measured what='universe' reader='pdb' gzipped=True n_cycles=20
mean=173 ms	std= 6
Measured what='coordinates' reader='cif' gzipped=True n_cycles=20
mean=162 ms	std= 4
Measured what='topology' reader='cif' gzipped=True n_cycles=20
mean=281 ms	std= 6
Measured what='universe' reader='cif' gzipped=True n_cycles=20
mean=442 ms	std=11

script is here: profile.py

do you have any idea why is that happening?

marinegor avatar Nov 17 '25 17:11 marinegor

@marinegor I suspect it is this line which only occurs when the PDBReader is called outside of Universe context and parses the entire PDB file a second time:

https://github.com/MDAnalysis/mdanalysis/blob/3189d484faac4797c9f9573c3bdd3da6df12a092/package/MDAnalysis/coordinates/PDB.py#L285

From jupyter timing, passing n_atoms to the reader seems to resolve:

import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PDB

f = PDB

%timeit -r 20 mda.coordinates.PDB.PDBReader(f, n_atoms=47681)
91.9 ms ± 591 μs per loop (mean ± std. dev. of 20 runs, 10 loops each)

%timeit -r 20  mda.topology.PDBParser.PDBParser(f).parse()
176 ms ± 1.43 ms per loop (mean ± std. dev. of 20 runs, 10 loops each)

%timeit -r 20  mda.Universe(f, format="pdb")
394 ms ± 3.33 ms per loop (mean ± std. dev. of 20 runs, 1 loop each)

ljwoods2 avatar Nov 17 '25 18:11 ljwoods2

Note to self: consider adding "setup_entities" after reading pdb, following gemmi recommendations:

If your programs reads PDB files, it is a good idea to call setup_entities() after read_structure() because many of the gemmi functions depend on it.

marinegor avatar Nov 20 '25 15:11 marinegor