mdanalysis icon indicating copy to clipboard operation
mdanalysis copied to clipboard

Parse hexadecimal resid from OpenMM in PDBParser

Open rladbtls1998 opened this issue 5 months ago • 20 comments

Is your feature request related to a problem?

OpenMM generate a PDB file with hexadecimal resid if there are more than 9999 residues. However, in MDAnalysis, hexadecimal resids (treated as faulty resid) are set to 1, which causes problems in some cases.

Describe the solution you'd like

I'd like to change PDBParser._parseatoms to handle Hexadecimal resid, if there are alphabets in resid. Changing https://github.com/MDAnalysis/mdanalysis/blob/c4af00babc9fc69671f1ea182b8b3a9b8807230d/package/MDAnalysis/topology/PDBParser.py#L302-L304 to

                    else:
                        if any([a.isalpha() for a in line[22:26]]):
                            resid = int(line[22:26], base=16) - 30960
                        else:
                            resid = int(line[22:26])
                        # Wrapping

Describe alternatives you've considered

Or setting unique resids (increment from last resid / preset failed resid starting number e.g. 10000) for residues with failed resid parsing can be another option (while this can be problematic for other cases). Or editing Universe to get optional Parser(TopologyReaderBase) so one can give custom topology parser if needed.

Additional context

openmm.app.pdbfile.py : 484 (from OpenMM version 8.2.0)

def _formatIndex(index, places):
    """Create a string representation of an atom or residue index.  If the value is larger than can fit
    in the available space, switch to hex.
    """
    if index < 10**places:
        format = f'%{places}d'
        return format % index
    format = f'%{places}X'
    shiftedIndex = (index - 10**places + 10*16**(places-1)) % (16**places)  <----- This part
    return format % shiftedIndex

One major problem is that OpenMM uses decimal until resid 9999 and starts from hex A000 (== decimal 40960) in place of decimal resid 10000, so parser needs to subtract 30960 or just use resid 40960. I think both may cause problems in other parts.

rladbtls1998 avatar Jul 24 '25 05:07 rladbtls1998

i can solve this issue

aniketpati1121 avatar Jul 26 '25 04:07 aniketpati1121

@aniketpati1121 while the issue itself is simple, I think it needs reviews or decisions from contributors.

rladbtls1998 avatar Jul 30 '25 03:07 rladbtls1998

If you look at #1897 and PR #1978 then I'd say MDAnalysis already supports hybrid36 PDB format.

Or is what you're looking for not "hybrid36"?

orbeckst avatar Aug 04 '25 02:08 orbeckst

Thanks. didn't know about hybrid36 format. However, OpenMM seems to use hexadecimal (base16), not hybrid36 format (base 36). (openmm/openmm#3501 and openmm/openmm#3522) (in openmm/openmm#659 mentioned in #1897, they ended up implementing hex, not hybrid36)

I don't think it is practical to deal with every non-standard pdb format, so imo it is good enough to parse hex resid with hybrid36, since it prevents duplicate resids. or Letting Universe __init__ to accept optional parser (TopologyReaderBase) argument, so that users can provide custom parsers for non-standard formats, could also solve this issue.

Another minor problem is that hybrid36 PDB format is only used in atom serial number, but not in residue ids. Simply changing https://github.com/MDAnalysis/mdanalysis/blob/c4af00babc9fc69671f1ea182b8b3a9b8807230d/package/MDAnalysis/topology/PDBParser.py#L302-L304 to

                else:
                   resid = hy36decode(4, line[22:26])
                    # Wrapping

does fix the missing hy36 resid issue.

rladbtls1998 avatar Aug 04 '25 05:08 rladbtls1998

Letting Universe __init__ to accept optional parser (TopologyReaderBase) argument, so that users can provide custom parsers for non-standard formats, could also solve this issue.

You can always provide a parser-subclass of TopologyReaderBase as a kwarg for Universe for topology_format=YourParser (and similar for the trajectory format=YourReader).

orbeckst avatar Aug 04 '25 16:08 orbeckst

If I read your issue and the linked openmm issue/PR correctly then you're really asking for supporting reading of VMD-style hex numbered PDBs, right?

I don't think it is practical to deal with every non-standard pdb format, so imo it is good enough to parse hex resid with hybrid36, since it prevents duplicate resids.

Your suggestion would create another format, I think, because the resids would then have different numbers in MDAnalysis compared to how they were created.

If at all, I'd rather start from a defined standard. This might mean to analyze carefully what VMD and OpenMM are doing, write it up, and then decide if this means to support another parser. (We already have XPDB for extended residue numbers, we could have VMDPDB.)

orbeckst avatar Aug 04 '25 16:08 orbeckst

Adding more support for new variants of an already deprecated file format just feels like spending efforts in the wrong direction.

Would time be better spent getting PDBx support into MDAnalysis? I know that OpenMM supports writing to PDBx.

IAlibay avatar Aug 04 '25 17:08 IAlibay

Letting Universe __init__ to accept optional parser (TopologyReaderBase) argument, so that users can provide custom parsers for non-standard formats, could also solve this issue.

You can always provide a parser-subclass of TopologyReaderBase as a kwarg for Universe for topology_format=YourParser (and similar for the trajectory format=YourReader).

Thank you. I will do this.

If I read your issue and the linked openmm issue/PR correctly then you're really asking for supporting reading of VMD-style hex numbered PDBs, right?

I meant just using existing hy36 format implementation for resids. Using currently existing hybrid36 to decode hex resids (from OpenMM/VMD etc) prevent duplicate resids by not failing and does not require any implementations except editing resid to also use hybrid36 . (for context, in my case, all water hex resids failed to parse and were treated as resid 1, which caused problems due to duplicate resids)

Would time be better spent getting PDBx support into MDAnalysis? I know that OpenMM supports writing to PDBx.

There seems to be OpenMM converter that uses OpenMM to convert PDBx into MDAnalysis topology.

Adding more support for new variants of an already deprecated file format just feels like spending efforts in the wrong direction.

I agree.

FYI, after reading #1897 again, I noticed that #1897 was initially about hexadecimal format, not hybrid36. (similar to https://github.com/openmm/openmm/issues/659#issuecomment-59846803)

Here's a really funny one packmol writes PDB indices in hexadecimal when they get too high. PDB really is the format that never stops giving

But I guess it never caused problems, since hy36 does decode hex into unique integers. (I'm not saying that hex format needs to be implemented though.)

TL;DR, I suggest changing https://github.com/MDAnalysis/mdanalysis/blob/c4af00babc9fc69671f1ea182b8b3a9b8807230d/package/MDAnalysis/topology/PDBParser.py#L302-L304 to use hybrid36 as I mentioned

rladbtls1998 avatar Aug 05 '25 04:08 rladbtls1998

Thanks for digging through all these old discussions. Amongst them was the comment https://github.com/MDAnalysis/mdanalysis/issues/1897#issuecomment-401818711

Does it make sense to allow this format also for residue ids?

The purist in me says "no, it's not a well-defined standard" ... but then I re-read the discussion on #1897 where we essentially did exactly what you're asking to do here: deal with odd files and just make them work.

I wholeheartedly agree that we should support PDBx/mmcif #2367 .

I also admit that your proposed fix seems quite straightforward and may not really influence much the speed at which issue #2367 is being solved. It also does not look as if it will break anything except that the actual resids will be numerically different (in base 10) from what VMD/openmm originally wrote.

In the light of the history of #1897 I am tending to support your change but I'd like to hear other @MDAnalysis/coredevs to weigh in — a simple 👍 or 👎 would be good, comments even better!

orbeckst avatar Aug 05 '25 16:08 orbeckst

My only concern would be in potential performance hit for parsing a large topology (likely minor), but ultimately the topology parsing is rarely the performance bottleneck so I think it's a sensible change.

BradyAJohnston avatar Aug 05 '25 16:08 BradyAJohnston

I'm going to take a negative stance here. I understand that it's just a "small fix", but it's always "just another small fix" with PDB files, for some off standard adjustment that someone somewhere decided to randomly make because it was convenient for their tool at the time.

I will change my opinion to a positive one if: 0. Performance isn't affected for standard complying PDB parsing

  1. This is the last off standard feature we allow in the PDB parser. If someone comes tomorrow with another feature that sits outside PDB V3, we reject it.
  2. This isn't just a fix, but also a ton of documentation on what this new off standard feature is, where it comes from, and how it behaves.

IAlibay avatar Aug 05 '25 17:08 IAlibay

I think it’s reasonable to “fix” and fall back to the hybrid36 format for resid IDs, given that atom IDs already support the hybrid36 format.

To avoid any performance issues and maintain consistency with how atom IDs are parsed, the most sensible fix would be:

                        try:
                            resid = int(line[22:26])
                        except ValueError:
                            resid = hy36decode(4, line[22:26])

At the very least, this would ensure the correct number of residues is identified, rather than lumping everything with a nonstandard hex format into a single residue ID.

If we want to properly support hexadecimal (rather than silently interpreting it as hybrid36), we should introduce a new flag, e.g., parse_hex=True. However, that would require significant documentation effort, and I agree with Irfan that it’s not worth it for a deprecated format.

yuxuanzhuang avatar Aug 05 '25 20:08 yuxuanzhuang

  1. Performance isn't affected for standard complying PDB parsing

I ran a simple benchmark test for hy36decode. Using hy36decode for decimal numbers does make the parsing about 2.3 times slower compared to int(). But:

  1. parsing 10000 numbers takes about 0.1 seconds using int() and using try~except suggested by @yuxuanzhuang can mitigate this (while try-except makes hybrid36 number parsing 1.2 times slower)
  2. hybrid36 is already used for atom IDs, so I think it is more consistent to support hy36 for resid and wouldn't cause too much trouble.

I'm fine with using topology_format=YourParser for Universe, so no problem either way.

rladbtls1998 avatar Aug 06 '25 06:08 rladbtls1998

@rladbtls1998 could you provide an example of such PDB file?

#4712 implements gemmi-based reading of files, and I assume gemmi might also support your usecase, but I need a file to test that (and I'd love to add it to the test set).

Or otherwise, you can test it yourself if you checkout #4712 and do explicit

import MDAnalysis as mda

u = mda.Universe('my.pdb', format='mmcif')

marinegor avatar Nov 06 '25 12:11 marinegor

@marinegor

I think you can just use water molecules in with resid >9999 if you want to use it as a test data.

1f0u_solv.pdb

rladbtls1998 avatar Nov 09 '25 09:11 rladbtls1998

@rladbtls1998 thanks!

as I understand, the file you sent has int('A32E', base=16) - 30960 + (635-501) = 10948 water residues (134 in chain D and 10814 in chain E, as I checked manually assuming the residue numbering for D and E chains is consequential). Seems that legacy reader indeed gives wrong results, while gemmi does not:

import MDAnalysis as mda

u_gemmi = mda.Universe("1f0u_solv.pdb", format="pdb_gemmi")
u_pdb = mda.Universe("1f0u_solv.pdb")

print("legacy pdb:", len(u_pdb.select_atoms("resname HOH").residues))
print("gemmi:", len(u_gemmi.select_atoms("resname HOH").residues))

# legacy pdb: 10134
# gemmi: 10948

seemingly, gemmi backend gives correct results in this case.

Could you confirm it solves your issue?

marinegor avatar Nov 09 '25 19:11 marinegor

@marinegor Thank you. There seems to be 814 HOH residues with hex resids, so it checks out. And Gemmi seems to use hybrid-36, so probably can consider this issue as solved.

As a side note, I'm not sure why those residues are missing in legacy pdb of your code . They are supposed to exist but only with resid = 1. Maybe AtomGroup.residues get residues by resid. Originally the resid would be 1 as https://github.com/MDAnalysis/mdanalysis/blob/c4af00babc9fc69671f1ea182b8b3a9b8807230d/package/MDAnalysis/topology/PDBParser.py#L302-L304 https://github.com/MDAnalysis/mdanalysis/blob/c4af00babc9fc69671f1ea182b8b3a9b8807230d/package/MDAnalysis/topology/PDBParser.py#L308-L311 alphabet in the resid section would raise error and set resid to 1. So, checking HOH residues with resid == 1 would be more appropriate if you want to add it to the test data.

rladbtls1998 avatar Nov 11 '25 09:11 rladbtls1998

@rladbtls1998 thanks, I've added resid == 1 test as well, and requested review by fellow devs.

Is there a chance you could look at #5142 and its documentation, and give your feedback on if you think the changes have been properly documented?

marinegor avatar Nov 11 '25 09:11 marinegor

@marinegor Yes I think so, while I only looked at parts related to PDBParser.

It would be also good to explicitly mention or suggest to use the gemmi MMCIF parser for pdb files with many number of residues (large systems or explicit solvent), because users may not know that parsing was wrong and blame other parts; I initially thought it was error in wrapping, since it was where I noticed something was wrong.

rladbtls1998 avatar Nov 12 '25 02:11 rladbtls1998

@orbeckst I'd suggest we change behavior from warning to raising here, and in exception message mention that people should use gemmi for these cases, also adding option to use pdb_legacy reader if you want to explicitly restore old behavior.

marinegor avatar Nov 14 '25 10:11 marinegor