vermouth-martinize icon indicating copy to clipboard operation
vermouth-martinize copied to clipboard

problem parsing `MODEL` entry pdb header

Open vaneerdf opened this issue 4 years ago • 11 comments

Martinize fails at some MODEL entries in the pdb header. I did not figure out the exact pattern yet, but if the model number consists out of only a single digit, and is located at column 14, martinize fails.

e.g. MODEL 9 gives an error, but MODEL 1 works. however also MODEL 11 fails.

vaneerdf avatar Nov 05 '20 07:11 vaneerdf

Thanks for the report. Could you post a traceback of the error at MODEL 9 and MODEL 11? And also the corresponding lines in the pdb file?

pckroon avatar Nov 05 '20 09:11 pckroon

After seeing the files, it turned out to be caused by the fixed width file format: for e.g. MODEL 50001 it would only parse 001, resulting in the model being set wrong. (I didn't count columns for the example).

My suggestions for this are twofold:

  1. Relax the MODEL parsing method to simply take everything after MODEL as the model number, rather than just the specified columns.
  2. Always parse the first model in the PDB file, and if no model index is specified/requested, return that model. Or just always return single models, even if the model number doesn't match.

@jbarnoud @fgrunewald thoughts/input?

pckroon avatar Nov 11 '20 10:11 pckroon

The problem you describe looks like a misformatted input file. I would suggest issuing a warning when a misformatted line is detected (with a tolerance for when the model index is missing), error when modelidx is provided but is ambiguous, and add a modelserial that ignores the model index provided in the file, but just counts how many models have been seen so far.

jbarnoud avatar Nov 11 '20 10:11 jbarnoud

I think I agree with Jon on this. Although I am split. It will be a problem when you reach the point that in order to run martinize2 you'll have to have set like 15 flags and ignore maxwarning. This will be quite a problem for any automated script and high throughput methods. But the point here really is how common is this issue. If it is very common I'd say relax the parser and be gentle to the users. If this is a 1 in a 1000 issue then probably warning is the better way to go. Related to this are there situations when we expect there to be another column behind the model idx?

fgrunewald avatar Nov 12 '20 08:11 fgrunewald

The problem you describe looks like a misformatted input file.

Absolutely.

when modelidx is provided but is ambiguous

I don't see how the modelidx can be ambiguous. If you give one that's the model you want, and no other. If you give no modelidx you don't really care which model you want, and we give you the first (?)

... add a modelserial that...

Why? I don't see the added value.

But the point here really is how common is this issue.

That's one side of it. The other is whether there's any ambiguity about what the user wants/expects to happen. If I give M2 a pdb with a single model (number 1234124231), I expect it to read that without having to bother with CLI flags, unless I explicitly say I want model number 5.

Related to this are there situations when we expect there to be another column behind the model idx?

We don't expect model numbers to span more than 3 columns to begin with, because that's what PDB spec allows. We could add some acrobatics by saying we take model = int(line[len('MODEL'):].strip()), and if that doesn't work we take model = int(line[10:14]) (or whatever). And if that also doesn't work we give up.

Here are the different cases I can think of, and what I think the behaviour should be:

CLI model PDB models Resulting pdb model
1 1 1
2 1 2 3 2
5 1 2 3 None
* None None
None 1 1
None 1 2 3 1?
None 1000 1000
None 1001 1000 1002 1001?
None 1 1 1 1 ???

The only thing I'm not completely sure about is what to do if you don't ask for an explicit model number, but the PDB defines multiple. I think that it should just give the first model it reads and print an informative INFO statement.

I don't think it's worthwhile to start fully validating the pdb files we read and issue warnings for every little thing that's off. For example, officially the occupancy, temperature factor, and the element fields are obligatory but not always provided by e.g. ligpargen.

Note that with the current parser we can't first read all the model lines and decide based on that. The parser skips ATOM lines for models with the wrong number. Another case this change may encounter is the case where there are multiple models with number N. Definitely misformatted, definitely ambiguous, but this change proposes to make the first model encountered different from all the others so we could decide that it's special enough to be returned with a warning. Or we could concatenate all those atom records and treat them as being the same model, or ...

pckroon avatar Nov 12 '20 11:11 pckroon

I don't see how the modelidx can be ambiguous. If you give one that's the model you want, and no other. If you give no modelidx you don't really care which model you want, and we give you the first (?)

If the line[xx:yy] != line.split()[1] then the model record is not formatted correctly, and if the model idx is 0-padded it is very likely to have been truncated, because of how PDB files truncate the most significant digits when an index goes out of range.

... add a modelserial that...

Why? I don't see the added value.

If the modelidx is ambiguous, then some models cannot be accessed by specifying a modelidx. The modelidx is 4 columns in the PDB spec, so if I want to access model 100000, it will be written 0000 in the file and I cannot have a match. But I can access it using a modelserial.

It also fixes the issue of manually edited PDB files that would have multiple models without a modelidx, or with miltuple model bearing the same moldelidx.

Note that, in an ideal PDB file, the modelidx is a serial already: the issue here is with odly formated PDB files.

For the record: https://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#MODEL

jbarnoud avatar Nov 12 '20 11:11 jbarnoud

Ok, so the question becomes; when the user from the CLI says they want model 3, do they mean modelidx 3 or model serial 3? And most users will nog be aware of the exact pdb format and the associated truncation. My goal is not to detect when and how PDB files are misformatted, but rather to work around it as best as possible, so long as it's unambiguous. Once it becomes ambiguous it should be either a warning, or even an error.

IF misformatted_model AND model_idx_requested:
    IF 1 model in pdb AND found_model == requested_model:
        WARNING
    ELSE:
        ERROR
ELSE
    INFO(parsed model x)

?

pckroon avatar Nov 13 '20 14:11 pckroon

Works for me.

jbarnoud avatar Nov 13 '20 14:11 jbarnoud

Thanks that you are all looking into it!

vaneerdf avatar Nov 16 '20 04:11 vaneerdf

@pckroon has this been solved?

fgrunewald avatar Mar 08 '23 20:03 fgrunewald

No, I never got around to implementing this.

pckroon avatar Mar 13 '23 15:03 pckroon