vermouth-martinize
vermouth-martinize copied to clipboard
problem parsing `MODEL` entry pdb header
Martinize fails at some MODEL
entries in the pdb header. I did not figure out the exact pattern yet, but if the model number consists out of only a single digit, and is located at column 14, martinize fails.
e.g. MODEL 9
gives an error, but MODEL 1
works.
however also MODEL 11
fails.
Thanks for the report. Could you post a traceback of the error at MODEL 9 and MODEL 11? And also the corresponding lines in the pdb file?
After seeing the files, it turned out to be caused by the fixed width file format: for e.g. MODEL 50001
it would only parse 001
, resulting in the model being set wrong. (I didn't count columns for the example).
My suggestions for this are twofold:
- Relax the MODEL parsing method to simply take everything after
MODEL
as the model number, rather than just the specified columns. - Always parse the first model in the PDB file, and if no model index is specified/requested, return that model. Or just always return single models, even if the model number doesn't match.
@jbarnoud @fgrunewald thoughts/input?
The problem you describe looks like a misformatted input file. I would suggest issuing a warning when a misformatted line is detected (with a tolerance for when the model index is missing), error when modelidx
is provided but is ambiguous, and add a modelserial
that ignores the model index provided in the file, but just counts how many models have been seen so far.
I think I agree with Jon on this. Although I am split. It will be a problem when you reach the point that in order to run martinize2 you'll have to have set like 15 flags and ignore maxwarning. This will be quite a problem for any automated script and high throughput methods. But the point here really is how common is this issue. If it is very common I'd say relax the parser and be gentle to the users. If this is a 1 in a 1000 issue then probably warning is the better way to go. Related to this are there situations when we expect there to be another column behind the model idx?
The problem you describe looks like a misformatted input file.
Absolutely.
when
modelidx
is provided but is ambiguous
I don't see how the modelidx can be ambiguous. If you give one that's the model you want, and no other. If you give no modelidx you don't really care which model you want, and we give you the first (?)
... add a
modelserial
that...
Why? I don't see the added value.
But the point here really is how common is this issue.
That's one side of it. The other is whether there's any ambiguity about what the user wants/expects to happen. If I give M2 a pdb with a single model (number 1234124231), I expect it to read that without having to bother with CLI flags, unless I explicitly say I want model number 5.
Related to this are there situations when we expect there to be another column behind the model idx?
We don't expect model numbers to span more than 3 columns to begin with, because that's what PDB spec allows. We could add some acrobatics by saying we take model = int(line[len('MODEL'):].strip())
, and if that doesn't work we take model = int(line[10:14])
(or whatever). And if that also doesn't work we give up.
Here are the different cases I can think of, and what I think the behaviour should be:
CLI model | PDB models | Resulting pdb model |
---|---|---|
1 | 1 | 1 |
2 | 1 2 3 | 2 |
5 | 1 2 3 | None |
* | None | None |
None | 1 | 1 |
None | 1 2 3 | 1? |
None | 1000 | 1000 |
None | 1001 1000 1002 | 1001? |
None | 1 1 1 1 | ??? |
The only thing I'm not completely sure about is what to do if you don't ask for an explicit model number, but the PDB defines multiple. I think that it should just give the first model it reads and print an informative INFO statement.
I don't think it's worthwhile to start fully validating the pdb files we read and issue warnings for every little thing that's off. For example, officially the occupancy, temperature factor, and the element fields are obligatory but not always provided by e.g. ligpargen.
Note that with the current parser we can't first read all the model lines and decide based on that. The parser skips ATOM lines for models with the wrong number. Another case this change may encounter is the case where there are multiple models with number N. Definitely misformatted, definitely ambiguous, but this change proposes to make the first model encountered different from all the others so we could decide that it's special enough to be returned with a warning. Or we could concatenate all those atom records and treat them as being the same model, or ...
I don't see how the modelidx can be ambiguous. If you give one that's the model you want, and no other. If you give no modelidx you don't really care which model you want, and we give you the first (?)
If the line[xx:yy] != line.split()[1]
then the model record is not formatted correctly, and if the model idx is 0-padded it is very likely to have been truncated, because of how PDB files truncate the most significant digits when an index goes out of range.
... add a modelserial that...
Why? I don't see the added value.
If the modelidx is ambiguous, then some models cannot be accessed by specifying a modelidx. The modelidx is 4 columns in the PDB spec, so if I want to access model 100000, it will be written 0000 in the file and I cannot have a match. But I can access it using a modelserial.
It also fixes the issue of manually edited PDB files that would have multiple models without a modelidx, or with miltuple model bearing the same moldelidx.
Note that, in an ideal PDB file, the modelidx is a serial already: the issue here is with odly formated PDB files.
For the record: https://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#MODEL
Ok, so the question becomes; when the user from the CLI says they want model 3, do they mean modelidx 3 or model serial 3? And most users will nog be aware of the exact pdb format and the associated truncation. My goal is not to detect when and how PDB files are misformatted, but rather to work around it as best as possible, so long as it's unambiguous. Once it becomes ambiguous it should be either a warning, or even an error.
IF misformatted_model AND model_idx_requested:
IF 1 model in pdb AND found_model == requested_model:
WARNING
ELSE:
ERROR
ELSE
INFO(parsed model x)
?
Works for me.
Thanks that you are all looking into it!
@pckroon has this been solved?
No, I never got around to implementing this.