pdb-tools icon indicating copy to clipboard operation
pdb-tools copied to clipboard

pdb_selaltloc removes residues with (only) alternate locations

Open mgiulini opened this issue 3 years ago • 25 comments

Describe the bug pdb_selaltloc removes residues in which all atoms show alternate locations

To Reproduce

grep -rn pdb4xoj.pdb -e "SER A  85"
pdb_selaltloc pdb4xoj.pdb > pdb4xoj-occ.pdb
grep -rn pdb4xoj-occ.pdb -e "SER A  85"

Expected behavior pdb_selaltloc should not delete the residue but rather pick one of the two available alternate locations

Desktop (please complete the following information):

  • OS: Linux
  • Python Version 3.9

pdb4xoj.pdb.zip

mgiulini avatar Dec 06 '22 11:12 mgiulini

Thanks for this @mgiulini I will try to address it ASAP. Cheers,

joaomcteixeira avatar Dec 11 '22 17:12 joaomcteixeira

thank you @joaomcteixeira I just linked the corresponding issue in the artic3d repo

mgiulini avatar Dec 12 '22 07:12 mgiulini

Hi, sorry for the delay in addressing this issue. I have been a bit overwhelmed on other fronts. I will address it in the next few weeks. It seems something that needs to be inspected carefully.

joaomcteixeira avatar Feb 08 '23 09:02 joaomcteixeira

hi Joao! no worries, I could actually try to give a look into it by myself, as I would like to use pdb_selaltloc in another application, what do you say?

mgiulini avatar Feb 08 '23 09:02 mgiulini

This is a serious issue that should be solved.

amjjbonvin avatar Feb 08 '23 09:02 amjjbonvin

I'll look into it!

mgiulini avatar Feb 08 '23 09:02 mgiulini

Go for it. Look carefully at the way @JoaoRodrigues and I designed the tests. I suggest you first write the tests for this new case and then try to correct the code for that. Be sure first that the PDB is formatted well. Some issues with PDBs are unsolvable because the PDB itself is wrongly created. Best,

joaomcteixeira avatar Feb 08 '23 09:02 joaomcteixeira

I understood the problem, but I think it would take ages for me to solve it without breaking the code, as the algorithm is very complicated.

the problem occurs when flush_resloc_occ is called https://github.com/haddocking/pdb-tools/blob/2a070bbacee9d6608b44bb6d2f749beefd6a7690/pdbtools/pdb_selaltloc.py#L302-L318

You can clearly see from the code from the code that if the input dictionary altloc_lines has a key with highest possible occupancy (e.g., an atom with no alternate location, occupancy = 1.0), then the function will ignore all the other lines, no matter their occupancy. As an example, residues LYS84 and SER85 of the pdb 4xoj give rise to the following lines

{' ': ['ATOM    536  N   LYS A  84      -4.827  -2.055  19.735  1.00  9.68           N  \n', ... some other coordinates], 
'A': ['ATOM    537  CA ALYS A  84      -4.644  -3.431  19.279  0.70 10.05           C  \n', ... some other coordinates],
'B': ['ATOM    538  CA BLYS A  84      -4.643  -3.451  19.329  0.30 10.14           C  \n', ... some other coordinates]}


according to the function, the first key (' ') is selected and the others are simply ignored.

Possible solutions could be:

  • introduce an if clause in this function to always preserve the atoms with 1.0 (key = " "), and check only among the other keys which is the one has the highest occ
  • drop the flush_func_multi_residues logic and read the residues sequentially (according to their number)

I tried the second option, but it broke the tests..

mgiulini avatar Feb 08 '23 11:02 mgiulini

I would say the first option sounds better. No problem accommodating the if clause.

Also, feel free to adventure yourself rewriting the inner code if you feel so :smiling_imp: . As long as the already implemented test cases pass and the API itself is kept. But first, i suggest trying the if-clause to see what happens. Good luck and many thanks

joaomcteixeira avatar Feb 08 '23 11:02 joaomcteixeira

Thanks for all this input Marco. Addressing it in steps:

1)

according to the old implementation the residue ILE 25 should be excluded, but in my opinion that's wrong. (in #154 coment)

I revisited the tests, and they are correct to my knowledge on altlocs. ILE 25 is excluded if users don't give any option because it has the lowest occupancy. To select the ILE, the command has a trick; users have to specify the space:

pdb_selaltloc -' ' tests/data/dummy_altloc.pdb

In this way, ILE 25 is selected instead of LEU 25. Btw, occupancy should add up to one, and in the example, dummy numbers are used (0.61 and 0.90).

2)

There are several errors associated with this PDB when using pdb_selaltloc. On the one hand, SER 85 disappears. On the other hand, several atoms from other residues also disappear if we give -A or -B as options.

This PDB has an altloc formatting I have never seen before, that is: Atoms with alternate locations do NOT have the altloc identifier of the default. For example, the default is ' ' (space) for single location atom, while altlocs use directly to A and B and no ' '. I had never seen this before and completely overlooked this possibility when I developed the current version of selaltloc.

My question: Is this a correct altloc formatting? @JoaoRodrigues @amjjbonvin need your help with this.

If yes, I will need to rewrite the script considerably. It's not a problem, but I want to double-check with you before doing it.

Cheers!

joaomcteixeira avatar Mar 27 '23 23:03 joaomcteixeira

This is a weird one… But if it comes from the PDB then we should ideally handle it.

My question: Is this a correct altloc formatting? @JoaoRodrigues https://github.com/JoaoRodrigues @amjjbonvin https://github.com/amjjbonvin need your help with this.

amjjbonvin avatar Mar 28 '23 05:03 amjjbonvin

As usual, I went to bed and woke up thinking about it. I am tempted to say it is complicated (or even impossible) to cover this scenario in the current pdb_selaltloc implementation without twisting the algorithm. The current algorithm is not complex. It's an engine that relies on identifying when a new altloc group appears code. But yesterday I couldn't figure out a way to cover this case cleanly and without adding patch code here and there.

I am tempted to rewrite the whole script and return it to something similar to the original implementation: first collect, then flush; contrarily, to flush-on-the-fly as it is now. Obviously, considering all the new tests and edge cases encountered in the last years that contributed to this change in the algorithm.

I will keep you posted. I will work on this today.

joaomcteixeira avatar Mar 28 '23 07:03 joaomcteixeira

Thanks for all this input Marco. Addressing it in steps:

1)

according to the old implementation the residue ILE 25 should be excluded, but in my opinion that's wrong. (in #154 coment)

I revisited the tests, and they are correct to my knowledge on altlocs. ILE 25 is excluded if users don't give any option because it has the lowest occupancy. To select the ILE, the command has a trick; users have to specify the space:

pdb_selaltloc -' ' tests/data/dummy_altloc.pdb

In this way, ILE 25 is selected instead of LEU 25. Btw, occupancy should add up to one, and in the example, dummy numbers are used (0.61 and 0.90).

Hi there! not sure I agree about this: ILE 25 does not show any alternate locations, right? in my opinion it should be kept in the pdb by pdb_selaltloc, since in this case there's probably only a numbering issue..there're examples where two alternate locations correspond to different amino acids (such as residue 11 of https://www.ebi.ac.uk/pdbe/entry-files/pdb1alx.ent), but in that case you have different location labels (ATYR and BTRP). I would not remove full residues just because at the same residue ID there's another amino acid with some alternate locations. But this is up to your design choices of course:)

2)

There are several errors associated with this PDB when using pdb_selaltloc. On the one hand, SER 85 disappears. On the other hand, several atoms from other residues also disappear if we give -A or -B as options.

This PDB has an altloc formatting I have never seen before, that is: Atoms with alternate locations do NOT have the altloc identifier of the default. For example, the default is ' ' (space) for single location atom, while altlocs use directly to A and B and no ' '. I had never seen this before and completely overlooked this possibility when I developed the current version of selaltloc.

have you checked my solution in https://github.com/haddocking/pdb-tools/pull/154? probably not super elegant, but it might be a good starting point

mgiulini avatar Mar 28 '23 07:03 mgiulini

Is there? Or are both LEU25 and ILE25 at the same position in space? This calls for some visual inspection

Hi there! not sure I agree about this: ILE 25 does not show any alternate locations, right? in my opinion it should be kept in the pdb by pdb_selaltloc, since in this case there's probably only a numbering issue..there're examples where two

amjjbonvin avatar Mar 28 '23 08:03 amjjbonvin

It is there indeed (ILE25 and LEU25 are almost superimposed), but in the original file https://files.rcsb.org/view/3U7T.pdb ILE25 has the alternate location label A, so basically there are AILE and BLEU at position 25. It is in this case that it makes sense to select only one of them.

mgiulini avatar Mar 28 '23 08:03 mgiulini

We are talking about the test case

The way I see it, ILE 25 and LEU 25 are two options of the same alt loc group

https://github.com/haddocking/pdb-tools/blob/2a070bbacee9d6608b44bb6d2f749beefd6a7690/tests/data/dummy_altloc.pdb#L34-L49

It is the same as for ASER 22 and BPRO 22 at the beginning of the file, but in this case, the first alternative location is .

have you checked my solution in #154? Probably not super elegant, but it might be a good starting point

Yes, I saw. Thanks. But it's not clean in the sense of the implemented algorithm because it is a patch code trying to overcome a problem at the is_another_altloc_group function. And, it breaks two tests. We need something that is clean from scratch. I will try to do it today.

joaomcteixeira avatar Mar 28 '23 11:03 joaomcteixeira

The way I see it, ILE 25 and LEU 25 are two options of the same alt loc group

here is the question: should we consider the simple space " " an alternative location? I would not do that, but it's a design choice.

have you checked my solution in #154? Probably not super elegant, but it might be a good starting point

Yes, I saw. Thanks. But it's not clean in the sense of the implemented algorithm because it is a patch code trying to overcome a problem at the is_another_altloc_group function. And, it breaks two tests. We need something that is clean from scratch. I will try to do it today.

sure, as you wish

mgiulini avatar Mar 28 '23 11:03 mgiulini

I think there's some confusion about the heterogeneity possible in PDB files. Altlocs, specifically and by design, are meant to represent multiple instances of the same atom (same chain, resnum, inscode, atom name) with different occupancies. Indeed, the spec says that if there are multiple altlocs, they all should have non-blank altloc letters, but this is clearly not the case in several files in the wild.

In this case specifically that you mention Marco, I would consider these two instances of the same residue. Basically, my guess is that the authors couldn't figure out it this is a ILE or a LEU and decided to put both possibilities in.

My solution for this is to treat those ILE and LEU as different conformations of the same residue (they do have partial occupancies that add to 1) and simply pick the one with highest occupancy. If you abuse the format, well, tough luck.. We should identify residues as (chain, resnum, inscode), which is what the spec says.

I'll chat with João offline about a good implementation.

JoaoRodrigues avatar Mar 28 '23 12:03 JoaoRodrigues

Hi Joao, yes, your solution makes sense..if the user submits a pdb with two amino acids having same chain-resnum..well, there's nothing you can do about it the important thing is that the code should never delete a full residue (such as SER A 85 in the example) but rather pick one of the available altlocs

mgiulini avatar Mar 29 '23 08:03 mgiulini

We are having some discussions on slack, but this one is relevant to register here:

In the case PDB @mgiulini sent. For example, for SER 85, altlocs A and B have the same occupancy of 0.50. What should we do for these cases when users run pdb_selaltloc without specifying a selection option, that is, select by occupancy?

  1. Should we select the first conformation?
  2. Should we select one of the conformations randomly?
  3. Should we output both and maintain the altloc character? (note altloc characters are deleted after selaltloc)

joaomcteixeira avatar Mar 29 '23 13:03 joaomcteixeira

Pick the first.

Message ID: @.***>

JoaoRodrigues avatar Mar 29 '23 13:03 JoaoRodrigues

Simply select the 1st one

amjjbonvin avatar Mar 29 '23 15:03 amjjbonvin