alphafold Improper chain IDs when running Multimer model

When I've predicted some multimer structures, I've noticed that the chain IDs begin with B rather than A.

I was able to do a small fix in the code, but it might not be the best place for this change:

Line 265 in alphafold/common/protein.py: chain_index = _maybe_remove_leading_dim(features['asym_id']) -> chain_index = _maybe_remove_leading_dim(features['asym_id'])-1

Nov 12 '21 15:11 nzrandol

https://github.com/deepmind/alphafold/blob/c128d1aa2c21407fbe51d3cd87b85d4c5942056d/alphafold/data/pipeline_multimer.py#L144 ~Changing this line may be better.~ (Edit: Changing codes in early steps introduces bugs easily & other ids also start from 1, thus it may be not so good. Anyway the source of the phenomena is the line.) ↓is my account for daytime work.

Jan 29 '22 02:01 yamule

I was going to model the asymmetric unit of 6gs2 https://www.rcsb.org/structure/6gs2 . It has 4 chains, A, B, C, and D. Amino acid sequences of A and C are 100 % identical. & Amino acid sequences of B and D are 100 % identical, too.

I tried to process the resulting output .pdb file, & encountered error that chains have unexpected amino acid sequences.

The asym_id, which is used to label the chains in output file is defined in add_assembly_features function. https://github.com/deepmind/alphafold/blob/9c4ac8a92125942f73813649d9f6885532c1ee97/alphafold/data/pipeline_multimer.py#L119

Around here.

  for chain_id, chain_features in all_chain_features.items():
    seq = str(chain_features['sequence'])
    if seq not in seq_to_entity_id:
      seq_to_entity_id[seq] = len(seq_to_entity_id) + 1
    grouped_chains[seq_to_entity_id[seq]].append(chain_features)

  new_all_chain_features = {}
  chain_id = 1
  for entity_id, group_chain_features in grouped_chains.items():
    for sym_id, chain_features in enumerate(group_chain_features, start=1):
      new_all_chain_features[
          f'{int_id_to_str_id(entity_id)}_{sym_id}'] = chain_features
      seq_length = chain_features['seq_length']
      chain_features['asym_id'] = chain_id * np.ones(seq_length)
      chain_features['sym_id'] = sym_id * np.ones(seq_length)
      chain_features['entity_id'] = entity_id * np.ones(seq_length)
      chain_id += 1

The order of chains become A, C, B, and D because chain_id is set sequentially. And original A, C, B, and D chains are labeled as B, C, D, and E in the output .pdb files, respectively.

My input fasta is

>M6gs2_2_A A
MHELTIYHFMSDKLNLYSDIGNIIALRQRAKKRNIKVNVVEINETEGITFDECDIFFIGGGSDREQALATKELSKIKTPLKEAIEDGMPGLTICGGYQFLGKKYITPDGTELEGLGILDFYTESKTNRLTGDIVIESDTFGTIVGFENHGGRTYHDFGTLGHVTFGYGNNDEDKKEGIHYKNLLGTYLHGPILPKNYEITDYLLEKACERKGIPFEPKEIDNEAEIQAKQVLIDRANRQKKSRLEHHHHHH
>M6gs2_2_B B
MRQWTAIHLAKLARKASRAVGKRGTDLPGQIARKVDTDVLRKLAEQVDDIVFISGTNGKTTTSNLIGHTLKANNIQIIHNNEGANMAAGITSAFIMQSTPKTKIAVIEIDEGSIPRVLKEVTPSMMVFTNFFRDQMDRFGEIDIMVNNIAETISNKGIKLLLNADDPFVSRLKIASDTIVYYGMKAHAHEFEQSTMNESRYCPNCGRLLQYDYIHYNQIGHYHCQCGFKREQAKYEISSFDVAPFLYLNINDEKYDMKIAGDFNAYNALAAYTVLRELGLNEQTIKNGFETYTSDNGRMQYFKKERKEAMINLAKNPAGMNASLSVGEQLEGEKVYVISLNDNAADGRDTSWIYDADFEKLSKQQIEAIIVTGTRAEELQLRLKLAEVEVPIIVERDIYKATAKTMDYKGFTVAIPNYTSLAPMLEQLNRSFEGGQS
>M6gs2_2_C C
MHELTIYHFMSDKLNLYSDIGNIIALRQRAKKRNIKVNVVEINETEGITFDECDIFFIGGGSDREQALATKELSKIKTPLKEAIEDGMPGLTICGGYQFLGKKYITPDGTELEGLGILDFYTESKTNRLTGDIVIESDTFGTIVGFENHGGRTYHDFGTLGHVTFGYGNNDEDKKEGIHYKNLLGTYLHGPILPKNYEITDYLLEKACERKGIPFEPKEIDNEAEIQAKQVLIDRANRQKKSRLEHHHHHH
>M6gs2_2_D D
MRQWTAIHLAKLARKASRAVGKRGTDLPGQIARKVDTDVLRKLAEQVDDIVFISGTNGKTTTSNLIGHTLKANNIQIIHNNEGANMAAGITSAFIMQSTPKTKIAVIEIDEGSIPRVLKEVTPSMMVFTNFFRDQMDRFGEIDIMVNNIAETISNKGIKLLLNADDPFVSRLKIASDTIVYYGMKAHAHEFEQSTMNESRYCPNCGRLLQYDYIHYNQIGHYHCQCGFKREQAKYEISSFDVAPFLYLNINDEKYDMKIAGDFNAYNALAAYTVLRELGLNEQTIKNGFETYTSDNGRMQYFKKERKEAMINLAKNPAGMNASLSVGEQLEGEKVYVISLNDNAADGRDTSWIYDADFEKLSKQQIEAIIVTGTRAEELQLRLKLAEVEVPIIVERDIYKATAKTMDYKGFTVAIPNYTSLAPMLEQLNRSFEGGQS

.

The contents of msas/chain_id_map.json is

{
    "A": {
        "description": "M6gs2_2_A A",
        "sequence": "MHELTIYHFMSDKLNLYSDIGNIIALRQRAKKRNIKVNVVEINETEGITFDECDIFFIGGGSDREQALATKELSKIKTPLKEAIEDGMPGLTICGGYQFLGKKYITPDGTELEGLGILDFYTESKTNRLTGDIVIESDTFGTIVGFENHGGRTYHDFGTLGHVTFGYGNNDEDKKEGIHYKNLLGTYLHGPILPKNYEITDYLLEKACERKGIPFEPKEIDNEAEIQAKQVLIDRANRQKKSRLEHHHHHH"
    },
    "B": {
        "description": "M6gs2_2_B B",
        "sequence": "MRQWTAIHLAKLARKASRAVGKRGTDLPGQIARKVDTDVLRKLAEQVDDIVFISGTNGKTTTSNLIGHTLKANNIQIIHNNEGANMAAGITSAFIMQSTPKTKIAVIEIDEGSIPRVLKEVTPSMMVFTNFFRDQMDRFGEIDIMVNNIAETISNKGIKLLLNADDPFVSRLKIASDTIVYYGMKAHAHEFEQSTMNESRYCPNCGRLLQYDYIHYNQIGHYHCQCGFKREQAKYEISSFDVAPFLYLNINDEKYDMKIAGDFNAYNALAAYTVLRELGLNEQTIKNGFETYTSDNGRMQYFKKERKEAMINLAKNPAGMNASLSVGEQLEGEKVYVISLNDNAADGRDTSWIYDADFEKLSKQQIEAIIVTGTRAEELQLRLKLAEVEVPIIVERDIYKATAKTMDYKGFTVAIPNYTSLAPMLEQLNRSFEGGQS"
    },
    "C": {
        "description": "M6gs2_2_C C",
        "sequence": "MHELTIYHFMSDKLNLYSDIGNIIALRQRAKKRNIKVNVVEINETEGITFDECDIFFIGGGSDREQALATKELSKIKTPLKEAIEDGMPGLTICGGYQFLGKKYITPDGTELEGLGILDFYTESKTNRLTGDIVIESDTFGTIVGFENHGGRTYHDFGTLGHVTFGYGNNDEDKKEGIHYKNLLGTYLHGPILPKNYEITDYLLEKACERKGIPFEPKEIDNEAEIQAKQVLIDRANRQKKSRLEHHHHHH"
    },
    "D": {
        "description": "M6gs2_2_D D",
        "sequence": "MRQWTAIHLAKLARKASRAVGKRGTDLPGQIARKVDTDVLRKLAEQVDDIVFISGTNGKTTTSNLIGHTLKANNIQIIHNNEGANMAAGITSAFIMQSTPKTKIAVIEIDEGSIPRVLKEVTPSMMVFTNFFRDQMDRFGEIDIMVNNIAETISNKGIKLLLNADDPFVSRLKIASDTIVYYGMKAHAHEFEQSTMNESRYCPNCGRLLQYDYIHYNQIGHYHCQCGFKREQAKYEISSFDVAPFLYLNINDEKYDMKIAGDFNAYNALAAYTVLRELGLNEQTIKNGFETYTSDNGRMQYFKKERKEAMINLAKNPAGMNASLSVGEQLEGEKVYVISLNDNAADGRDTSWIYDADFEKLSKQQIEAIIVTGTRAEELQLRLKLAEVEVPIIVERDIYKATAKTMDYKGFTVAIPNYTSLAPMLEQLNRSFEGGQS"
    }
}

.

I uploaded rank_0.pdb here. https://gist.github.com/t-oda-ic/dc3cf683ce596dd57763a5af77d764b5

In summary, the original chain information is completely missing in the output.

I can solve this problem just parsing amino acid sequences in the output pdb file but I'd appreciate if official script could provide some information for input-output mapping.

Feb 17 '22 09:02 t-oda-ic

Concerning @nzrandol's initial report, I'm thinking that the issue may come from https://github.com/deepmind/alphafold/blob/18e12d61314214c51ca266d192aad3cc6619018a/alphafold/common/protein.py#L177 which is involved in the writing of the PDB file on disk.

Note that, in the lines referred to above by @yamule, https://github.com/deepmind/alphafold/blob/c128d1aa2c21407fbe51d3cd87b85d4c5942056d/alphafold/data/pipeline_multimer.py#L143-L153 although numerical chain IDs start at 1, the int_id_to_str_id function maps 1 -> A, 2 -> B, .... And, indeed, chain IDs in alphabetical format are correct in the intermediate pickle files written to disk by alphafold. On the other hand, in the code from common/protein.py one can see that the mapping back from ints to letters (for writing on the PDB) is given by 0 -> A, 1 -> B, .... By changing to

 chain_ids[i] = PDB_CHAIN_IDS[i - 1]

I was able to solve the issue in my examples. It would be great if people could weigh in on whether my interpretation is correct, and whether this should be fixed in the codebase.

OTOH, @t-oda-ic's issue seems different. Does it stem from the fact that, for 100% identical sequences, https://github.com/deepmind/alphafold/blob/c128d1aa2c21407fbe51d3cd87b85d4c5942056d/alphafold/data/pipeline_multimer.py#L139 will cause a failure to create an entry for seq_to_entity_id for one of the two identical copies?

Feb 21 '23 10:02 ulupo

I noticed that at some point when running different versions of AF with different set of parameters -- I was wrongly assuming it is just "due to some bad AF version, otherwise why some of my output has correct chains ABCD..., but others are affected by that off-by-one bug? Why majority of AF users doesn't notice that?".

So for now I figured, that it is the relaxation step, where chain IDs are fixed for major amount of users. If you enable AF relaxation, you can immediately compare unrelaxed and ranked PDBs: the former will have off-by-one chains, the latter will have correct chains starting with "A".

So in the out-of-the-box AlphaFold setup scenario there are actually two side effects

off-by-one bug in AlphaFold PDB writer already mentioned above in this GitHub issue thread
side effect in relaxation step

that together cause the Chain IDs to look correct in the final PDB file.

This side effect is due to (accidentally or not) another PDB write attempt using OpenMM function omitting one of the optional parameters keepIds, which falls back to default False value.

https://github.com/google-deepmind/alphafold/blob/032e2f26732d1473ec5db7ff5b8572754576e459/alphafold/relax/amber_minimize.py#L113-L117

which defaults to keepIds=False: https://github.com/openmm/openmm/blob/122bbe40407ca281d6efa1bf98fbcf148a9c4d24/wrappers/python/openmm/app/pdbfile.py#L282-L286

        keepIds : bool=False
            If True, keep the residue and chain IDs specified in the Topology
            rather than generating new ones.  Warning: It is up to the caller to
            make sure these are valid IDs that satisfy the requirements of the
            PDB format.  Otherwise, the output file will be invalid.

https://github.com/openmm/openmm/blob/122bbe40407ca281d6efa1bf98fbcf148a9c4d24/wrappers/python/openmm/app/pdbfile.py#L358-L361

if keepIds and len(chain.id) == 1:
    chainName = chain.id
else:
    chainName = chr(ord('A')+chainIndex%26)

@t-oda-ic, I'd like to clarify one line of yours:

In summary, the original chain information is completely missing in the output.

That is not technically true and we can use it to our benefit. Fact is that we lost information about the chain names, but due to alphabet-based naming we preserve information about the order of the chains.

And this is exactly how keepIds=False kinda "restores" expected chain IDs and patches the off-by-one error in the original AF module.

Mar 06 '24 18:03 lildeadprince

alphafold alphafold copied to clipboard

Improper chain IDs when running Multimer model

alphafold
alphafold copied to clipboard