alphafold
alphafold copied to clipboard
Improper chain IDs when running Multimer model
When I've predicted some multimer structures, I've noticed that the chain IDs begin with B rather than A.
I was able to do a small fix in the code, but it might not be the best place for this change:
Line 265 in alphafold/common/protein.py: chain_index = _maybe_remove_leading_dim(features['asym_id']) -> chain_index = _maybe_remove_leading_dim(features['asym_id'])-1
https://github.com/deepmind/alphafold/blob/c128d1aa2c21407fbe51d3cd87b85d4c5942056d/alphafold/data/pipeline_multimer.py#L144 ~Changing this line may be better.~ (Edit: Changing codes in early steps introduces bugs easily & other ids also start from 1, thus it may be not so good. Anyway the source of the phenomena is the line.) ↓is my account for daytime work.
I was going to model the asymmetric unit of 6gs2 https://www.rcsb.org/structure/6gs2 . It has 4 chains, A, B, C, and D. Amino acid sequences of A and C are 100 % identical. & Amino acid sequences of B and D are 100 % identical, too.
I tried to process the resulting output .pdb file, & encountered error that chains have unexpected amino acid sequences.
The asym_id, which is used to label the chains in output file is defined in add_assembly_features function. https://github.com/deepmind/alphafold/blob/9c4ac8a92125942f73813649d9f6885532c1ee97/alphafold/data/pipeline_multimer.py#L119
Around here.
for chain_id, chain_features in all_chain_features.items():
seq = str(chain_features['sequence'])
if seq not in seq_to_entity_id:
seq_to_entity_id[seq] = len(seq_to_entity_id) + 1
grouped_chains[seq_to_entity_id[seq]].append(chain_features)
new_all_chain_features = {}
chain_id = 1
for entity_id, group_chain_features in grouped_chains.items():
for sym_id, chain_features in enumerate(group_chain_features, start=1):
new_all_chain_features[
f'{int_id_to_str_id(entity_id)}_{sym_id}'] = chain_features
seq_length = chain_features['seq_length']
chain_features['asym_id'] = chain_id * np.ones(seq_length)
chain_features['sym_id'] = sym_id * np.ones(seq_length)
chain_features['entity_id'] = entity_id * np.ones(seq_length)
chain_id += 1
The order of chains become A, C, B, and D because chain_id is set sequentially. And original A, C, B, and D chains are labeled as B, C, D, and E in the output .pdb files, respectively.
My input fasta is
>M6gs2_2_A A
MHELTIYHFMSDKLNLYSDIGNIIALRQRAKKRNIKVNVVEINETEGITFDECDIFFIGGGSDREQALATKELSKIKTPLKEAIEDGMPGLTICGGYQFLGKKYITPDGTELEGLGILDFYTESKTNRLTGDIVIESDTFGTIVGFENHGGRTYHDFGTLGHVTFGYGNNDEDKKEGIHYKNLLGTYLHGPILPKNYEITDYLLEKACERKGIPFEPKEIDNEAEIQAKQVLIDRANRQKKSRLEHHHHHH
>M6gs2_2_B B
MRQWTAIHLAKLARKASRAVGKRGTDLPGQIARKVDTDVLRKLAEQVDDIVFISGTNGKTTTSNLIGHTLKANNIQIIHNNEGANMAAGITSAFIMQSTPKTKIAVIEIDEGSIPRVLKEVTPSMMVFTNFFRDQMDRFGEIDIMVNNIAETISNKGIKLLLNADDPFVSRLKIASDTIVYYGMKAHAHEFEQSTMNESRYCPNCGRLLQYDYIHYNQIGHYHCQCGFKREQAKYEISSFDVAPFLYLNINDEKYDMKIAGDFNAYNALAAYTVLRELGLNEQTIKNGFETYTSDNGRMQYFKKERKEAMINLAKNPAGMNASLSVGEQLEGEKVYVISLNDNAADGRDTSWIYDADFEKLSKQQIEAIIVTGTRAEELQLRLKLAEVEVPIIVERDIYKATAKTMDYKGFTVAIPNYTSLAPMLEQLNRSFEGGQS
>M6gs2_2_C C
MHELTIYHFMSDKLNLYSDIGNIIALRQRAKKRNIKVNVVEINETEGITFDECDIFFIGGGSDREQALATKELSKIKTPLKEAIEDGMPGLTICGGYQFLGKKYITPDGTELEGLGILDFYTESKTNRLTGDIVIESDTFGTIVGFENHGGRTYHDFGTLGHVTFGYGNNDEDKKEGIHYKNLLGTYLHGPILPKNYEITDYLLEKACERKGIPFEPKEIDNEAEIQAKQVLIDRANRQKKSRLEHHHHHH
>M6gs2_2_D D
MRQWTAIHLAKLARKASRAVGKRGTDLPGQIARKVDTDVLRKLAEQVDDIVFISGTNGKTTTSNLIGHTLKANNIQIIHNNEGANMAAGITSAFIMQSTPKTKIAVIEIDEGSIPRVLKEVTPSMMVFTNFFRDQMDRFGEIDIMVNNIAETISNKGIKLLLNADDPFVSRLKIASDTIVYYGMKAHAHEFEQSTMNESRYCPNCGRLLQYDYIHYNQIGHYHCQCGFKREQAKYEISSFDVAPFLYLNINDEKYDMKIAGDFNAYNALAAYTVLRELGLNEQTIKNGFETYTSDNGRMQYFKKERKEAMINLAKNPAGMNASLSVGEQLEGEKVYVISLNDNAADGRDTSWIYDADFEKLSKQQIEAIIVTGTRAEELQLRLKLAEVEVPIIVERDIYKATAKTMDYKGFTVAIPNYTSLAPMLEQLNRSFEGGQS
.
The contents of msas/chain_id_map.json is
{
"A": {
"description": "M6gs2_2_A A",
"sequence": "MHELTIYHFMSDKLNLYSDIGNIIALRQRAKKRNIKVNVVEINETEGITFDECDIFFIGGGSDREQALATKELSKIKTPLKEAIEDGMPGLTICGGYQFLGKKYITPDGTELEGLGILDFYTESKTNRLTGDIVIESDTFGTIVGFENHGGRTYHDFGTLGHVTFGYGNNDEDKKEGIHYKNLLGTYLHGPILPKNYEITDYLLEKACERKGIPFEPKEIDNEAEIQAKQVLIDRANRQKKSRLEHHHHHH"
},
"B": {
"description": "M6gs2_2_B B",
"sequence": "MRQWTAIHLAKLARKASRAVGKRGTDLPGQIARKVDTDVLRKLAEQVDDIVFISGTNGKTTTSNLIGHTLKANNIQIIHNNEGANMAAGITSAFIMQSTPKTKIAVIEIDEGSIPRVLKEVTPSMMVFTNFFRDQMDRFGEIDIMVNNIAETISNKGIKLLLNADDPFVSRLKIASDTIVYYGMKAHAHEFEQSTMNESRYCPNCGRLLQYDYIHYNQIGHYHCQCGFKREQAKYEISSFDVAPFLYLNINDEKYDMKIAGDFNAYNALAAYTVLRELGLNEQTIKNGFETYTSDNGRMQYFKKERKEAMINLAKNPAGMNASLSVGEQLEGEKVYVISLNDNAADGRDTSWIYDADFEKLSKQQIEAIIVTGTRAEELQLRLKLAEVEVPIIVERDIYKATAKTMDYKGFTVAIPNYTSLAPMLEQLNRSFEGGQS"
},
"C": {
"description": "M6gs2_2_C C",
"sequence": "MHELTIYHFMSDKLNLYSDIGNIIALRQRAKKRNIKVNVVEINETEGITFDECDIFFIGGGSDREQALATKELSKIKTPLKEAIEDGMPGLTICGGYQFLGKKYITPDGTELEGLGILDFYTESKTNRLTGDIVIESDTFGTIVGFENHGGRTYHDFGTLGHVTFGYGNNDEDKKEGIHYKNLLGTYLHGPILPKNYEITDYLLEKACERKGIPFEPKEIDNEAEIQAKQVLIDRANRQKKSRLEHHHHHH"
},
"D": {
"description": "M6gs2_2_D D",
"sequence": "MRQWTAIHLAKLARKASRAVGKRGTDLPGQIARKVDTDVLRKLAEQVDDIVFISGTNGKTTTSNLIGHTLKANNIQIIHNNEGANMAAGITSAFIMQSTPKTKIAVIEIDEGSIPRVLKEVTPSMMVFTNFFRDQMDRFGEIDIMVNNIAETISNKGIKLLLNADDPFVSRLKIASDTIVYYGMKAHAHEFEQSTMNESRYCPNCGRLLQYDYIHYNQIGHYHCQCGFKREQAKYEISSFDVAPFLYLNINDEKYDMKIAGDFNAYNALAAYTVLRELGLNEQTIKNGFETYTSDNGRMQYFKKERKEAMINLAKNPAGMNASLSVGEQLEGEKVYVISLNDNAADGRDTSWIYDADFEKLSKQQIEAIIVTGTRAEELQLRLKLAEVEVPIIVERDIYKATAKTMDYKGFTVAIPNYTSLAPMLEQLNRSFEGGQS"
}
}
.
I uploaded rank_0.pdb here. https://gist.github.com/t-oda-ic/dc3cf683ce596dd57763a5af77d764b5
In summary, the original chain information is completely missing in the output.
I can solve this problem just parsing amino acid sequences in the output pdb file but I'd appreciate if official script could provide some information for input-output mapping.
Concerning @nzrandol's initial report, I'm thinking that the issue may come from https://github.com/deepmind/alphafold/blob/18e12d61314214c51ca266d192aad3cc6619018a/alphafold/common/protein.py#L177 which is involved in the writing of the PDB file on disk.
Note that, in the lines referred to above by @yamule,
https://github.com/deepmind/alphafold/blob/c128d1aa2c21407fbe51d3cd87b85d4c5942056d/alphafold/data/pipeline_multimer.py#L143-L153
although numerical chain IDs start at 1, the int_id_to_str_id
function maps 1 -> A, 2 -> B, .... And, indeed, chain IDs in alphabetical format are correct in the intermediate pickle files written to disk by alphafold. On the other hand, in the code from common/protein.py
one can see that the mapping back from ints to letters (for writing on the PDB) is given by 0 -> A, 1 -> B, .... By changing to
chain_ids[i] = PDB_CHAIN_IDS[i - 1]
I was able to solve the issue in my examples. It would be great if people could weigh in on whether my interpretation is correct, and whether this should be fixed in the codebase.
OTOH, @t-oda-ic's issue seems different. Does it stem from the fact that, for 100% identical sequences,
https://github.com/deepmind/alphafold/blob/c128d1aa2c21407fbe51d3cd87b85d4c5942056d/alphafold/data/pipeline_multimer.py#L139
will cause a failure to create an entry for seq_to_entity_id
for one of the two identical copies?
I noticed that at some point when running different versions of AF with different set of parameters -- I was wrongly assuming it is just "due to some bad AF version, otherwise why some of my output has correct chains ABCD..., but others are affected by that off-by-one bug? Why majority of AF users doesn't notice that?".
So for now I figured, that it is the relaxation step, where chain IDs are fixed for major amount of users.
If you enable AF relaxation, you can immediately compare unrelaxed
and ranked
PDBs: the former will have off-by-one chains, the latter will have correct chains starting with "A".
So in the out-of-the-box AlphaFold setup scenario there are actually two side effects
- off-by-one bug in AlphaFold PDB writer already mentioned above in this GitHub issue thread
- side effect in relaxation step
that together cause the Chain IDs to look correct in the final PDB file.
This side effect is due to (accidentally or not) another PDB write attempt using OpenMM function omitting one of the optional parameters keepIds
, which falls back to default False
value.
https://github.com/google-deepmind/alphafold/blob/032e2f26732d1473ec5db7ff5b8572754576e459/alphafold/relax/amber_minimize.py#L113-L117
which defaults to keepIds=False
:
https://github.com/openmm/openmm/blob/122bbe40407ca281d6efa1bf98fbcf148a9c4d24/wrappers/python/openmm/app/pdbfile.py#L282-L286
keepIds : bool=False
If True, keep the residue and chain IDs specified in the Topology
rather than generating new ones. Warning: It is up to the caller to
make sure these are valid IDs that satisfy the requirements of the
PDB format. Otherwise, the output file will be invalid.
https://github.com/openmm/openmm/blob/122bbe40407ca281d6efa1bf98fbcf148a9c4d24/wrappers/python/openmm/app/pdbfile.py#L358-L361
if keepIds and len(chain.id) == 1:
chainName = chain.id
else:
chainName = chr(ord('A')+chainIndex%26)
@t-oda-ic, I'd like to clarify one line of yours:
In summary, the original chain information is completely missing in the output.
That is not technically true and we can use it to our benefit. Fact is that we lost information about the chain names, but due to alphabet-based naming we preserve information about the order of the chains.
And this is exactly how keepIds=False
kinda "restores" expected chain IDs and patches the off-by-one error in the original AF module.