Predicted structure is very different from the RCSB PDB structure.
Hi, First of all, thank you for such amazing open source efforts. Also, I am very, very new to this domain, just started learning about protein structure prediction and folding.
While experimenting with ColabFold, I picked a random example, the insulin complex [pdb_00008ez0], and downloaded its FASTA sequence:
>8EZ0_1|Chains A, B|Insulin receptor|Mus musculus (10090)
HLYPGEVCPGMDIRNNLTRLHELENCSVIEGHLQILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFPNLTVIRGSRLFFNYALVIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDWSRILDSVEDNYIVLNKDDNEECGDVCPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTICKSHGCTAEGLCCHKECLGNCSEPDDPTKCVACRNFYLDGQCVETCPPPYYHFQDWRCVNFSFCQDLHFKCRNSRKPGCHQYVIHNNKCIPECPSGYTMNSSNLMCTPCLGPCPKVCQILEGEKTIDSVTSAQELRGCTVINGSLIINIRGGNNLAAELEANLGLIEEISGFLKIRRSYALVSLSFFRKLHLIRGETLEIGNYSFYALDNQNLRQLWDWSKHNLTITQGKLFFHYNPKLCLSEIHKMEEVSGTKGRQERNDIALKTNGDQASCENELLKFSFIRTSFDKILLRWEPYWPPDFRDLLGFMLFYKEAPYQNVTEFDGQDACGSNSWTVVDIDPPQRSNDPKSQTPSHPGWLMRGLKPWTQYAIFVKTLVTFSDERRTYGAKSDIIYVQTDATNPSVPLDPISVSNSSSQIILKWKPPSDPNGNITHYLVYWERQAEDSELFELDYCLKGLKLPSRTWSPPFESDDSQKHNQSEYDDSASESSSSPKTDSQILKELEESSFRKTFEDYLHNVVFVPRPSRKRRSLEEVGNVTATTLTLPDFPNVSSTIVPTSQEEHRPFEKVVNKESLVISGLRHFTGYRIELQACNQDSPDERCSVAAYVSARTMPEAKADDIVGPVTHEIFENNVVHLMWQEPKEPNGLIVLYEVSYRRYGDEELHLCVSRKHFALERGCRLRGLSPGNYSVRVRATSLAGNGSWTEPTYFYVTDYLDVPSNIAKIIIGPLIFVFLFSVVIGSIYLFLRKRQPDGPMGPLYASSNPEYLSASDVFPSSVYVPDEWEVPREKITLLRELGQGSFGMVYEGNAKDIIKGEAETRVAVKTVNESASLRERIEFLNEASVMKGFTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSHLRSLRPDAENNPGRPPPTLQEMIQMTAEIADGMAYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPVRWMSPESLKDGVFTASSDMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDPPDNCPERLTDLMRMCWQFNPKMRPTFLEIVNLLKDDLHPSFPEVSFFYSEENKAPESEELEMEFEDMENVPLDRSSHCQREEAGGREGGSSLSIKRTYDEHIPYTHMNGGKKNGRVLTLPRSNPS
>8EZ0_2|Chains C[auth D], D[auth E], E[auth F], F[auth G]|Insulin|Homo sapiens (9606)
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
And I ran this command:
colabfold_batch --templates --amber insulin_sample.fasta results/
Everything ran successfully, I got a decent pLDDT score of 77.1, and I downloaded this PDB:
8EZ0_2_Chains_C_auth_D___D_auth_E___E_auth_F___F_auth_G__Insulin_Homo_sapiens__9606__relaxed_rank_001_alphafold2_ptm_model_1_seed_000.pdb
which, to my knowledge, is the most stable predicted structure. However, when I visualize the PDB, I get something like this:
The green one is the generated structure, and the complex structure in the back is the actual one from RCSB PDB.
My questions are:
- What am I doing wrong here? Am I interpreting things differently?
- Is my input FASTA wrong? Should I have used : for multimers?
- Is there anything else I have to be aware of in general when predicting complexes or multimers?
Because the same thing happened in the case of hemoglobin complex as well.
Hi. It runs a separate prediction for each FASTA sequence, so yes, you will need to use a colon. You will also need to give it two copies of the IR sequence, since it is a dimer, truncate the signal peptide so that the sequence starts with HLY..., and include a chain break just after the furin cleavage site (RKRR). Similarly, you will need to remove the insulin C-peptide, leaving an A chain and a B chain. When I teach students about AlphaFold, I use this as an example to demonstrate the importance of knowing the system that you are predicting. PTMs and oligomerization state are critically important but are not found in the primary sequence.
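As a rough illustration of the preprocessing described above, here is a minimal Python sketch that splits the sequences and joins the chains with colons into a single FASTA record. The helper names (`extract_mature_insulin`, `split_after_motif`, `build_complex_record`) are made up for this example and are not part of ColabFold. The A and B chain sequences are the mature human insulin chains as I know them, so please verify them against a curated annotation before relying on them. The receptor string is a short toy stand-in, not the real sequence, and the copy numbers are assumptions you should adjust to your system.

```python
# Hypothetical helpers for building a colon-separated multimer input.
# None of these names come from ColabFold; they are illustrative only.

# Preproinsulin exactly as given in the FASTA record 8EZ0_2 above.
PREPROINSULIN = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"

# Assumed mature chain sequences (verify against a curated annotation).
INSULIN_B = "FVNQHLCGSHLVEALYLVCGERGFFYTPKT"
INSULIN_A = "GIVEQCCTSICSLYQLENYCN"


def extract_mature_insulin(prepro: str) -> tuple[str, str]:
    """Return (A chain, B chain), dropping signal peptide and C-peptide."""
    if INSULIN_B not in prepro or INSULIN_A not in prepro:
        raise ValueError("expected A/B chain sequences not found")
    return INSULIN_A, INSULIN_B


def split_after_motif(seq: str, motif: str = "RKRR") -> tuple[str, str]:
    """Split seq into two chains, breaking just after the first motif hit."""
    head, sep, tail = seq.partition(motif)
    if not sep:
        raise ValueError(f"motif {motif!r} not found")
    return head + sep, tail


def build_complex_record(name: str, chains: list[str]) -> str:
    """Build one FASTA record whose chains are joined by ':'."""
    return f">{name}\n{':'.join(chains)}\n"


# Toy stand-in for the receptor; replace with the full sequence from 8EZ0_1.
toy_receptor = "HLYPGEVCPSRKRRSLEEVGNV"
alpha, beta = split_after_motif(toy_receptor)
a_chain, b_chain = extract_mature_insulin(PREPROINSULIN)

receptor_copies = 2  # the receptor is a dimer, per the reply above
insulin_copies = 4   # assumption: taken from the 8EZ0 header (four insulin chains)

chains = [alpha, beta] * receptor_copies + [a_chain, b_chain] * insulin_copies
record = build_complex_record("IR_insulin_complex", chains)
print(record)
```

Writing `record` to a file gives a single-record FASTA that can be passed to colabfold_batch in place of the two-record file used in the original command. Note that a complex of this size may need a lot of GPU memory, so it can be worth starting with fewer insulin copies.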