AlphaFold (or hmmsearch) is not able to parse some pdb_seqres.txt due to unusual residue naming
Dear all, I had trouble running a prediction with an updated pdb_seqres.txt file because some entries (PDB codes 7ooo, 7oos and 7ozz) contain unusual DNA residue names. These nucleic acids contain modified residues whose names do not follow the DNA alphabet, so the parser fails with an error on the character "0" (zero).
Traceback here and details below:
Traceback (most recent call last):
File "/app/alphafold/run_alphafold.py", line 422, in
hmmsearch :: search profile(s) against a sequence database
HMMER 3.3.2 (Nov 2020); http://hmmer.org/
Copyright (C) 2020 Howard Hughes Medical Institute.
Freely distributed under the BSD open source license.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
query HMM file: /tmp/tmp2i0w1r3m/query.hmm
target sequence database: /scratch/shared/dataset/alphafold_data/pdb_seqres/pdb_seqres.txt
MSA of all hits saved to file: /tmp/tmp2i0w1r3m/output.sto
show alignments in output: no
sequence reporting threshold: E-value <= 100
domain reporting threshold: E-value <= 100
sequence inclusion threshold: E-value <= 100
domain inclusion threshold: E-value <= 100
MSV filter P threshold: <= 0.1
Vit filter P threshold: <= 0.1
Fwd filter P threshold: <= 0.1
number of worker threads: 8
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Query: query [M=242]
stderr: Parse failed (sequence file /scratch/shared/dataset/alphafold_data/pdb_seqres/pdb_seqres.txt): Line 1364756: illegal character 0
After manually editing the file to remove the offending characters such as "05H" (the modified DNA nucleotide), the error is gone. Here is the full diff:
diff -Naup pdb_seqres/pdb_seqres.txt-orig pdb_seqres/pdb_seqres.txt
--- pdb_seqres/pdb_seqres.txt-orig 2022-09-13 00:19:53.000000000 +0200
+++ pdb_seqres/pdb_seqres.txt 2022-09-13 00:36:37.000000000 +0200
@@ -1360655,9 +1360655,9 @@ CAAAGAAAAG
 7ooo_D mol:na length:10 RNA (5'-R(CPAPAPAPGPAPAPAPAPG)-3')
 CAAAGAAAAG
 7ooo_B mol:na length:11 DNA (5'-D(CPTP*(RWQ)PTPCPTPTPTPG)-3')
-CT05ATCTTTG
+CTATCTTTG
 7ooo_E mol:na length:11 DNA (5'-D(CPTP*(RWQ)PTPCPTPTPTPG)-3')
-CT05ATCTTTG
+CTATCTTTG
 7oop_A mol:protein length:1970 DNA-directed RNA polymerase II subunit RPB1
 MHGGGPPSGDSACPLRTIKRVQFGVLSPDELKRMSVTEGGIKYPETTEGGRPKLGGLMDPRQGVIERTGRCQTCAGNMTECPGHFGHIELAKPVFHVGFLVKTMKVLRCVCFFCSKLLVDSNNPKIKDILAKSKGQPKKRLTHVYDLCKGKNICEGGEEMDNKFGVEQPEGDEDLTKEKGHGGCGRYQPRIRRSGLELYAEWKHVNEDSQEKKILLSPERVHEIFKRISDEECFVLGMEPRYARPEWMIVTVLPVPPLSVRPAVVMQGSARNQDDLTHKLADIVKINNQLRRNEQNGAAAHVIAEDVKLLQFHVATMVDNELPGLPRAMQKSGRPLKSLKQRLKGKEGRVRGNLMGKRVDFSARTVITPDPNLSIDQVGVPRSIAANMTFAEIVTPFNIDRLQELVRRGNSQYPGAKYIIRDNGDRIDLRFHPKPSDLHLQTGYKVERHMCDGDIVIFNRQPTLHKMSMMGHRVRILPWSTFRLNLSVTTPYNADFDGDEMNLHLPQSLETRAEIQELAMVPRMIVTPQSNRPVMGIVQDTLTAVRKFTKRDVFLERGEVMNLLMFLSTWDGKVPQPAILKPRPLWTGKQIFSLIIPGHINCIRTHSTHPDDEDSGPYKHISPGDTKVVVENGELIMGILCKKSLGTSAGSLVHISYLEMGHDITRLFYSNIQTVINNWLLIEGHTIGIGDSIADSKTYQDIQNTIKKAKQDVIEVIEKAHNNELEPTPGNTLRQTFENQVNRILNDARDKTGSSAQKSLSEYNNFKSMVVSGAKGSKINISQVIAVVGQQNVEGKRIPFGFKHRTLPHFIKDDYGPESRGFVENSYLAGLTPTEFFFHAMGGREGLIDTAVKTAETGYIQRRLIKSMESVMVKYDATVRNSINQVVQLRYGEDGLAGESVEFQNLATLKPSNKAFEKKFRFDYTNERALRRTLQEDLVKDVLSNAHIQNELEREFERMREDREVLRVIFPTGDSKVVLPCNLLRMIWNAQKIFHINPRLPSDLHPIKVVEGVKELSKKLVIVNGDDPLSRQAQENATLLFNIHLRSTLCSRRMAEEFRLSGEAFDWLLGEIESKFNQAIAHPGEMVGALAAQSLGEPATQMTLNTFHYAGVSAKNVTLGVPRLKELINISKKPKTPSLTVFLLGQSARDAERAKDILCRLEHTTLRKVTANTAIYYDPNPQSTVVAEDQEWVNVYYEMPDFDVARISPWLLRVELDRKHMTDRKLTMEQIAEKINAGFGDDLNCIFNDDNAEKLVLRIRIMNSDENKMQEEEEVVDKMDDDVFLRCIESNMLTDMTLQGIEQISKVYMHLPQTDNKKKIIITEDGEFKALQEWILETDGVSLMRVLSEKDVDPVRTTSNDIVEIFTVLGIEAVRKALERELYHVISFDGSYVNYRHLALLCDTMTCRGHLMAITRHGVNRQDTGPLMKCSFEETVDVLMEAAAHGESDPMKGVSENIMLGQLAPAGTGCFDLLLDAEKCKYGMEIPTNIPGLGAAGPTGMFFGSAPSPMGGISPAMTPWNQGATPAYGAWSPSVGSGMTPGAAGFSPSAASDASGFSPGYSPAWSPTPGSPGSPGPSSPYIPSPGGAMSPSYSPTSPAYEPRSPGGYTPQSPSYSPTSPSYSPTSPSYSPTSPNYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPNYSPTSPNYTPTSPSYSPTSPSYSPTSPNYTPTSPNYSPTSPSYSPTSPSYSPTSPSYSPSSPRYTPQSPTYTPSSPSYSPSSPSYSPTSPKYTPTSPSYSPSSPEYTPTSPKYSPTSPKYSPTSPKYSPTSPTYSPTTPKYSPTSPTYSPTSPVYTPTSPKYSPTSPTYSPTSPKYSPTSPTYSPTSPKGSTYSPTSPGYSPTSPTYSLTSPAISPDDSDEEN
 7oop_J mol:protein length:67 DNA-directed RNA polymerases I, II, and III subunit RPABC5
@@ -1360717,7 +1360717,7 @@ MWKDKEFQVLFVLTILTLISGTIFYSTVEGLRPIDALYFS
 7oos_A mol:na length:10 RNA (5'-R(CPAPAPAPGPAPAPAPAPG)-3')
 CAAAGAAAAG
 7oos_B mol:na length:11 DNA (5'-D(CPTP*(RWT)PTPCPTPTPTPG)-3')
-CT05KTCTTTG
+CTTCTTTG
 7oot_A mol:protein length:141 Interferon regulatory factor 4
 MGSHHHHHHSAALEVLFQGPGGNGKLRQWLIDQIDSGKYPGLVWENEEKSIFRIPWKHAGKQDYNREEDAALFKAWALFKGKFREGIDKPDPPTWKTRLRCALNKSNDFEELVERSQLDISDPYKVYRIVPEGAKKGAKQL
 7oot_B mol:protein length:141 Interferon regulatory factor 4
@@ -1364753,7 +1364753,7 @@ GSHMEYELPEDPKWEFPRDKLTLGKPLGEGCFGQVVMAEA
 7ozz_A mol:na length:10 RNA (5'-R(CPAPAPAPGPAPAPAPAPG)-3')
 CAAAGAAAAG
 7ozz_B mol:na length:11 DNA (5'-D(CPTP*(RWR)PTPCPTPTPTPG)-3')
-CT05HTCTTTG
+CTTCTTTG
 7p00_H mol:protein length:298 Antibody fragment scFv16
 MKFLVNVALVFMVVYISYIYADYKDDDDKHHHHHHHHHHLEVLFQGPDVQLVESGGGLVQPGGSRKLSCSASGFAFSSFGMHWVRQAPEKGLEWVAYISSGSGTIYYADTVKGRFTISRDDPKNTLFLQMTSLRSEDTAMYYCVRSIYYYGSSPFDFWGQGTTLTVSSGGGGSGGGGSGGGGSDIVMTQATSSVPVTPGESVSISCRSSKSLLHSNGNTYLYWFLQRPGQSPQLLIYRMSNLASGVPDRFSGSGSGTAFTLTISRLEAEDVGVYYCMQHLEYPLTFGAGTKLELKAAA
 7p00_B mol:protein length:354 Guanine nucleotide-binding protein G(I)/G(S)/G(T) subunit beta-1
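For anyone who needs to repeat this edit after each database update, the manual change could probably be scripted. Below is a minimal sketch, not something taken from AlphaFold or HMMER. It assumes the layout visible in the diff (a header line containing "mol:na" or "mol:protein", followed by one sequence line) and assumes that a modified residue always shows up as a three-character token starting with two digits, like "05H". The name clean_seqres_lines is hypothetical.

```python
import re
from typing import Iterable, Iterator

# Assumption: a modified residue appears as a three-character token that
# starts with two digits (e.g. "05A", "05H", "05K") inside the sequence line.
MODIFIED_RESIDUE = re.compile(r"\d\d[A-Z]")


def clean_seqres_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield lines, removing digit-prefixed tokens from mol:na sequences only."""
    in_nucleic_acid = False
    for line in lines:
        if " mol:" in line:
            # Header line: remember whether the following sequence is a nucleic acid.
            in_nucleic_acid = " mol:na " in line
            yield line
        elif in_nucleic_acid:
            # Sequence line of a nucleic-acid record: drop the offending tokens.
            yield MODIFIED_RESIDUE.sub("", line)
        else:
            # Protein sequences and anything else pass through unchanged.
            yield line
```

Feeding it the 7ozz_B record turns "CT05HTCTTTG" into "CTTCTTTG", as in the last hunk of the diff. It removes the whole three-character token, so for 7ooo it would also drop the "A" that the manual edit kept. As in the manual edit, the length: field in the header is left untouched.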
I do not think this error (the "Parse failed" message) belongs to hmmsearch, but rather to AlphaFold. Maybe an exception should be raised without halting the whole process?
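To illustrate the kind of behaviour I mean, here is a generic sketch using only the Python standard library. It is not AlphaFold's actual hmmsearch wrapper: run_tool_or_none is a hypothetical helper, and whether a run can sensibly continue without the search results is an assumption the caller would have to check.

```python
import logging
import subprocess
from typing import Optional, Sequence


def run_tool_or_none(cmd: Sequence[str]) -> Optional[str]:
    """Run an external command; return its stdout, or None if it exits non-zero."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError as err:
        # Log the tool's stderr (for example a "Parse failed" message) and
        # let the caller decide how to proceed instead of aborting here.
        logging.warning("Command %s failed: %s", cmd[0], err.stderr.strip())
        return None
    return result.stdout
```

A caller would then test for None and, for example, continue without templates. A missing executable still raises FileNotFoundError, which this sketch deliberately does not catch.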
Thanks a lot for your time. I'll report this to the hmmsearch developers too (linking this issue).