diamond
Error: invalid character (.) in sequence
Hello,
I'm running Diamond on a set of samples consisting of FASTA files with amino acid sequences. I generated those files from the annotation files using:
gffread file.gff -g file.fa -y aa_file.fa
The files then consist of FASTA headers and the amino acid sequences. The error is reported at different positions (2408, 2496, 2527, etc.), but when I look at those positions in the FASTA file there's nothing wrong with the sequence, and in particular there isn't any "." in the sequence as the error says. I also checked the other related issues about invalid characters but couldn't relate them to my case. I'll submit one of the files here (from a public reference genome). Thank you in advance.
There seems to be a "." in your sequence:
>g2497.t1 gene=g2497
MNIYTSPTRTPNIAPKSGQRPSLPMLATDERSTDKESPNEDREFVPCSSLDVRRIYPKGPLLVLPEKIYL
YSEPTVKELLPFDVVINVAEEANDLRMQVPAVEYHHYRWEHDSQIALDLPSLTSIIHAATTKREKILIHC
QCGLSRSATLIIAYIMKYHNLSLRHSYDLLKSRADKINPSIGLIFQLMEWEVALNAKTNVQANSYRKKRS
LSSYLSNVSTRREELEKISKQETSEEEDTAGKHEQRETLSEEVSDKFPENVASFRSQTTSVHQATQNNLN
AKESEDLAHKNDASSHEGEVNGDSRPDDVPETNEKISQAIRAKISSSSSSPNVRNVDIQNHQPFSRDQLR
AMLKEPKRKTVDDFIEEEGLGAVEEEDLSDEVLEKNTTEPENVEKDIEYSDSDKDTDDVGSDDPTAPNSP
IKLGRRKLVRGDQLDATTSSMFNNESDSELSDIDDSKNIALSSSLFRGGSSPVKETNNNLSNMNSSPAQN
PKRGSVSRSNDSNKSSHIAVSKRPKQKKGIYRDSGGRTRLQIACDKGKYDVVKKMIEEGGYDINDQDNAG
NTALHEAALQGHIEIVELLIENGADVNIKSIEMFGDTPLIDASANGHLDVVKYLLKNGADPTIRNAKGLT
AFESVDDESEFDDEEDQKILREIKKRLSIAAKKWTNRAGIHNDKSKNGNNAHTIDQPPFDNTTKAKNEKA
ADSPSMASNIDEKAPEEEFYWTDVTSRAGKEKLFKASKEGHLPYVGTYVENGGKIDLRSFFESVKCGHED
ITSIFLAFGFPVNQTSRDNKTSALMVAVGRGHLGTVKLLLEAGADPTKRDKKGRTALYYAKNSIMGITNS
EEIQLIENAINNYLKKHSEDNNDDDDDDDNNNETYKHEKKREKTQSPILASRRSATPRIEDEEDDTRMLN
LADDDFNNDRDVKESTTSDSRKRLDDNENVGTQYSLDWKKRKTNALQDEEKLKSISPLSMEPHSPKKAKS
VEISKIHEETAAEREARLKEEEEYRKKRLEKKRKKEQELLQKLAEDEKKRIEEQEKQKVLEMERLEKATL
EKARKMEREKEMEEISYRRAVRDLYPLGLKIINFNDKLDYKRFLPLYYFVDEKNDKFVLDLQVMILLKDI
DLLSKDNQPTSEKIPVDPSHLTPLWNMLKFIFLYGGSYDDKKNNMENKRYVVNFDGVDLDTKIGYELLEY
KKFVSLPMAWIKWDNVVIENHAKRKEIEGNMIQISINEFARWRNDKLNKAQQPTRKQRSLKIPRELPVKF
QHRMSISSVLQQTSKEPF.FVQTKALSKATLTDLPERWENMPNLEQKEIADNLTERQKLPWKTLNNEEIK
AAWYISYGEWGPRRPVHGKGDVAFITKGVFLGLGISFGLFGLVRLLANPETPKTMNREWQLKSDEYLKSK
NANPWGGYSQVQSK
Sorry! I thought that the number that appears before the error referred to the position (line) in the file, not the entry. I guess that "." should be a "*". It may have to do with the fact that some genes have introns (like that one, g2497), and in the annotation file (gff) the position with the "." is instead an X. I guess gffread transformed it to a ".". I'll look in more detail at why there's a dot there and whether I should remove it or convert it to "*". Thank you very much for your prompt answer.
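Since the number in the error refers to a sequence entry rather than a line in the file, a small script can help locate the offending records. The following is a minimal Python sketch, not part of Diamond: the function name `find_invalid_residues` and the `ALLOWED` character set are assumptions made for illustration, and whether Diamond counts entries from 0 or from 1 is not confirmed here, so the header is reported alongside the record number.

```python
# Hypothetical helper, not part of Diamond or gffread.
# Assumption: the 20 standard residues, common ambiguity codes,
# "U", "O", "-" and "*" are considered acceptable; adjust as needed.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWYBJZXUO*-")


def find_invalid_residues(handle, allowed=ALLOWED):
    """Yield (record_number, header, offset, char) for each unexpected character.

    record_number is 1-based and counts FASTA entries, not file lines.
    offset is the 1-based position within the entry's sequence.
    """
    record_number = 0
    header = None
    offset = 0
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new entry starts: bump the entry counter, reset the offset.
            record_number += 1
            header = line[1:]
            offset = 0
            continue
        for ch in line:
            offset += 1
            if ch.upper() not in allowed:
                yield record_number, header, offset, ch
```

To use it on a file, iterate over the generator, for example `for hit in find_invalid_residues(open("aa_file.fa")): print(hit)`.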
@bbuchfink (love the tool!!! - great work. I use it loads!). Would it be possible to make diamond ignore "." or "*", basically translated stops?
A "*" should already be ignored or treated as a stop. I'm not aware that a "." is also used to encode a stop. An option to ignore certain characters could certainly be added. If you don't mind doing a little hacking, you can edit src/basic/value.cpp, line 58:
const Value_traits amino_acid_traits(AMINO_ACID_ALPHABET, 23, "UO-", Sequence_type::amino_acid);
In the string "UO-" you can add additional characters that should be ignored and treated as X.
Hello! I wonder whether the option to ignore certain characters has been added. Thank you!
No sorry, this has not been added.
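As a workaround that avoids recompiling, the FASTA file can be preprocessed before it is passed to Diamond. Below is a minimal Python sketch under stated assumptions: the function name `sanitize_fasta` is hypothetical, and the default mapping of "." to "*" follows the guess made earlier in this thread that the dot stands for a stop. Mapping it to "X" instead may be more appropriate, depending on what the dot actually represents in your data.

```python
# Hypothetical preprocessing helper, not part of Diamond.
def sanitize_fasta(lines, replacements=None):
    """Yield FASTA lines with characters in sequence lines replaced.

    Header lines (starting with ">") are passed through unchanged,
    so dots in identifiers or descriptions are preserved.
    """
    if replacements is None:
        # Assumption: "." marks a translated stop and should become "*".
        replacements = {".": "*"}
    table = str.maketrans(replacements)
    for line in lines:
        if line.startswith(">"):
            yield line
        else:
            yield line.translate(table)
```

For example, `open("aa_file.clean.fa", "w").writelines(sanitize_fasta(open("aa_file.fa")))` would write a cleaned copy; the output file name here is only an illustration.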