Error: invalid character (.) in sequence

Open cdiazmun opened this issue 4 years ago • 6 comments

Hello,

I'm running Diamond on a set of samples consisting of FASTA files with amino acid sequences. I generated those files from the annotation files by using:

gffread file.gff -g file.fa -y aa_file.fa

The resulting files consist of FASTA headers and amino acid sequences. Diamond says that the problem appears at different positions (2408, 2496, 2527, etc.), but when I look at those positions in the FASTA file there's nothing wrong with the sequence, and in particular there isn't any "." in the sequence as the error says. I also checked the other related issues about invalid characters but couldn't relate them to my case. I'll attach one of the files here (from a public reference genome). Thank you in advance.

S288C_genome_nomit.fa.gz

cdiazmun avatar Aug 13 '20 08:08 cdiazmun

There seems to be a . in your sequence:

>g2497.t1 gene=g2497
MNIYTSPTRTPNIAPKSGQRPSLPMLATDERSTDKESPNEDREFVPCSSLDVRRIYPKGPLLVLPEKIYL
YSEPTVKELLPFDVVINVAEEANDLRMQVPAVEYHHYRWEHDSQIALDLPSLTSIIHAATTKREKILIHC
QCGLSRSATLIIAYIMKYHNLSLRHSYDLLKSRADKINPSIGLIFQLMEWEVALNAKTNVQANSYRKKRS
LSSYLSNVSTRREELEKISKQETSEEEDTAGKHEQRETLSEEVSDKFPENVASFRSQTTSVHQATQNNLN
AKESEDLAHKNDASSHEGEVNGDSRPDDVPETNEKISQAIRAKISSSSSSPNVRNVDIQNHQPFSRDQLR
AMLKEPKRKTVDDFIEEEGLGAVEEEDLSDEVLEKNTTEPENVEKDIEYSDSDKDTDDVGSDDPTAPNSP
IKLGRRKLVRGDQLDATTSSMFNNESDSELSDIDDSKNIALSSSLFRGGSSPVKETNNNLSNMNSSPAQN
PKRGSVSRSNDSNKSSHIAVSKRPKQKKGIYRDSGGRTRLQIACDKGKYDVVKKMIEEGGYDINDQDNAG
NTALHEAALQGHIEIVELLIENGADVNIKSIEMFGDTPLIDASANGHLDVVKYLLKNGADPTIRNAKGLT
AFESVDDESEFDDEEDQKILREIKKRLSIAAKKWTNRAGIHNDKSKNGNNAHTIDQPPFDNTTKAKNEKA
ADSPSMASNIDEKAPEEEFYWTDVTSRAGKEKLFKASKEGHLPYVGTYVENGGKIDLRSFFESVKCGHED
ITSIFLAFGFPVNQTSRDNKTSALMVAVGRGHLGTVKLLLEAGADPTKRDKKGRTALYYAKNSIMGITNS
EEIQLIENAINNYLKKHSEDNNDDDDDDDNNNETYKHEKKREKTQSPILASRRSATPRIEDEEDDTRMLN
LADDDFNNDRDVKESTTSDSRKRLDDNENVGTQYSLDWKKRKTNALQDEEKLKSISPLSMEPHSPKKAKS
VEISKIHEETAAEREARLKEEEEYRKKRLEKKRKKEQELLQKLAEDEKKRIEEQEKQKVLEMERLEKATL
EKARKMEREKEMEEISYRRAVRDLYPLGLKIINFNDKLDYKRFLPLYYFVDEKNDKFVLDLQVMILLKDI
DLLSKDNQPTSEKIPVDPSHLTPLWNMLKFIFLYGGSYDDKKNNMENKRYVVNFDGVDLDTKIGYELLEY
KKFVSLPMAWIKWDNVVIENHAKRKEIEGNMIQISINEFARWRNDKLNKAQQPTRKQRSLKIPRELPVKF
QHRMSISSVLQQTSKEPF.FVQTKALSKATLTDLPERWENMPNLEQKEIADNLTERQKLPWKTLNNEEIK
AAWYISYGEWGPRRPVHGKGDVAFITKGVFLGLGISFGLFGLVRLLANPETPKTMNREWQLKSDEYLKSK
NANPWGGYSQVQSK

bbuchfink avatar Aug 13 '20 08:08 bbuchfink

Sorry! I thought that the number that appears before the error referred to the position (line) in the file, not the entry. I guess that . should be a *. It may have to do with the fact that some genes have introns (like that one, g2497), and in the annotation file (gff) the position with the . is instead an X. I guess gffread transformed it to a dot. I'll look in more detail at why there's a dot there and whether I should remove it or transform it to *. Thank you very much for your prompt answer.
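
For anyone hitting the same confusion, here is a minimal Python sketch that reports invalid characters by FASTA entry rather than by file line. The function name `find_invalid` and the `VALID` alphabet are assumptions made for illustration, not part of Diamond; the thread also does not say whether Diamond counts entries from 0 or from 1, so the entry numbers printed here (1-based) may be off by one relative to the error message.

```python
import io

# Assumption: residue letters accepted in a protein FASTA file.
# '*' (stop) and '-' (gap) are allowed here; '.' is deliberately not.
VALID = set("ACDEFGHIKLMNPQRSTVWYBZXJUO*-")

def find_invalid(handle, valid=VALID):
    """Yield (record_number, header, position, char) for each invalid character.

    record_number is 1-based and counts FASTA entries, not file lines.
    position is 1-based within the full sequence of that entry.
    """
    record = 0
    header = None
    pos = 0
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            record += 1          # a new entry starts at every header line
            header = line[1:]
            pos = 0
            continue
        for ch in line:
            pos += 1
            if ch.upper() not in valid:
                yield record, header, pos, ch

if __name__ == "__main__":
    # Tiny made-up input; pass open("aa_file.fa") to scan a real file.
    demo = io.StringIO(">g1 gene=g1\nMKT\n>g2 gene=g2\nMSKEPF.FVQ\nTKA\n")
    for hit in find_invalid(demo):
        print(hit)
```

Only sequence lines are checked, so dots in headers such as g2497.t1 are not reported.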

cdiazmun avatar Aug 13 '20 09:08 cdiazmun

@bbuchfink (love the tool!!! - great work. I use it loads!). Would it be possible to make diamond ignore "." or "*", basically translated stops?

peterthorpe5 avatar Dec 17 '20 12:12 peterthorpe5

A * should already be ignored or treated as a stop. I'm not aware that a . is also used to encode a stop. An option to ignore certain characters could certainly be added. If you don't mind doing a little hacking, you can edit src/basic/value.cpp, line 58:

const Value_traits amino_acid_traits(AMINO_ACID_ALPHABET, 23, "UO-", Sequence_type::amino_acid);

In the string "UO-" you can add additional characters that should be ignored and treated as X.
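
If rebuilding Diamond is not an option, the input can be cleaned before running it instead. The following is a minimal Python sketch under stated assumptions: `sanitize_fasta` is a hypothetical helper, not a Diamond feature, and mapping "." to "X" simply mirrors the "treated as X" behaviour described above. Whether "." should become "X" or "*" depends on what gffread meant by it.

```python
import io

def sanitize_fasta(src, dst, replacements=None):
    """Copy FASTA text from src to dst, rewriting characters in sequence lines only."""
    if replacements is None:
        replacements = {".": "X"}  # assumption: treat '.' as an unknown residue
    table = str.maketrans(replacements)
    for line in src:
        if line.startswith(">"):
            dst.write(line)  # headers are copied unchanged; they may contain '.'
        else:
            dst.write(line.translate(table))

if __name__ == "__main__":
    # Made-up fragment modelled on the entry quoted earlier in this thread.
    src = io.StringIO(">g2497.t1 gene=g2497\nQTSKEPF.FVQTK\n")
    dst = io.StringIO()
    sanitize_fasta(src, dst)
    print(dst.getvalue(), end="")
```

To write stops instead, call it with `replacements={".": "*"}`. For real files, pass open file handles for `src` and `dst` in place of the `io.StringIO` objects.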

bbuchfink avatar Dec 17 '20 15:12 bbuchfink

Hello! I wonder whether the option to ignore certain characters has been added. Thank you!

yanyew avatar Feb 18 '22 14:02 yanyew

No, sorry, this has not been added.

bbuchfink avatar Feb 22 '22 13:02 bbuchfink