diamond
diamond copied to clipboard
blast tab output: stitle versus salltitles
For the blast_tab format, diamond is printing out all stitles rather than only the first when using the stitle flag.
$ blastx -query seq2.fna -db nr -outfmt '6 qseqid sseqid pident stitle' | head -n1
K00124:494:H7722BBXX:5:1101:17574:2012 ref|WP_094509356.1| 77.551 glutaminase A [Lactobacillus reuteri]
$ diamond blastx -q seq2.fna -d nr --outfmt 6 qseqid sseqid pident stitle | head -n1
K00124:494:H7722BBXX:5:1101:17574:2012 WP_094509356.1 77.6 WP_094509356.1 glutaminase A [Lactobacillus reuteri] >MRG62619.1 glutaminase A [Lactobacillus reuteri] >OYS46007.1 glutaminase A [Lactobacillus reuteri] >OYS47986.1 glutaminase A [Lactobacillus reuteri] >OYS51132.1 glutaminase A [Lactobacillus reuteri] >OYS54155.1 glutaminase A [Lactobacillus reuteri]
Also when specifying salltitles, they are not separated by <> like the help menu says:
$ diamond help | grep salltitles
salltitles means All Subject Title(s), separated by a '<>'
$ blastx -query seq2.fna -db nr -outfmt '6 qseqid sseqid pident salltitles' | head -n1
K00124:494:H7722BBXX:5:1101:17574:2012 ref|WP_094509356.1| 77.551 glutaminase A [Lactobacillus reuteri]<>glutaminase A [Lactobacillus reuteri]<>glutaminase A [Lactobacillus reuteri]<>glutaminase A [Lactobacillus reuteri]<>glutaminase A [Lactobacillus reuteri]<>glutaminase A [Lactobacillus reuteri]
$ diamond blastx -q seq2.fna -d nr --outfmt 6 qseqid sseqid pident salltitles | head -n1
K00124:494:H7722BBXX:5:1101:17574:2012 WP_094509356.1 77.6 WP_094509356.1 glutaminase A [Lactobacillus reuteri] >MRG62619.1 glutaminase A [Lactobacillus reuteri] >OYS46007.1 glutaminase A [Lactobacillus reuteri] >OYS47986.1 glutaminase A [Lactobacillus reuteri] >OYS51132.1 glutaminase A [Lactobacillus reuteri] >OYS54155.1 glutaminase A [Lactobacillus reuteri]
I ended up coding a fix for both the stitle/sseqid bug and the stitle/salltitles
I tried to keep the code as little changed as possible, and also in the same style
However I got rid of various code that referenced "splitting" on id_delimiters:
print_escaped_until(buf, id, full_titles ? "\1" : Const::id_delimiters, esc)
And id_delimiters is a weird assortment of control characters, which are not usual fasta format delimiters:
const char* Const::id_delimiters = " \a\b\f\n\r\t\v\1";
Thanks, feel free to post a pull request. \1
used to be the character for separating different accessions. I don't know if it's still used, but backward compatibility with these files should be maintained.
Ah, maybe you were right in suggesting a separate flag for trimming the seqids from the defline. While testing my code with test sequences I realized that blast does not correctly trim seqids, even when telling it to.
Original untouched NR
$ blastx -query seq.fna -db nr -outfmt '6 qseqid sseqid pident stitle' | head -n1
K00124:494:H7722BBXX:5:1101:17574:2012 ref|WP_094509356.1| 77.551 glutaminase A [Lactobacillus reuteri]
Dump the hit to fasta and then build a blast database (using all potential flags to make it work right)
$ blastdbcmd -db ~/bio/db/nr/nr -entry WP_094509356.1 -long_seqids > test.faa
$ makeblastdb -in test.faa -out test -dbtype prot -input_type fasta -parse_seqids
$ blastx -query seq.fna -db test -outfmt '6 qseqid sseqid pident stitle' | head -n1
K00124:494:H7722BBXX:5:1101:17574:2012 ref|WP_094509356.1| 77.551 glutaminase A [Lactobacillus reuteri] >gb|MRG62619.1| glutaminase A [Lactobacillus reuteri] >gb|OYS46007.1| glutaminase A [Lactobacillus reuteri] >gb|OYS47986.1| glutaminase A [Lactobacillus reuteri] >gb|OYS51132.1| glutaminase A [Lactobacillus reuteri] >gb|OYS54155.1| glutaminase A [Lactobacillus reuteri]
So the current behavior of diamond is correct. And the issue is with blast, and it not running correctly on any database that is not the original downloaded from GenBank. Without using -parse_seqids flag, which is the same behavior as diamond:
$ makeblastdb -in test.faa -out test -dbtype prot -input_type fasta
$ blastx -query seq.fna -db test -outfmt '6 qseqid sseqid pident stitle'
K00124:494:H7722BBXX:5:1101:17574:2012 ref|WP_094509356.1| 77.551 ref|WP_094509356.1| glutaminase A [Lactobacillus reuteri] >gb|MRG62619.1| glutaminase A [Lactobacillus reuteri] >gb|OYS46007.1| glutaminase A [Lactobacillus reuteri] >gb|OYS47986.1| glutaminase A [Lactobacillus reuteri] >gb|OYS51132.1| glutaminase A [Lactobacillus reuteri] >gb|OYS54155.1| glutaminase A [Lactobacillus reuteri]
I ran into this issue years ago when I wanted to add a few sequences to the 16SMicrobial and found that once I dumped the sequences and then rebuilt, no manner of flags could recreate the behavior on the original untouched 16SMicrobial. Something to do with internal ordinal ids.
The question is whether to use a -parse_seqids flag during diamond makedb
to either trim off the first seqid (I don't know whether this would mess with diamond's inner workings), or set a flag in the database file so the trimming happens in the diamond blastx
step.
The other option is to have a -parse_seqids flag during the diamond blastx
command.
I guess for now I will let this sit for now to see if anyone else weighs in, and in the meantime just use sed to trim the output accordingly.
Indeed, the behaviour of blast also depends on whether you use -parse_seqids
and where you got the database from. Diamond databases always store the full length sequence titles, if trimming is to happen it should be during the alignment run.