diamond icon indicating copy to clipboard operation
diamond copied to clipboard

blast tab output: stitle versus salltitles

Open deprekate opened this issue 4 years ago • 4 comments

For the blast_tab format, diamond is printing out all stitles rather than only the first when using the stitle flag.

$ blastx -query seq2.fna -db nr  -outfmt '6 qseqid sseqid pident stitle' | head -n1
K00124:494:H7722BBXX:5:1101:17574:2012	ref|WP_094509356.1|	77.551	glutaminase A [Lactobacillus reuteri]
$ diamond blastx -q seq2.fna -d nr  --outfmt 6 qseqid sseqid pident stitle | head -n1
K00124:494:H7722BBXX:5:1101:17574:2012	WP_094509356.1	77.6	WP_094509356.1 glutaminase A [Lactobacillus reuteri] >MRG62619.1 glutaminase A [Lactobacillus reuteri] >OYS46007.1 glutaminase A [Lactobacillus reuteri] >OYS47986.1 glutaminase A [Lactobacillus reuteri] >OYS51132.1 glutaminase A [Lactobacillus reuteri] >OYS54155.1 glutaminase A [Lactobacillus reuteri]

Also when specifying salltitles, they are not separated by <> like the help menu says:

$ diamond help | grep salltitles
	salltitles means All Subject Title(s), separated by a '<>'
$ blastx -query seq2.fna -db nr  -outfmt '6 qseqid sseqid pident salltitles' | head -n1
K00124:494:H7722BBXX:5:1101:17574:2012	ref|WP_094509356.1|	77.551	glutaminase A [Lactobacillus reuteri]<>glutaminase A [Lactobacillus reuteri]<>glutaminase A [Lactobacillus reuteri]<>glutaminase A [Lactobacillus reuteri]<>glutaminase A [Lactobacillus reuteri]<>glutaminase A [Lactobacillus reuteri]
$ diamond blastx -q seq2.fna -d nr  --outfmt 6 qseqid sseqid pident salltitles | head -n1
K00124:494:H7722BBXX:5:1101:17574:2012	WP_094509356.1	77.6	WP_094509356.1 glutaminase A [Lactobacillus reuteri] >MRG62619.1 glutaminase A [Lactobacillus reuteri] >OYS46007.1 glutaminase A [Lactobacillus reuteri] >OYS47986.1 glutaminase A [Lactobacillus reuteri] >OYS51132.1 glutaminase A [Lactobacillus reuteri] >OYS54155.1 glutaminase A [Lactobacillus reuteri]

deprekate avatar Jun 23 '20 23:06 deprekate

I ended up coding a fix for both the stitle/sseqid bug and the stitle/salltitles

I tried to keep the code as little changed as possible, and also in the same style

https://github.com/deprekate/diamond/commit/a1df95613f95288b0fcabfcec8bd3feb5c45cb90#diff-1a9234261157fee6f1e888a81f078b77

However I got rid of various code that referenced "splitting" on id_delimiters: print_escaped_until(buf, id, full_titles ? "\1" : Const::id_delimiters, esc) And id_delimiters is a weird assortment of control characters, which are not usual fasta format delimiters: const char* Const::id_delimiters = " \a\b\f\n\r\t\v\1";

deprekate avatar Jun 24 '20 06:06 deprekate

Thanks, feel free to post a pull request. \1 used to be the character for separating different accessions. I don't know if it's still used, but backward compatibility with these files should be maintained.

bbuchfink avatar Jun 24 '20 08:06 bbuchfink

Ah, maybe you were right in suggesting a separate flag for trimming the seqids from the defline. While testing my code with test sequences I realized that blast does not correctly trim seqids, even when telling it to.

Original untouched NR

$ blastx -query seq.fna -db nr -outfmt '6 qseqid sseqid pident stitle' | head -n1
K00124:494:H7722BBXX:5:1101:17574:2012	ref|WP_094509356.1|	77.551	glutaminase A [Lactobacillus reuteri]

Dump the hit to fasta and then build a blast database (using all potential flags to make it work right)

$ blastdbcmd -db ~/bio/db/nr/nr -entry WP_094509356.1 -long_seqids > test.faa
$ makeblastdb -in test.faa -out test -dbtype prot -input_type fasta -parse_seqids
$ blastx -query seq.fna -db test -outfmt '6 qseqid sseqid pident stitle' | head -n1
K00124:494:H7722BBXX:5:1101:17574:2012	ref|WP_094509356.1|	77.551	glutaminase A [Lactobacillus reuteri] >gb|MRG62619.1| glutaminase A [Lactobacillus reuteri] >gb|OYS46007.1| glutaminase A [Lactobacillus reuteri] >gb|OYS47986.1| glutaminase A [Lactobacillus reuteri] >gb|OYS51132.1| glutaminase A [Lactobacillus reuteri] >gb|OYS54155.1| glutaminase A [Lactobacillus reuteri]

So the current behavior of diamond is correct. And the issue is with blast, and it not running correctly on any database that is not the original downloaded from GenBank. Without using -parse_seqids flag, which is the same behavior as diamond:

$ makeblastdb -in test.faa -out test -dbtype prot -input_type fasta
$ blastx -query seq.fna -db test -outfmt '6 qseqid sseqid pident stitle'
K00124:494:H7722BBXX:5:1101:17574:2012	ref|WP_094509356.1|	77.551	ref|WP_094509356.1| glutaminase A [Lactobacillus reuteri] >gb|MRG62619.1| glutaminase A [Lactobacillus reuteri] >gb|OYS46007.1| glutaminase A [Lactobacillus reuteri] >gb|OYS47986.1| glutaminase A [Lactobacillus reuteri] >gb|OYS51132.1| glutaminase A [Lactobacillus reuteri] >gb|OYS54155.1| glutaminase A [Lactobacillus reuteri]

I ran into this issue years ago when I wanted to add a few sequences to the 16SMicrobial and found that once I dumped the sequences and then rebuilt, no manner of flags could recreate the behavior on the original untouched 16SMicrobial. Something to do with internal ordinal ids.


The question is whether to use a -parse_seqids flag during diamond makedb to either trim off the first seqid (I don't know whether this would mess with diamond's inner workings), or set a flag in the database file so the trimming happens in the diamond blastx step. The other option is to have a -parse_seqids flag during the diamond blastx command.

I guess for now I will let this sit for now to see if anyone else weighs in, and in the meantime just use sed to trim the output accordingly.

deprekate avatar Jun 24 '20 21:06 deprekate

Indeed, the behaviour of blast also depends on whether you use -parse_seqids and where you got the database from. Diamond databases always store the full length sequence titles, if trimming is to happen it should be during the alignment run.

bbuchfink avatar Jun 26 '20 08:06 bbuchfink