vsearch icon indicating copy to clipboard operation
vsearch copied to clipboard

Varying number of columns in blast6out output file from search

Open nbokulich opened this issue 5 years ago • 10 comments

I am running the following command:

vsearch --usearch_global $query --id 0.9 --query_cov 0.8 --strand 'both' --maxaccepts 1 --maxrejects 32 --db $ref --threads 16 --blast6out $output

With the expectation that maxaccepts will cause only one hit (the first hit exceeding the search criteria) to be reported.

However, of the 66036 sequences in $query, 18314 of them have > 1 hit reported:

$ cut -f 1 $output | sort | uniq -d | wc -l
18314

$ grep '00009dab2f32e5997c324f5c0822720a' $output
00009dab2f32e5997c324f5c0822720a	RS_GCF_001553035.1	99.1	450	450	1	1	1554	-1	0
00009dab2f32e5997c324f5c0822720a	RS_GCF_900476045.1	96.0	450	16	2	1	450	1	1554	-1	0

Oddly and perhaps related, the first of two records reported above appears to have the wrong number of columns in the output?

I have confirmed that the input query sequences are unique.

Any clue what is going on? Am I just misinterpreting how maxaccepts performs?

nbokulich avatar Oct 05 '19 22:10 nbokulich

Hi Nick,

Try this:

--maxhits positive integer
Maximum number of hits to show once the search is terminated (hits are sorted by
decreasing identity). Unlimited by default. ...

I'm not sure if this is a bug or not. One reasonable use case for both these commands is --maxaccepts 100 --maxhits 1 in order to compare lots of similar reads then only report the best one. But in your case, I'm not sure if --maxaccepts 1 should imply --maxhits 1.

colinbrislawn avatar Oct 05 '19 23:10 colinbrislawn

Thanks @colinbrislawn ! I have seen the maxhits parameter, and I suppose my question could be refined to "what is the difference between maxhits and maxaccepts?"

In the example I gave above my search criteria are quite permissive, so I would expect > 1 hit for just about every query sequence if maxaccepts does not control the number of hits reported.

nbokulich avatar Oct 05 '19 23:10 nbokulich

Hi!

The reason vsearch gives two hits here is that it reports one hit for each strand (because of the --strand both option). The two strands are searched separately. Using --maxhits 1 as @colinbrislawn suggested will only report the best hit for each query. This is similar to what usearch does (at least in version 7 that most of vsearch is based on).

There should always be 12 columns in the file written to the file specified with --blast6out. The upper one seems to lack two columns. However, I am not able to reproduce this issue. Which version of vsearch did you use, on which platform?

Thank you for reporting this issue.

torognes avatar Oct 07 '19 13:10 torognes

Thanks so much for the explanation, @torognes ! That makes sense re: the --strand both option.

Version/system info: vsearch v2.7.0_linux_x86_64 CentOS release 6.10

nbokulich avatar Oct 07 '19 13:10 nbokulich

Thanks. I am unfortunately still unable to reproduce the issue regarding the missing columns.

Would you be able to try the same with the latest vsearch version 2.14.1?

Or perhaps you could provide the sequence 00009dab2f32e5997c324f5c0822720a from the query, and RS_GCF_001553035.1 and RS_GCF_900476045.1 from the reference database? That might be enough to reproduce it.

torognes avatar Oct 07 '19 14:10 torognes

Query Seq:

>00009dab2f32e5997c324f5c0822720a
GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGAAGTTTTTCTGGTGCTTGCACTGGAAAAAACTTAGCGGCGAACGGGTGAGTAACACGTAAAGAACCTGCCTCATAGACTGGGACAACTATTGGAAACGATAGCTAATACCGGATAACAGCATTAACTGCATGGTTGATGTTTGAAAGTTGGTTTTGCTAACACTATGAGATGGCTTTGCGGTGCATTAGCTAGTTGGTGGGGTAAAGGCCTACCAAGGCGACGATGCATAGCCGACCTGAGAGGGTGATCGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATCTTCCGCAATGGGCGAAAGCCTGACGGAGCAACGCCGCGTGAGTGAAGAAGGATTTCGGGTTCGTAAAGCTCTGTTGTTAGGGAAGAATGATTGTGTAGTAA

Ref seqs:

>RS_GCF_900476045.1 d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Gemellaceae;g__Gemella;s__Gemella morbillorum [locus_tag=NZ_LS483440.1-#4] [location=6886..8440] [ssu_len=1554] [contig_len=1760930]
TGGAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGAAGTTTTTCTGGTGCTTGCACTAGAAAAACTTAGCGGCGAACGGGTGAGTAACACGTAAAGAACCTGCCTCATAGACTGGGACAACTATTGGAAACGATAGCTAATACCGGATAACAGTATTTCTCGCATGAGAGATATTTAAAAGTTGGTTATGCTAACACTATGAGATGGCTTTGCGGTGCATTAGCTAGTTGGTGGGGTAAAGGCCCACCAAGGCGACGATGCATAGCCGACCTGAGAGGGTGATCGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATCTTCCGCAATGGGCGAAAGCCTGACGGAGCAACGCCGCGTGAGTGAAGAAGGATTTCGGTTCGTAAAGCTCTGTTGTTAGGGAAGAATGGATATGTAGTAACTATACATGTAAGAGACGGTACCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGTGGTTTAATAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGTTAAACTTGAGTGCAGGAGAGAAAAGTGGAATTCCTAGTGTAGCGGTGAAATGCGTAGAGATTAGGAGGAACACCAGTGGCGAAGGCGGCTTTTTGGCCTGTAACTGACACTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAGTGCTAAGTGTTGGTCTCATAAGAGATCAGTGCTGCAGCTAACGCATTAAGCACTCCGCCTGGGGAGTACGACCGCAAGGTTGAAACTCAAAGGAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAAGTCTTGACATACTGTGAGGACACAAGAGATTGTGTTGTTCTGACCTTTGGTTAGACACAGATACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATATCTAGTTGCCAGCAGTAAGATGGGGACTCTAGATAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGACTTGGGCTACACACGTGCTACAATGGATAGGAACAAAGAGAAGCGAGCTCGCGAGAGTCAGCCAACCTCATAAAACTATTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGCGAATCAGAATGTCGCGGTGAATACGTTCCCGGGTCTTGTACACACCGCCCGTCACACCACGAGAGTTTGTAACACCCGAAGACGGTGGCCTAACCTTTAAGGAGGGAGCCGGTCACGGTGGGACAGATGATTGGGGTGAAGTCGTAACAAGGTAGCCGTATCGGAAGGTGCGGCTGGATCACCTCCTTTC
>RS_GCF_001553035.1 d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Gemellaceae;g__Gemella;s__Gemella haemolysans_B [locus_tag=NZ_KQ959942.1] [location=53..1607] [ssu_len=1554] [contig_len=1672]
AGGAGGTGATCCAGCCGCACCTTCCGATACGGCTACCTTGTTACGACTTCACCCCAATCATCTGTCCCACCGTGACCGGCTCCCTCCTAAAAGGTTAGGCCACCGTCTTCGGGTGTTACAAACTCTCGTGGTGTGACGGGCGGTGTGTACAAGACCCGGGAACGTATTCACCGCGACATTCTGATTCGCGATTACTAGCGATTCCAGCTTCATGTAGTCGAGTTGCAGACTACAATCCGAACTGAGAATAGTTTTGTGAGGTTTGCTTACTCTCGCGAGCTCGCTTCTCTTTGTTCCTATCCATTGTAGCACGTGTGTAGCCCAAGTCATAAGGGGCATGATGATTTGACGTCATCCCCACCTTCCTCCAGTTTATCACTGGCAGTCTATCTAGAGTCCCCATCTTACTGCTGGCAACTAGATATAAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAACCATGCACCACCTGTATCTGTGTCTAACCAAAGGTCAGAACAACACAATCTCTTGTGTCCTCACAGTATGTCAAGACTTGGTAAGGTTCTTCGCGTTGCTTCGAATTAAACCACATGCTCCACCGCTTGTGCGGGTCCCCGTCAATTCCTTTGAGTTTCAACCTTGCGGTCGTACTCCCCAGGCGGAGTGCTTAATGCGTTAGCTGCAGCACTGATCTCTTATGAGACCAACACTTAGCACTCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCGCCTCAGTGTCAGTTACAGGCCAAAAAGCCGCCTTCGCCACTGGTGTTCCTCCTAATCTCTACGCATTTCACCGCTACACTAGGAATTCCACTTTTCTCTCCTGCACTCAAGTTTAACAGTTTCCAATGACCCTCCACGGTTGAGCCGTGGGCTTTCACATCAGACTTATTAAACCACCTGCGCGCGCTTTACGCCCAATAATTCCGGACAACGCTTGCCACCTACGTATTACCGCGGCTGCTGGCACGTAGTTAGCCGTGGCTTTCTGGTTAGGTACCGTCTCTACTGTGTATAGTTACTACACAATCATTCTTCCCTAACAACAGAGCTTTACGAACCGAAATCCTTCTTCACTCACGCGGCGTTGCTCCGTCAGGCTTTCGCCCATTGCGGAAGATTCCCTACTGCTGCCTCCCGTAGGAGTCTGGGCCGTGTCTCAGTCCCAGTGTGGCCGATCACCCTCTCAGGTCGGCTATGCATCGTCGCCTTGGTAGGCCTTTACCCCACCAACTAGCTAATGCACCGCAAAGCCATCTCATAGTGTTAGCAAAACCAACTTTTAAACATCAACCATGCAGTTAATGCTGTTATCCGGTATTAGCTATCGTTTCCAATAGTTGTCCCAGTCTATGAGGCAGGTTCTTTACGTGTTACTCACCCGTTCGCCGCTAAGTTTTTCCGGTGCAAGCACCAGAAAAACTTCGCTCGACTTGCATGTATTAGGCACGCCGCCAGCGTTCGTCCTGAGCCAGGATCAAACTCTCCAAAT

nbokulich avatar Oct 07 '19 16:10 nbokulich

Thanks for providing the sequences. Unfortunately, I am still unable to reproduce the issue.

I have used vsearch 2.7.0 on CentOS 6.9 and run the commands with the supplied sequences. I get the expected results:

00009dab2f32e5997c324f5c0822720a	RS_GCF_001553035.1	99.1	450	2	2	450	1	1	1554	-1	0
00009dab2f32e5997c324f5c0822720a	RS_GCF_900476045.1	96.0	450	16	2	1	450	1	1554	-1	0

The database is much smaller here, but otherwise things should be fairly similar. The first line now has the missing 2's in columns 5 and 6.

torognes avatar Oct 08 '19 09:10 torognes

Thanks @torognes I am not able to reproduce this with the mini example, but have re-run the original and can confirm that I am getting the same issue with missing columns:

$ head $cdir/vsearch-test-out.txt
648faabd722ce275b262ab7b0d52b65a	RS_GCF_000010185.1	98.4	450	450	1	1518	-1	0
648faabd722ce275b262ab7b0d52b65a	RS_GCF_003182075.1	98.2	450	450	1	1	1518	-1	0
7280cad3ed62eafa39fa88f1ded9b3d0	RS_GCF_000263635.1	98.7	452	450	1	1	1523	-1	0
7280cad3ed62eafa39fa88f1ded9b3d0	RS_GCF_001042655.1	97.6	452	450	1	1523	-1	0
21b24ad03dd8c3e903bae52d422e34cd	RS_GCF_000025205.1	98.7	452	450	1	1	1522	-1	0
21b24ad03dd8c3e903bae52d422e34cd	RS_GCF_000263595.1	98.5	452	450	1	1522	-1	0
a9b8f9f6510aee7e522cb49562c209d3	RS_GCF_001543105.1	99.6	450	450	1	1545	-1	0
a9b8f9f6510aee7e522cb49562c209d3	RS_GCF_002884575.1	91.3	450	37	2	450	1	1	1544	-1	0
dbc984b6a29a60e18b21d8ee7240725e	RS_GCF_002088015.1	99.1	451	450	1	1560	-1	0
dbc984b6a29a60e18b21d8ee7240725e	RS_GCF_000614735.1	96.9	451	12	2	450	1	1	1557	-1	0

I will attempt to test again with the latest vsearch release, but this will be a bit delayed as I do not have root access to the CentOS system I am using. If you would like to test with the reference database I am using you can access the file here:

https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/89.0/bac120_ssu_r89.fna

nbokulich avatar Oct 08 '19 14:10 nbokulich

Hi @torognes I tested in version 2.14.1 and I am getting the same result.

$ vsearch --version
vsearch v2.14.1_linux_x86_64, 62.9GB RAM, 24 cores

$ head $cdir/vsearch-test-out.txt
e0f7b19a89bf28b2c37ecbb9450b7aed	GB_GCA_003096895.1	97.6	450	450	1	1483	-1	0
e0f7b19a89bf28b2c37ecbb9450b7aed	GB_GCA_000503195.1	93.3	450	26	3	450	1	1	1485	-1	0
a9b8f9f6510aee7e522cb49562c209d3	RS_GCF_001543105.1	99.6	450	450	1	1545	-1	0
a9b8f9f6510aee7e522cb49562c209d3	RS_GCF_002884575.1	91.3	450	37	2	450	1	1	1544	-1	0
21b24ad03dd8c3e903bae52d422e34cd	RS_GCF_000025205.1	98.7	452	450	1	1	1522	-1	0
21b24ad03dd8c3e903bae52d422e34cd	RS_GCF_000263595.1	98.5	452	450	1	1522	-1	0
648faabd722ce275b262ab7b0d52b65a	RS_GCF_000010185.1	98.4	450	450	1	1518	-1	0
648faabd722ce275b262ab7b0d52b65a	RS_GCF_003182075.1	98.2	450	450	1	1	1518	-1	0
dbc984b6a29a60e18b21d8ee7240725e	RS_GCF_002088015.1	99.1	451	450	1	1560	-1	0
dbc984b6a29a60e18b21d8ee7240725e	RS_GCF_000614735.1	96.9	451	12	2	450	1	1	1557	-1	0

Here are my query sequences (this is HMP microbiome data if you are curious of the source) if you want to test (the db I already linked to above): dna-sequences.fasta.zip

nbokulich avatar Oct 08 '19 15:10 nbokulich

Thanks for providing the sequences and running the search again.

This is weird. I run exactly the same searches (same two vsearch versions, x86, linux, Centos 6.9) and always get the expected results with no columns missing.

I wonder if it could be something related to threads, but each of these lines are printed in a single (atomic) fprintf statement (results.cc, line 162) and the printing is also protected by a mutex.

I'll look further into this.

torognes avatar Oct 09 '19 09:10 torognes

@nbokulich and @torognes I was not able to reproduce this bug (i.e., unexpected number of columns in blast6out output). Using the same vsearch release, the same command line arguments, and the same input data, I launched 2,000 independent jobs on a large cluster running CentOS 7. They all produced the same expected output.

I am closing that issue. Feel free to open it again if new information becomes available.

frederic-mahe avatar Dec 20 '23 16:12 frederic-mahe