vsearch
vsearch copied to clipboard
Varying number of columns in blast6out output file from search
I am running the following command:
vsearch --usearch_global $query --id 0.9 --query_cov 0.8 --strand 'both' --maxaccepts 1 --maxrejects 32 --db $ref --threads 16 --blast6out $output
With the expectation that maxaccepts will cause only one hit (the first hit exceeding the search criteria) to be reported.
However, of the 66036 sequences in $query, 18314 of them have > 1 hit reported:
$ cut -f 1 $output | sort | uniq -d | wc -l
18314
$ grep '00009dab2f32e5997c324f5c0822720a' $output
00009dab2f32e5997c324f5c0822720a RS_GCF_001553035.1 99.1 450 450 1 1 1554 -1 0
00009dab2f32e5997c324f5c0822720a RS_GCF_900476045.1 96.0 450 16 2 1 450 1 1554 -1 0
Oddly and perhaps related, the first of two records reported above appears to have the wrong number of columns in the output?
I have confirmed that the input query sequences are unique.
Any clue what is going on? Am I just misinterpreting how maxaccepts
performs?
Hi Nick,
Try this:
--maxhits positive integer
Maximum number of hits to show once the search is terminated (hits are sorted by
decreasing identity). Unlimited by default. ...
I'm not sure if this is a bug or not. One reasonable use case for both these commands is --maxaccepts 100 --maxhits 1
in order to compare lots of similar reads then only report the best one. But in your case, I'm not sure if --maxaccepts 1
should imply --maxhits 1
.
Thanks @colinbrislawn ! I have seen the maxhits
parameter, and I suppose my question could be refined to "what is the difference between maxhits and maxaccepts?"
In the example I gave above my search criteria are quite permissive, so I would expect > 1 hit for just about every query sequence if maxaccepts does not control the number of hits reported.
Hi!
The reason vsearch gives two hits here is that it reports one hit for each strand (because of the --strand both
option). The two strands are searched separately. Using --maxhits 1
as @colinbrislawn suggested will only report the best hit for each query. This is similar to what usearch does (at least in version 7 that most of vsearch is based on).
There should always be 12 columns in the file written to the file specified with --blast6out
. The upper one seems to lack two columns. However, I am not able to reproduce this issue. Which version of vsearch did you use, on which platform?
Thank you for reporting this issue.
Thanks so much for the explanation, @torognes ! That makes sense re: the --strand both
option.
Version/system info: vsearch v2.7.0_linux_x86_64 CentOS release 6.10
Thanks. I am unfortunately still unable to reproduce the issue regarding the missing columns.
Would you be able to try the same with the latest vsearch version 2.14.1?
Or perhaps you could provide the sequence 00009dab2f32e5997c324f5c0822720a from the query, and RS_GCF_001553035.1 and RS_GCF_900476045.1 from the reference database? That might be enough to reproduce it.
Query Seq:
>00009dab2f32e5997c324f5c0822720a
GACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGAAGTTTTTCTGGTGCTTGCACTGGAAAAAACTTAGCGGCGAACGGGTGAGTAACACGTAAAGAACCTGCCTCATAGACTGGGACAACTATTGGAAACGATAGCTAATACCGGATAACAGCATTAACTGCATGGTTGATGTTTGAAAGTTGGTTTTGCTAACACTATGAGATGGCTTTGCGGTGCATTAGCTAGTTGGTGGGGTAAAGGCCTACCAAGGCGACGATGCATAGCCGACCTGAGAGGGTGATCGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATCTTCCGCAATGGGCGAAAGCCTGACGGAGCAACGCCGCGTGAGTGAAGAAGGATTTCGGGTTCGTAAAGCTCTGTTGTTAGGGAAGAATGATTGTGTAGTAA
Ref seqs:
>RS_GCF_900476045.1 d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Gemellaceae;g__Gemella;s__Gemella morbillorum [locus_tag=NZ_LS483440.1-#4] [location=6886..8440] [ssu_len=1554] [contig_len=1760930]
TGGAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGAAGTTTTTCTGGTGCTTGCACTAGAAAAACTTAGCGGCGAACGGGTGAGTAACACGTAAAGAACCTGCCTCATAGACTGGGACAACTATTGGAAACGATAGCTAATACCGGATAACAGTATTTCTCGCATGAGAGATATTTAAAAGTTGGTTATGCTAACACTATGAGATGGCTTTGCGGTGCATTAGCTAGTTGGTGGGGTAAAGGCCCACCAAGGCGACGATGCATAGCCGACCTGAGAGGGTGATCGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATCTTCCGCAATGGGCGAAAGCCTGACGGAGCAACGCCGCGTGAGTGAAGAAGGATTTCGGTTCGTAAAGCTCTGTTGTTAGGGAAGAATGGATATGTAGTAACTATACATGTAAGAGACGGTACCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGCGCGCGCAGGTGGTTTAATAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGTTAAACTTGAGTGCAGGAGAGAAAAGTGGAATTCCTAGTGTAGCGGTGAAATGCGTAGAGATTAGGAGGAACACCAGTGGCGAAGGCGGCTTTTTGGCCTGTAACTGACACTGAGGCGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAGTGCTAAGTGTTGGTCTCATAAGAGATCAGTGCTGCAGCTAACGCATTAAGCACTCCGCCTGGGGAGTACGACCGCAAGGTTGAAACTCAAAGGAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAAGTCTTGACATACTGTGAGGACACAAGAGATTGTGTTGTTCTGACCTTTGGTTAGACACAGATACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATATCTAGTTGCCAGCAGTAAGATGGGGACTCTAGATAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGACTTGGGCTACACACGTGCTACAATGGATAGGAACAAAGAGAAGCGAGCTCGCGAGAGTCAGCCAACCTCATAAAACTATTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGCGAATCAGAATGTCGCGGTGAATACGTTCCCGGGTCTTGTACACACCGCCCGTCACACCACGAGAGTTTGTAACACCCGAAGACGGTGGCCTAACCTTTAAGGAGGGAGCCGGTCACGGTGGGACAGATGATTGGGGTGAAGTCGTAACAAGGTAGCCGTATCGGAAGGTGCGGCTGGATCACCTCCTTTC
>RS_GCF_001553035.1 d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Gemellaceae;g__Gemella;s__Gemella haemolysans_B [locus_tag=NZ_KQ959942.1] [location=53..1607] [ssu_len=1554] [contig_len=1672]
AGGAGGTGATCCAGCCGCACCTTCCGATACGGCTACCTTGTTACGACTTCACCCCAATCATCTGTCCCACCGTGACCGGCTCCCTCCTAAAAGGTTAGGCCACCGTCTTCGGGTGTTACAAACTCTCGTGGTGTGACGGGCGGTGTGTACAAGACCCGGGAACGTATTCACCGCGACATTCTGATTCGCGATTACTAGCGATTCCAGCTTCATGTAGTCGAGTTGCAGACTACAATCCGAACTGAGAATAGTTTTGTGAGGTTTGCTTACTCTCGCGAGCTCGCTTCTCTTTGTTCCTATCCATTGTAGCACGTGTGTAGCCCAAGTCATAAGGGGCATGATGATTTGACGTCATCCCCACCTTCCTCCAGTTTATCACTGGCAGTCTATCTAGAGTCCCCATCTTACTGCTGGCAACTAGATATAAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAACCATGCACCACCTGTATCTGTGTCTAACCAAAGGTCAGAACAACACAATCTCTTGTGTCCTCACAGTATGTCAAGACTTGGTAAGGTTCTTCGCGTTGCTTCGAATTAAACCACATGCTCCACCGCTTGTGCGGGTCCCCGTCAATTCCTTTGAGTTTCAACCTTGCGGTCGTACTCCCCAGGCGGAGTGCTTAATGCGTTAGCTGCAGCACTGATCTCTTATGAGACCAACACTTAGCACTCATCGTTTACGGCGTGGACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGCGCCTCAGTGTCAGTTACAGGCCAAAAAGCCGCCTTCGCCACTGGTGTTCCTCCTAATCTCTACGCATTTCACCGCTACACTAGGAATTCCACTTTTCTCTCCTGCACTCAAGTTTAACAGTTTCCAATGACCCTCCACGGTTGAGCCGTGGGCTTTCACATCAGACTTATTAAACCACCTGCGCGCGCTTTACGCCCAATAATTCCGGACAACGCTTGCCACCTACGTATTACCGCGGCTGCTGGCACGTAGTTAGCCGTGGCTTTCTGGTTAGGTACCGTCTCTACTGTGTATAGTTACTACACAATCATTCTTCCCTAACAACAGAGCTTTACGAACCGAAATCCTTCTTCACTCACGCGGCGTTGCTCCGTCAGGCTTTCGCCCATTGCGGAAGATTCCCTACTGCTGCCTCCCGTAGGAGTCTGGGCCGTGTCTCAGTCCCAGTGTGGCCGATCACCCTCTCAGGTCGGCTATGCATCGTCGCCTTGGTAGGCCTTTACCCCACCAACTAGCTAATGCACCGCAAAGCCATCTCATAGTGTTAGCAAAACCAACTTTTAAACATCAACCATGCAGTTAATGCTGTTATCCGGTATTAGCTATCGTTTCCAATAGTTGTCCCAGTCTATGAGGCAGGTTCTTTACGTGTTACTCACCCGTTCGCCGCTAAGTTTTTCCGGTGCAAGCACCAGAAAAACTTCGCTCGACTTGCATGTATTAGGCACGCCGCCAGCGTTCGTCCTGAGCCAGGATCAAACTCTCCAAAT
Thanks for providing the sequences. Unfortunately, I am still unable to reproduce the issue.
I have used vsearch 2.7.0 on CentOS 6.9 and run the commands with the supplied sequences. I get the expected results:
00009dab2f32e5997c324f5c0822720a RS_GCF_001553035.1 99.1 450 2 2 450 1 1 1554 -1 0
00009dab2f32e5997c324f5c0822720a RS_GCF_900476045.1 96.0 450 16 2 1 450 1 1554 -1 0
The database is much smaller here, but otherwise things should be fairly similar. The first line now has the missing 2's in columns 5 and 6.
Thanks @torognes I am not able to reproduce this with the mini example, but have re-run the original and can confirm that I am getting the same issue with missing columns:
$ head $cdir/vsearch-test-out.txt
648faabd722ce275b262ab7b0d52b65a RS_GCF_000010185.1 98.4 450 450 1 1518 -1 0
648faabd722ce275b262ab7b0d52b65a RS_GCF_003182075.1 98.2 450 450 1 1 1518 -1 0
7280cad3ed62eafa39fa88f1ded9b3d0 RS_GCF_000263635.1 98.7 452 450 1 1 1523 -1 0
7280cad3ed62eafa39fa88f1ded9b3d0 RS_GCF_001042655.1 97.6 452 450 1 1523 -1 0
21b24ad03dd8c3e903bae52d422e34cd RS_GCF_000025205.1 98.7 452 450 1 1 1522 -1 0
21b24ad03dd8c3e903bae52d422e34cd RS_GCF_000263595.1 98.5 452 450 1 1522 -1 0
a9b8f9f6510aee7e522cb49562c209d3 RS_GCF_001543105.1 99.6 450 450 1 1545 -1 0
a9b8f9f6510aee7e522cb49562c209d3 RS_GCF_002884575.1 91.3 450 37 2 450 1 1 1544 -1 0
dbc984b6a29a60e18b21d8ee7240725e RS_GCF_002088015.1 99.1 451 450 1 1560 -1 0
dbc984b6a29a60e18b21d8ee7240725e RS_GCF_000614735.1 96.9 451 12 2 450 1 1 1557 -1 0
I will attempt to test again with the latest vsearch release, but this will be a bit delayed as I do not have root access to the CentOS system I am using. If you would like to test with the reference database I am using you can access the file here:
https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/89.0/bac120_ssu_r89.fna
Hi @torognes I tested in version 2.14.1 and I am getting the same result.
$ vsearch --version
vsearch v2.14.1_linux_x86_64, 62.9GB RAM, 24 cores
$ head $cdir/vsearch-test-out.txt
e0f7b19a89bf28b2c37ecbb9450b7aed GB_GCA_003096895.1 97.6 450 450 1 1483 -1 0
e0f7b19a89bf28b2c37ecbb9450b7aed GB_GCA_000503195.1 93.3 450 26 3 450 1 1 1485 -1 0
a9b8f9f6510aee7e522cb49562c209d3 RS_GCF_001543105.1 99.6 450 450 1 1545 -1 0
a9b8f9f6510aee7e522cb49562c209d3 RS_GCF_002884575.1 91.3 450 37 2 450 1 1 1544 -1 0
21b24ad03dd8c3e903bae52d422e34cd RS_GCF_000025205.1 98.7 452 450 1 1 1522 -1 0
21b24ad03dd8c3e903bae52d422e34cd RS_GCF_000263595.1 98.5 452 450 1 1522 -1 0
648faabd722ce275b262ab7b0d52b65a RS_GCF_000010185.1 98.4 450 450 1 1518 -1 0
648faabd722ce275b262ab7b0d52b65a RS_GCF_003182075.1 98.2 450 450 1 1 1518 -1 0
dbc984b6a29a60e18b21d8ee7240725e RS_GCF_002088015.1 99.1 451 450 1 1560 -1 0
dbc984b6a29a60e18b21d8ee7240725e RS_GCF_000614735.1 96.9 451 12 2 450 1 1 1557 -1 0
Here are my query sequences (this is HMP microbiome data if you are curious of the source) if you want to test (the db I already linked to above): dna-sequences.fasta.zip
Thanks for providing the sequences and running the search again.
This is weird. I run exactly the same searches (same two vsearch versions, x86, linux, Centos 6.9) and always get the expected results with no columns missing.
I wonder if it could be something related to threads, but each of these lines are printed in a single (atomic) fprintf statement (results.cc, line 162) and the printing is also protected by a mutex.
I'll look further into this.
@nbokulich and @torognes I was not able to reproduce this bug (i.e., unexpected number of columns in blast6out
output). Using the same vsearch release, the same command line arguments, and the same input data, I launched 2,000 independent jobs on a large cluster running CentOS 7. They all produced the same expected output.
I am closing that issue. Feel free to open it again if new information becomes available.