dashing
Querying with presketches: 'std::bad_alloc'
Hi!
I'm using the same references (-F) multiple times and thought it would be faster to sketch them once and use only the sketches for future queries.
Sketching:
dashing sketch -F references.fasta_paths.txt -k 32 -p 2 --sketch-size 20 --use-bb-minhash
Querying:
dashing dist -F references.sketch_paths.txt -k 32 -p 2 --sizes --sketch-size 20 --use-bb-minhash -T -Q testdata_path.txt --presketched
Query error:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted
Removing -Q testdata_path.txt --presketched outputs a square matrix without errors.
Dashing version: v0.5-3-g03c10
Minor comments/questions:
- Is the missing "#Names" line intentional in query output? Asking because it is present in the square matrix and getting the order from the "sizes section/file" seems a bit odd.
- Is there any way to specify sketch output dir/name? -o seems to put them all into a single file.
- dashing -v outputs the version twice. Probably one time as always, the other as a response to the command.
Hi! Thanks for making the issue.
1. For what you're trying to do, you want to use the --cache-sketches/-W option; this will cache a sketch adjacent to the input filename (e.g., something like input1.fq.s10.hll for input1.fq, input2.fq.s10.hll for input2.fq, ...). The --presketched-only option treats the filenames as binary files containing sketches. Enabling the option there meant that dashing was trying to load a binary HLL sketch from the input fastx files, which meant it didn't work.
2. The sketching option (without -o) puts sketches adjacent to the fastx files (as in 1). You can specify a prefix with -P/--prefix, which prepends a prefix to the path where sketches are written; I've used this to put sketches into a specific folder.
tl;dr:
If you want to use --presketched-only, sketch the files, create a file consisting of paths to the sketches, and then run your second command using that file.
If you just want to avoid re-sketching each time you run, use -W/--cache-sketches.
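The --presketched route needs a file listing the sketch paths. Here is a minimal sketch of generating that file from the FASTA path list, assuming each sketch sits next to its input and is named by appending a single suffix. The function name is hypothetical and not part of dashing, and the suffix is left as a parameter because it depends on the sketch settings; check the files dashing actually wrote before relying on it.

```python
from pathlib import Path


def write_sketch_path_list(fasta_list, sketch_list, suffix):
    """Write one sketch path per input path by appending `suffix`.

    Assumption: each sketch sits next to its input and is named
    <input path> + <suffix>. The suffix is not hard-coded because it
    varies with the sketch settings; verify it against the files on disk.
    """
    text = Path(fasta_list).read_text()
    # Skip blank lines so a trailing newline does not produce a bogus entry.
    inputs = [line.strip() for line in text.splitlines() if line.strip()]
    sketches = [path + suffix for path in inputs]
    Path(sketch_list).write_text("".join(s + "\n" for s in sketches))
    return sketches
```

For example, `write_sketch_path_list("references.fasta_paths.txt", "references.sketch_paths.txt", ".s10.hll")` would produce the list passed to -F in the query command, where ".s10.hll" is only the illustrative suffix from the comment above.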
Minor comments:
- Names are emitted in the -o output and in the -O output; the first is the distance table and the latter is names + cardinalities of the input sequences, so I think you may just want to be setting the -O parameter. See below for an example run.
- See above RE: --prefix
- This is correct; having added the version to all invocations, the -v option does this twice. We'll remove this in the next release.
Feel free to ask if you have any more questions, and thanks!
Daniel
$ ./dashing dist -F fnames.txt -Q fnames.txt -o table -O sizes
Dashing version: v0.5-3-g8b24
$ cat table
#Path Size (est.)
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz 4829255
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz 2718859
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz 2433839
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz 2368528
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz 4829255
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz 2718859
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz 2433839
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz 2368528
$ cat sizes
##Names bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz 1.000000 0.000000 0.000000 0.000000
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz 0.000000 1.000000 0.000000 0.000000
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz 0.000000 0.000000 1.000000 0.550403
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz 0.000000 0.000000 0.550403 1.000000
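For downstream use, the matrix printed by `cat sizes` above can be read with its ##Names header supplying the column order. This is a minimal sketch, assuming whitespace-separated fields, paths without whitespace, and a full square or rectangular table; `read_distance_matrix` is a hypothetical helper, not part of dashing.

```python
def read_distance_matrix(lines):
    """Parse a matrix whose first non-empty line starts with '##Names'.

    Assumptions: fields are whitespace-separated, paths contain no
    whitespace, and every row has one value per name in the header.
    Returns (column_names, {row_name: [float, ...]}).
    """
    rows = [line.split() for line in lines if line.strip()]
    if not rows or rows[0][0] != "##Names":
        raise ValueError("expected a '##Names' header line")
    names = rows[0][1:]
    matrix = {}
    for fields in rows[1:]:
        values = [float(x) for x in fields[1:]]
        if len(values) != len(names):
            raise ValueError(
                f"row {fields[0]!r} has {len(values)} values, expected {len(names)}"
            )
        matrix[fields[0]] = values
    return names, matrix
```

Reading the column order from the header avoids having to recover it from the separate sizes output, which was the concern raised in the original question.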
Hi!
Presketching
Looking at the code and the help, the flag is --presketched and not --presketched-only (name of the var in code)?
Did I understand correctly that --presketched is meant to be used on the single file (mentioned also above) that is output with -o while doing dashing sketch ... -o single_file_containing_many_sketches?
Caching
This seems to be the easiest solution at the moment. Trying it out, caching once and removing the fastas works too.
Also: the short -W doesn't seem to do anything whereas the long --cache-sketches deposits the sketches into the fasta dir as expected.
Missing "#Names"
They were missing because I was using -T while querying, which doesn't make sense come to think of it. Removed it and all good.
All the best, Mihkel
Hi Mihkel,
Sorry for making you wait.
You're correct, it is --presketched, not presketched-only. Presketched means that the files themselves contain one sketch per file. The dist_by_seq command is for performing distance calculations from a file containing a number of sketches, which could have been created by sketching with the -o parameter, or by sketching each sequence in a file separately with sketch_by_seq.
Unfortunately, what to do with sequence names and metadata for both approaches isn't intuitively obvious to me.
Thanks for the report, and good luck!
Daniel
Hi!
Maybe it's a WIP but I noticed a commit (or two) mentioning -W and --cache-sketches. While the longer version works, the shorter outputs:
Dashing version: v0.5-5-g5210
terminate called after throwing an instance of 'std::bad_alloc'
terminate called recursively
what(): std::bad_alloc
Aborted
Mihkel
Thank you! I was trying to address the cache-sketches issue, but modified the wrong variable. It should be fixed now in both master and dev.
I can confirm that -W works now.
Is the ##Names row intentionally removed in v0.5-5-g5210 while querying with -Q?