Querying with presketches: 'std::bad_alloc'

Open mihkelvaher opened this issue 4 years ago • 6 comments

Hi!

I'm using the same references (-F) multiple times and thought it would be faster to sketch them once and then query using only the sketches.

Sketching: dashing sketch -F references.fasta_paths.txt -k 32 -p 2 --sketch-size 20 --use-bb-minhash

Querying: dashing dist -F references.sketch_paths.txt -k 32 -p 2 --sizes --sketch-size 20 --use-bb-minhash -T -Q testdata_path.txt --presketched

Query error:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted

Removing -Q testdata_path.txt --presketched produces a square matrix without errors. Dashing version: v0.5-3-g03c10

Minor comments/questions:

  1. Is the "#Names" line intentionally missing from the query output? I ask because it is present in the square matrix, and getting the order from the sizes section/file seems a bit odd.
  2. Is there any way to specify the sketch output dir/name? -o seems to put them all into a single file.
  3. dashing -v outputs the version twice: probably once as on every invocation, and once in response to the option itself.

mihkelvaher avatar Jul 31 '20 12:07 mihkelvaher

Hi! Thanks for making the issue.

  1. For what you're trying to do, you want the --cache-sketches/-W option; this caches a sketch adjacent to each input file (e.g., something like input1.fq.s10.hll for input1.fq, input2.fq.s10.hll for input2.fq...). The --presketched-only option treats the filenames as binary files containing sketches. With that option enabled, dashing was trying to load a binary HLL sketch from the input fastx files, which is why it failed.

  2. The sketching option (without -o) puts sketches adjacent to the fastx files (as in 1). You can also pass -P/--prefix, which prepends a prefix to the path where sketches are written; I've used this to put sketches into a specific folder.

tl;dr: If you want to use --presketched-only, sketch the files, create a file consisting of paths to the sketches, and then run your second command using that file.
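A minimal sketch of the "create a file consisting of paths to the sketches" step, assuming the sketches sit next to their inputs and differ from them only by a suffix. The suffix value is an assumption (the reply only says the name looks "something like" input1.fq.s10.hll), and the helper names sketch_paths and write_sketch_list are hypothetical, so check a real sketch filename before relying on it.

```python
from pathlib import Path

# Assumption: the suffix dashing appends to each input path. The thread only
# says it looks "something like" .s10.hll; inspect a real sketch file first.
ASSUMED_SUFFIX = ".s10.hll"


def sketch_paths(fasta_paths, suffix=ASSUMED_SUFFIX):
    """Map each input path to the sketch path expected next to it."""
    return [path + suffix for path in fasta_paths]


def write_sketch_list(fasta_list, sketch_list, suffix=ASSUMED_SUFFIX):
    """Read one input path per line and write one sketch path per line."""
    lines = Path(fasta_list).read_text().splitlines()
    fastas = [line.strip() for line in lines if line.strip()]  # skip blank lines
    sketches = sketch_paths(fastas, suffix)
    Path(sketch_list).write_text("".join(s + "\n" for s in sketches))
    return sketches
```

For example, write_sketch_list("references.fasta_paths.txt", "references.sketch_paths.txt") would produce the list to pass with -F. Since the option treats every listed file as a sketch, the query list given with -Q would need the same treatment.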

If you just want to avoid re-sketching each time you run, use -W/--cache-sketches

Minor comments:

  1. Names are emitted in both the -o output and the -O output. In the example run below, the -o file holds the names + cardinalities of the input sequences and the -O file holds the distance table with its ##Names header, so I think you may just want to set the -O parameter. See below for an example run.
  2. See above RE: --prefix
  3. This is correct; having added the version to all invocations, the -v option does this twice. We'll remove this in the next release.

Feel free to ask if you have any more questions, and thanks!

Daniel

$ ./dashing dist -F fnames.txt -Q fnames.txt -o table -O sizes
Dashing version: v0.5-3-g8b24
$ cat table
#Path	Size (est.)
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz	4829255
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz	2718859
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz	2433839
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz	2368528
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz	4829255
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz	2718859
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz	2433839
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz	2368528
$ cat sizes
##Names	bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz	bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz	bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz	bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz
bonsai/test/GCF_001723155.1_ASM172315v1_genomic.fna.gz	1.000000	0.000000	0.000000	0.000000
bonsai/test/GCF_000302455.1_ASM30245v1_genomic.fna.gz	0.000000	1.000000	0.000000	0.000000
bonsai/test/GCF_000953115.1_DSM1535_genomic.fna.gz	0.000000	0.000000	1.000000	0.550403
bonsai/test/GCF_000762265.1_ASM76226v1_genomic.fna.gz	0.000000	0.000000	0.550403	1.000000
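If the row and column order matters downstream, the matrix above can be read by name rather than by position. This is a minimal sketch, assuming the file is tab-separated and starts with the ##Names header row as in the output shown; parse_distance_table is a hypothetical helper, not part of dashing.

```python
def parse_distance_table(text):
    """Return (column_names, {row_name: {column_name: value}})."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split("\t")
    # Assumption: the first row is the "##Names" header listing the columns.
    if not header[0].lstrip("#").startswith("Names"):
        raise ValueError("expected a ##Names header row")
    names = header[1:]
    table = {}
    for line in lines[1:]:
        fields = line.split("\t")
        # The first field is the row label; the rest are the values.
        table[fields[0]] = dict(zip(names, map(float, fields[1:])))
    return names, table
```

The function raises ValueError when the header row is absent, so a run that omits it fails loudly instead of silently mislabeling columns.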

dnbaker avatar Jul 31 '20 13:07 dnbaker

Hi!

Presketching: Looking at the code and the help, the flag is --presketched and not --presketched-only (the name of the variable in the code)? Did I understand correctly that --presketched is meant to be used on the single file (also mentioned above) that is written with -o when running dashing sketch ... -o single_file_containing_many_sketches?

Caching: This seems to be the easiest solution at the moment. I tried it out, and caching once and then removing the fastas works too. Also: the short -W doesn't seem to do anything, whereas the long --cache-sketches deposits the sketches into the fasta dir as expected.

Missing "#Names": They were missing because I was using -T while querying, which doesn't make sense, come to think of it. I removed it and all is good.

All the best, Mihkel

mihkelvaher avatar Aug 03 '20 12:08 mihkelvaher

Hi Mihkel,

Sorry for making you wait.

You're correct, it is --presketched, not --presketched-only. Presketched means that the files themselves contain one sketch per file. The dist_by_seq command is for performing distance calculations from a file containing a number of sketches, which could have been created by sketching with the -o parameter, or by sketching each sequence in a file separately with sketch_by_seq.

Unfortunately, what to do with sequence names and metadata for both approaches isn't obvious to me.

Thanks for the report, and good luck!

Daniel

dnbaker avatar Aug 19 '20 15:08 dnbaker

Hi!

Maybe it's a WIP, but I noticed a commit (or two) mentioning -W and --cache-sketches. While the longer version works, the shorter one outputs:

Dashing version: v0.5-5-g5210
terminate called after throwing an instance of 'std::bad_alloc'
terminate called recursively
  what():  std::bad_alloc
Aborted

Mihkel

mihkelvaher avatar Aug 27 '20 07:08 mihkelvaher

Thank you! I was trying to address the cache-sketches issue, but modified the wrong variable. It should be fixed now in both master and dev.

dnbaker avatar Aug 28 '20 14:08 dnbaker

I can confirm that -W works now. Was the ##Names row intentionally removed in v0.5-5-g5210 when querying with -Q?

mihkelvaher avatar Aug 31 '20 12:08 mihkelvaher