dashing compare with HLL

Hi, I have a question about the output from ./dashing cmp –k10 -p5 -C -S24 --wj-exact reference_.fasta reads.fastq The output i.e. 0.0005 (says that 0.05% of the k-mers in the union are shared)

I was surprised as I used the same command-line with different sizes of k-mers, the number of shared k-mers are increasing, but actually should decreasing. The results is:

k-mer size = 9 --> 06.66655 % are shared k-mer size = 10 --> 24.9956 % are shared k-mer size = 11 --> 60.9045 % are shared k-mer size = 12 --> 97.622 % are shared k-mer size = 13 --> 76.7513 % are shared k-mer size = 14 --> 39.642 % are shared

I am wondering that the number of shared k-mers between 9 and 12 (k-mer size) are increasing, but actually I expected that the number of shared k-mers will be at 9 the maximum and then after 9 will start decreasing.

Dataset: the fastq file has many reads from Listeria and the fasta file has the reference genome human.

As I was running the program with Listeria reads and reference genome from Listeria I had the expected results: by k-mer size 9 was the maximum number of shared k-mer then the number of shared k-mers started decreasing.

Could you please explain it to me, why the number of shared k-mers are increasing between k-mer size 9 and 12?

Thanks, Ahmad

Dec 03 '20 21:12 lutfia95

Hi Ahmad,

Thanks for checking in on this, but I suspect you're running into something a little interesting: both the numerator and the denominator are changing as a function of k.

If you add the additional flag of --sizes, you'll get to see how the set sizes change, in the intersection and for the input sequences. My guess is that the denominator is changing substantially with these fairly small k.

Let me know what you find, and thanks for making the issue.

Best,

Daniel

Dec 04 '20 16:12 dnbaker

Hi Daniel,

I have another question to understand the output exactly, so let`s back to the command-line:

./dashing cmp –k10 -p5 -C -S24 --wj-exact reference_.fasta reads.fastq The output is:

#Path Size (est.) Listeria.fastq 1048584 Listeria_monocytogenes_ATCC_19115_.fasta 888344 ##Names Listeria.fastq Listeria_monocytogenes_ATCC_19115_.fasta Listeria.fastq - 0.847185 Listeria_monocytogenes_ATCC_19115_.fasta - -

And if I run: ./dashing cmp –k10 -p5 -C -S24 --wj-exact --sizes reference_.fasta reads.fastq The output is: #Path Size (est.) Listeria.fastq 1048584 Listeria_monocytogenes_ATCC_19115_.fasta 888344 ##Names Listeria.fastq Listeria_monocytogenes_ATCC_19115_.fasta Listeria.fastq - 888345 Listeria_monocytogenes_ATCC_19115_.fasta - -

Could you please explain to me, what are the numbers 888344, 1048584 and 888345 meaning?

Best, Ahmad

Dec 04 '20 17:12 lutfia95

I think that I understood the output, so the output of my test is:

./dashing cmp -k10 -p5 -C -S24 --sizes reads.fastq reference.fasta #Path Size (est.) reads.fastq 1048218 reference.fasta 767497 ##Names reads.fastq reference.fasta .fasta reads.fastq - 767491 reference.fasta.fasta - -

Explanation: 1048218 is the size of reads set ( IAI ) 767497 is the size of reference set ( IBI ) 767491 not sure about it! ./dashing cmp -k10 -p5 -C -S24 --wj-exact reads.fastq reference.fasta #Path Size (est.) reads.fastq 1048584 reference.fasta 888344 ##Names reads.fastq reference.fasta reads.fastq - 0.847185 reference.fasta - -

Explanation: 1048584 is the union size 888344 is the intersection size 0.847185 is the jaccard similarity (888344/1048584)

** correct?**

Thanks!

best, Ahmad

Dec 07 '20 15:12 lutfia95

Hi Ahmad,

Sure! First, Dashing emits the cardinalities of the input sequences, and then it emits pairwise similarities.

Above, 1048218 is the estimated number of k-mers in the read set, while 767497 is the number of k-mers in the fasta and 767491 is the number of k-mers shared between the read and the reference. This is because --sizes was added, so the emitted results are intersection sizes (= shared k-mers).

Without --sizes, it calculates the fraction of shared k-mers over total k-mers. (|A AND B|) / (|A OR B|). Intersection is the number of k-mers which are in both sets.

Hope it helps, and feel free to ask any more questions,

Daniel

Dec 07 '20 16:12 dnbaker

One more thing to add:

The union and intersection are related.

|A OR B| = |A| + |B| - |A & B|

the emitted quantity for the pair is |A & B| for --sizes, and |A & B| / (|A OR B|) for the Jaccard. We estimate |A|, |B|, and |A OR B| to compute |A & B|.

Dec 07 '20 17:12 dnbaker

Thanks for the answers, I am trying now to run with --containment-index because I want to have the containment score to the reads set. I would like to ask about the output from it, I am running with the same command line: ./dashing cmp -k10 -p5 -C -S24 --containment-index reads.fastq reference.fasta output: reads.fastq 1048218 reference.fasta 767497 reads.fastq 1048218 reference.fasta 767497 reads.fastq 1.000000 0.999992 reference.fasta 0.732186 1.000000

we know that 1048218 is the number of k-mers in reads set and 767497 is the number of k-mers in reference set. The containment score is counted by: (|A & B| / |A|) Can you please explain to me, what are the number 0.999992, 0.732186 and 1.000000 mean?

Best

Ahmad

Dec 10 '20 10:12 lutfia95

I think that the value 0.999992 is the answer from containment score, where |A| is the reads set. correct?

Thanks

best Ahmad

Dec 14 '20 21:12 lutfia95

Hi, sorry for the delay on responding.

In this case, then 0.999992 is the containment score in one direction, while the containment score in the other direction was .73. This is because 73% of the k-mers in one set were in the other set, but in the opposite direction, nearly all k-mers were present in both, which I wouldn't usually expect unless the k was small, which it is in your case. If you want better discrimination, try k = 16 or 20.

Dec 14 '20 21:12 dnbaker

Thanks for the answer, do you mean with one direction, when I take (|A & B| / |A|), where |A| is the reads set? And the other direction will be, where |A| is the reference set ?

Dec 14 '20 22:12 lutfia95

For the containment problem, it computes containment in both directions: |A & B| / |A|, and |A & B| / |B| is the other, meaning how many of k-mers from the reference were in the read set. In this case, only 70% of the k-mers from the reads were contained in the reference k-mer set, but nearly all the k-mers in the reference were in the sequencing data.

Dec 15 '20 16:12 dnbaker

Exactly that is what I thought, the 73% is the containment score from reads( 73% k-mers from the read set are contained in the reference set) that makes sense for me. Thank you very much for your help.

Best Ahmad

Dec 15 '20 17:12 lutfia95

dashing dashing copied to clipboard

compare with HLL

Dataset: the fastq file has many reads from Listeria and the fasta file has the reference genome human.

dashing
dashing copied to clipboard