dashing
dashing copied to clipboard
compare with HLL
Hi,
I have a question about the output from
./dashing cmp –k10 -p5 -C -S24 --wj-exact reference_.fasta reads.fastq
The output i.e. 0.0005 (says that 0.05% of the k-mers in the union are shared)
I was surprised as I used the same command-line with different sizes of k-mers, the number of shared k-mers are increasing, but actually should decreasing. The results is:
k-mer size = 9 --> 06.66655 % are shared k-mer size = 10 --> 24.9956 % are shared k-mer size = 11 --> 60.9045 % are shared k-mer size = 12 --> 97.622 % are shared k-mer size = 13 --> 76.7513 % are shared k-mer size = 14 --> 39.642 % are shared
I am wondering that the number of shared k-mers between 9 and 12 (k-mer size) are increasing, but actually I expected that the number of shared k-mers will be at 9 the maximum and then after 9 will start decreasing.
Dataset: the fastq file has many reads from Listeria and the fasta file has the reference genome human.
As I was running the program with Listeria reads and reference genome from Listeria I had the expected results: by k-mer size 9 was the maximum number of shared k-mer then the number of shared k-mers started decreasing.
Could you please explain it to me, why the number of shared k-mers are increasing between k-mer size 9 and 12?
Thanks, Ahmad
Hi Ahmad,
Thanks for checking in on this, but I suspect you're running into something a little interesting: both the numerator and the denominator are changing as a function of k
.
If you add the additional flag of --sizes
, you'll get to see how the set sizes change, in the intersection and for the input sequences. My guess is that the denominator is changing substantially with these fairly small k.
Let me know what you find, and thanks for making the issue.
Best,
Daniel
Hi Daniel,
I have another question to understand the output exactly, so let`s back to the command-line:
./dashing cmp –k10 -p5 -C -S24 --wj-exact reference_.fasta reads.fastq
The output is:
#Path Size (est.) Listeria.fastq 1048584 Listeria_monocytogenes_ATCC_19115_.fasta 888344 ##Names Listeria.fastq Listeria_monocytogenes_ATCC_19115_.fasta Listeria.fastq - 0.847185 Listeria_monocytogenes_ATCC_19115_.fasta - -
And if I run:
./dashing cmp –k10 -p5 -C -S24 --wj-exact --sizes reference_.fasta reads.fastq
The output is:
#Path Size (est.)
Listeria.fastq 1048584
Listeria_monocytogenes_ATCC_19115_.fasta 888344
##Names Listeria.fastq Listeria_monocytogenes_ATCC_19115_.fasta
Listeria.fastq - 888345
Listeria_monocytogenes_ATCC_19115_.fasta - -
Could you please explain to me, what are the numbers 888344, 1048584 and 888345 meaning?
Best, Ahmad
I think that I understood the output, so the output of my test is:
./dashing cmp -k10 -p5 -C -S24 --sizes reads.fastq reference.fasta
#Path Size (est.)
reads.fastq 1048218
reference.fasta 767497
##Names reads.fastq reference.fasta .fasta
reads.fastq - 767491
reference.fasta.fasta - -
Explanation: 1048218 is the size of reads set ( IAI )
767497 is the size of reference set ( IBI )
767491 not sure about it!
./dashing cmp -k10 -p5 -C -S24 --wj-exact reads.fastq reference.fasta
#Path Size (est.)
reads.fastq 1048584
reference.fasta 888344
##Names reads.fastq reference.fasta
reads.fastq - 0.847185
reference.fasta - -
Explanation: 1048584 is the union size 888344 is the intersection size 0.847185 is the jaccard similarity (888344/1048584)
** correct?**
Thanks!
best, Ahmad
Hi Ahmad,
Sure! First, Dashing emits the cardinalities of the input sequences, and then it emits pairwise similarities.
Above, 1048218
is the estimated number of k-mers in the read set, while 767497
is the number of k-mers in the fasta and 767491
is the number of k-mers shared between the read and the reference. This is because --sizes
was added, so the emitted results are intersection sizes (= shared k-mers).
Without --sizes, it calculates the fraction of shared k-mers over total k-mers. (|A AND B|) / (|A OR B|). Intersection is the number of k-mers which are in both sets.
Hope it helps, and feel free to ask any more questions,
Daniel
One more thing to add:
The union and intersection are related.
|A OR B| = |A| + |B| - |A & B|
the emitted quantity for the pair is |A & B| for --sizes, and |A & B| / (|A OR B|) for the Jaccard. We estimate |A|, |B|, and |A OR B| to compute |A & B|.
Thanks for the answers, I am trying now to run with --containment-index
because I want to have the containment score to the reads set.
I would like to ask about the output from it, I am running with the same command line:
./dashing cmp -k10 -p5 -C -S24 --containment-index reads.fastq reference.fasta
output:
reads.fastq 1048218
reference.fasta 767497
reads.fastq 1048218
reference.fasta 767497
reads.fastq 1.000000 0.999992
reference.fasta 0.732186 1.000000
we know that 1048218
is the number of k-mers in reads set and 767497
is the number of k-mers in reference set.
The containment score is counted by: (|A & B| / |A|)
Can you please explain to me, what are the number 0.999992
, 0.732186
and 1.000000
mean?
Best
Ahmad
I think that the value 0.999992 is the answer from containment score, where |A| is the reads set. correct?
Thanks
best Ahmad
Hi, sorry for the delay on responding.
In this case, then 0.999992 is the containment score in one direction, while the containment score in the other direction was .73. This is because 73% of the k-mers in one set were in the other set, but in the opposite direction, nearly all k-mers were present in both, which I wouldn't usually expect unless the k was small, which it is in your case. If you want better discrimination, try k = 16 or 20.
Thanks for the answer, do you mean with one direction, when I take (|A & B| / |A|), where |A| is the reads set? And the other direction will be, where |A| is the reference set ?
For the containment problem, it computes containment in both directions: |A & B| / |A|, and |A & B| / |B| is the other, meaning how many of k-mers from the reference were in the read set. In this case, only 70% of the k-mers from the reads were contained in the reference k-mer set, but nearly all the k-mers in the reference were in the sequencing data.
Exactly that is what I thought, the 73% is the containment score from reads( 73% k-mers from the read set are contained in the reference set) that makes sense for me. Thank you very much for your help.
Best Ahmad