sourmash
sourmash copied to clipboard
explain p_match and p_query in sourmash documentation
it's not easy to find!
from @bluegenes in an e-mail response -
-
Yes the p_match (percent of reference genome matched) does depend on both the relative abundance of that genome in the metagenome (because this can affect the # k-mers from this genome found in the metagenome signature) and the similarity with the reference genome (# of k-mers that will match). However, what you're interested in is ANI, or % DNA identity, right? We can get a decent approximation of ANI by using the "Mash Distance" equations (see https://github.com/dib-lab/sourmash/issues/1242 for explanation and a python function for conversion). Note that these were developed for jaccard, not containment. The revisions we're working on for Scaled Minhash --> ANI will improve the accuracy for Scaled MinHash containment and put confidence intervals on our estimates.
-
There are a few use cases where folks want to go from hashes/k-mer sequences back to reads, but it's still a work in progress. If you're planning on doing alignments, Titus has been actively developing a workflow for this ("genome-grist"), which uses sourmash to identify genomes of interest from genbank/other references, download them, and then map the metagenome reads to those reference genomes. You can find it here: https://github.com/dib-lab/genome-grist and I'm happy to help you try it out!
These sourmash issues are directly related to your #2 and might also be useful: https://github.com/dib-lab/sourmash/issues/1237 https://github.com/dib-lab/sourmash/issues/483
@ctb I would extend this request to add a clear explanation of "overlap" and if "overlap", "p_query", or "p_match" are calculated/interrupted differently in the context of abund sketches
This question is motived by unexpected values of overlap as seen below:
overlap p_query p_match
--------- ------- -------
7.2 Mbp 6.1% 0.5% GCA_006912095.1 Massospora platypedia...
5.8 Mbp 3.1% 0.8% GCA_004523865.1 Nephromyces sp. ex Mo...
4.2 Mbp 2.2% 0.4% GCA_900893395.1 Euglena gracilis stra...
4.8 Mbp 1.9% 0.6% GCA_009809945.1 Gigaspora margarita s...
4.1 Mbp 1.6% 0.4% GCA_003724095.1 Austropuccinia psidii...
3.2 Mbp 1.0% 0.5% GCA_009731375.1 Paulinella micropora ...
3.8 Mbp 0.9% 0.4% GCA_003550325.1 Gigaspora rosea strai...
2.4 Mbp 0.8% 0.2% GCA_000978595.1 Saccharina japonica c...
2.5 Mbp 0.7% 0.3% GCA_000711775.1 Oxytricha trifallax s...
2.2 Mbp 0.7% 0.1% GCA_009767595.1 Symbiodinium kawaguti...
4.1 Mbp 0.6% 0.7% GCA_002104975.1 Neocallimastix califo...
1.9 Mbp 0.6% 0.1% GCA_000507305.1 Breviolum minutum Mf ...
The last 2 entries have the same p_query but the overlap seems to be proportional to the p_match which does not make sense to me
The documentation here (appendix A, on sourmash gather) and here (on abundance weighting) would probably be the place to look. See also appendix B, with detailed info on abundance calculations.
Specific suggestions for changes would be very welcome!
The most likely explanation for the above situation is that the results are from query signatures computed with --track-abundance
. Overlap does not use abundance weighting, while p_match
does. If you could run this same search with --ignore-abundance
and report back that would be lovely :)
code references
The (more optimized, hence ugly/unreadable) code for this is in the src/sourmash/search.py
, class Gather Databases
, method __next__
. For example,
f_match = len(intersect_mh) / len(found_mh)
f_orig_query = len(intersect_orig_mh) / orig_query_len
The code in tests/test_sourmash.py::test_gather_f_match_orig
checks that the calculations match against the straight up sourmash signature / sketch code -- e.g.
# f_orig_query is the containment of the query by the match.
# (note, this only works because containment is 100% in combined).
assert approx_equal(combined_sig.contained_by(match), f_orig_query)
but it's only tested for non-abund signature at the moment. It would be a great addition if someone were to do something similar with abund signatures.
Oh, and the code for what's actually output by the sourmash CLI for p_match when you run sourmash gather is in src/sourmash/commands.py
, function gather(...)
, and it looks like this:
pct_query = '{:.1f}%'.format(result.f_unique_weighted*100)
so it does in fact use the f_unique_weighted
which is abundance-weighted for track-abund signatures.
Regarding the documentation structure, I have some specific suggestions:
- I saw the page of "Using sourmash from the command line" as the entry point to the detailed tutorial. While the page "Classifying signatures: ..." is more like a blog post explaining the technique. Therefore, as a I user, I am expecting the detailed tutorial page to explain the command options, how to use, and the format of each command's inputs/outputs. As an alternative, we might have a separate page for inputs/outputs formats that we can cross linked everywhere.
- The section of "appendix A, on sourmash gather" has a clear explanation of the csv output but does not link this anyhow to the standard output table of gather.
I think --track-abundance
affects p_query
not p_match
as you are saying above but the example in the appendix is confusing because you are changing the query. So I did some testing for the abundance effect:
I am using 2 samples (sampleA.fq and sampleB.fq). Each one has 25k reads.
You can find the samples in /home/tahmed/test_sourmash_abund
Then I created a campsite metagenome of (10 * sampleA) + (sampleB)
> data/metagenome1_10.fq
for x in {1..10};do echo $x;
cat data/sampleA.fq >> data/metagenome1_10.fq
done
cat data/sampleB.fq >> data/metagenome1_10.fq
Then queried the metagenome against the 2 samples with and without --track-abundance
mkdir sig1000
sourmash sketch dna -p k=21,scaled=1000,abund data/sampleA.fq -o sig1000/sampleA.sig --name sampleA
sourmash sketch dna -p k=21,scaled=1000,abund data/sampleB.fq -o sig1000/sampleB.sig --name sampleB
sourmash sketch dna -p k=21,scaled=1000,abund data/metagenome1_10.fq -o sig1000/metagenome1_10.sig --name metagenome1_10
sourmash index database sig1000/sampleA.sig sig1000/sampleB.sig
sourmash gather sig1000/metagenome1_10.sig database.sbt.zip
sourmash gather --ignore-abundance sig1000/metagenome1_10.sig database.sbt.zip
The output with abund tracking
overlap p_query p_match avg_abund
--------- ------- ------- ---------
2.3 Mbp 17.6% 100.0% 2.6 sampleB
1.7 Mbp 82.4% 99.1% 17.0 sampleA
The output with --ignore-abundance
overlap p_query p_match
--------- ------- -------
2.3 Mbp 58.0% 100.0% sampleB
1.7 Mbp 42.0% 99.1% sampleA
So far, I can not find a clear definition to p_query
. But here is my guessing:
if p_match
has the same definition of f_match
, then it means "how much of the match is in the remaining query , after all of the previous matches have been removed".
Similarly p_query
means "how much of the remaining query is in the match" which consider the abundance (f_unique_weighted) by default unless --ignore-abundance
is specified
ref https://github.com/sourmash-bio/sourmash/issues/1227