sourmash icon indicating copy to clipboard operation
sourmash copied to clipboard

Interpreting Sourmash Gather Results

Open krastegar opened this issue 1 year ago • 5 comments

Hi,

I am getting some funky results from the sourmash gather command. I am plotting the results of the csv that gather outputs, but when I sort the values by 'gather_result_ranking' I get a plot that has completely different levels of "total matched kmers". By this I mean that rank 0 does not have the most matched kmers and rank "n" does not have the least matched kmers. From my understanding that is how the ranking from sourmash gather goes in according to which ref genomes were found first (i.e which ref genomes have the largest overlap with my metagenome) and then it continues to look for the next ref. genome that has the largest kmers with the first found sequence......so on and so forth. Could you help explain why I am seeing these weird results?

My code:

    gather_df = gatherDF_cleanUp(gather_df)
    gather_df = gather_df.sort_values(['gather_result_rank'], ascending=False)
    #Plotting figure
    fig, (ax1, ax2) = pylab.subplots(1, 2, figsize=(10, 8), constrained_layout=True)
    SUBSAMPLE_TO = 36
    #pylab.plot(left_df.covered_bp / 1e6, left_df.iloc[::-1].index, 'b.', label='mapped bp to this genome')
    ax1.plot(gather_df.intersect_bp / 1e6,gather_df.iloc[::-1].index, 'g<',
             label='total k-mers matched')
    ax1.plot(gather_df.unique_intersect_bp / 1e6, gather_df.iloc[::-1].index, 'ro',
             label='remaining k-mers matched')

    positions = list(gather_df.index)
    labels = list(reversed(gather_df.Name))

    ax1.set_yticks(positions)
    ax1.set_yticklabels(labels, fontsize='small')

    ax1.set_xlabel('millions of k-mers')
    ax1.axis(ymin=-1, ymax=SUBSAMPLE_TO0)
    ax1.legend(loc='lower right')
    ax1.grid(True, axis='both')

    ax2.plot(gather_df.f_match_orig * 100, gather_df.iloc[::-1].index, 'g<', label='total k-mer cover')
    ax2.plot(gather_df.f_match * 100, gather_df.iloc[::-1].index, 'ro', label='remaining k-mer cover')
    ax2.set_yticks(positions)
    ax2.set_yticklabels([])
    ax2.set_xlabel('% of genome covered')
    ax2.legend(loc='lower left')
    ax2.axis(xmin=40, xmax=102)
    ax2.axis(ymin=-1, ymax=SUBSAMPLE_TO)
    ax2.grid(True)

Screenshot from 2022-08-14 19-43-30

I also attached an image of my plot Screenshot from 2022-08-14 19-43-30

krastegar avatar Aug 15 '22 02:08 krastegar

My apologies for the messy copy and pasting of the code. I will attach a screenshot which might be more useful Screenshot from 2022-08-14 19-46-22

krastegar avatar Aug 15 '22 03:08 krastegar

hi! thanks for all the code, this is great info ;). (I edited your top post to use triple backquotes around the code, BTW - seems to fix the formatting.)

The max unique_intersect_bp should (will? ;) be rank 0, then the second biggest should be rank 1.

My plotting code for this is in box 13 of this notebook.

When I use

    gather_df = gather_df.sort_values(['gather_result_rank'], ascending=False)

in this notebook, it works fine. 🤔

Ahh! OK, it's this:

ax1.axis(ymin=-1, ymax=SUBSAMPLE_TO)

It looks like you're plotting the last SUBSAMPLE_TO matches (interval [-SUBSAMPLE_TO:-1]) instead of the first SUBSAMPLE_TO matches. I think this is because of the way we're reversing the y values in gather_df.iloc inside of ax1.plot.

If, instead, you do

ax1.axis(ymin=max_y - SUBSAMPLE_TO, ymax=max_y)

where max_y is

max(gather_df.iloc[::-1].index)

it will work.

(Another hint that led me to this is that, at least for the early rank gather matches, the red and the green marks should mostly match in values.)

ctb avatar Aug 15 '22 14:08 ctb

Awesome! This fixed the issue. I also have a follow up question. I am using the gather command to get a list of reference genomes to train my classifier, based on kmer composition. I originally was filtering the list of genomes by 'f_match_orig>0.5' . Would you suggest a better filtering condition? Ultimately I want my classifier to be trained on the kmer composition of these reference genomes that sourmash gather outputs.

krastegar avatar Aug 15 '22 19:08 krastegar

Awesome! This fixed the issue.

Fantastic!

I also have a follow up question. I am using the gather command to get a list of reference genomes to train my classifier, based on kmer composition. I originally was filtering the list of genomes by 'f_match_orig>0.5' . Would you suggest a better filtering condition? Ultimately I want my classifier to be trained on the kmer composition of these reference genomes that sourmash gather outputs.

Cool! What kind of classification problem are you tackling?

Just thinking out loud, f_match_orig > 0.5 should mean that the genome is 50% or more present in the original metagenome, based on k-mer overlap. This could be both very stringent (50% of k-mers is a pretty high bar!) and very accepting of overlapping strains (you could pull out over 100,000+ Salmonella genomes for a human microbiome with this criterion).

Instead of f_match_orig, I guess I'd suggest looking at the new ANI columns (see docs or ask questions here!)

I'm not quite sure what to recommend re strains. sourmash gather deals with them using the minimum metagenome cover approach / f_uniq_to_query and f_match, but there is some degree of arbitrariness involved. I think it depends on what you're trying to do with classification!

@taylorreiter might have some input here, too.

ctb avatar Aug 16 '22 15:08 ctb

Thanks for the feedback. What I am trying to do is determine microbial composition of metagenomic samples similar to SourceTracker

krastegar avatar Aug 16 '22 22:08 krastegar

Thanks for the feedback. What I am trying to do is determine microbial composition of metagenomic samples similar to SourceTracker

sounds neat!

my only specific thought for you is that in Meta-analysis of metagenomes via machine learning and assembly graphs reveals strain switches in Crohn’s disease, we (@taylorreiter ;) applied random forests directly to the hashes themselves, rather than their annotations as found via gather. This has some advantages (no annotation bias!) and some disadvantages (there are lots of hashes; and you have to annotate at some point, and many of the hashes will be unknown).

Figures 1 and 4 in that paper are probably the most immediately relevant.

ctb avatar Aug 18 '22 12:08 ctb

Thank you @ctb ! This is extremely helpful!

krastegar avatar Aug 19 '22 16:08 krastegar

hi @krastegar, very fun application, I hope you update us on how your work progresses!

As titus stated, I used the hashes directly to train a random forest classifier. I'm going to ignore all of the reasons this was computationally annoying and focus on the reasons it was biologically annoying, and then brainstorm some ways to get around that.

Using k-mers is good because

  • they don't rely on references so everything can be included
  • it doesn't matter what microbe contributed the sequence, it can just be counted (think 16s where there is a massive amount of shared sequence between distantly related genomes)
  • they're 1:1 between samples; if it's in any sample, it's counted no matter what

k-mers are not good because

  • they're super specific
  • they represent a small amount of sequence

One potential work around for the super specific part would be to use a smaller k size (I used 31, maybe 21 could be better) or use an amino acid alphabet (see https://github.com/dib-lab/distillerycats/ for some underdeveloped code around these ideas).

As you've already discovered, another way to get around these short comings is to summarize up to the genome level. But this also has some problems:

  • some sequences are missed if we don't have a reference for them
  • while gather returns the best strain-level match for genomes in a database :tada:, this can be pretty fragile when using something like e. coli or salmonella where we have TONS of these genomes. Two metagenomes that contain very similar e. coli strains could get different match names. Random forests won't know that these are very similar things, which can cause your signal to get inappropriately distorted. If your metagenomes are composed of genomes that don't have a lot of representatives, this might not be a problem. To get around this problem, you can either summarize to the species level using sourmash taxonomy or you can use something like the GTDB representatives database to make sure only one genome is present per species, thus making sure the same strain is always identified. Or you could come up with some cool new technique to do something like unifrac distance with random forests...which I have no actual idea how to achieve.

I hope this is helpful!

taylorreiter avatar Aug 20 '22 22:08 taylorreiter

Awesome! Thank you for breaking all of that down for me. I'll apply this advice to my project and keep you all updated!

krastegar avatar Aug 21 '22 01:08 krastegar

k - I'll close this issue for now, but please feel free to post questions or comments here or in a new issue!

ctb avatar Aug 21 '22 13:08 ctb