smudgeplot icon indicating copy to clipboard operation
smudgeplot copied to clipboard

Request for feedback on smudgeplot optimization and interpretation (diploid? haploid?)

Open margaretc-ho opened this issue 1 year ago • 6 comments

Hi Kamil and folks,

Thanks for creating this clever method to examine genome structure using k-mers and releasing the new beta version for us to test out. I'm hoping to get some feedback on how to optimize these plots and interpret the results from running it on our data. I have been using jellyfish/GenomeScope2.0 and smudgeplot (the latest sploidyplot branch) using FASTK for this analysis.

We have two protist species we are working on, one called TM and the other called TC of unknown ploidy. We are trying to determine the ploidy of these two species using your software:

TM species -- Genomescope2.0 plot (p=2 yielded the lowest error):

transformed_log_plot transformed_linear_plot

TM species -- FASTK and smudgeplot commands

FastK -v -t4 -k31 -M16 -T16 9076_9077_PE_FRAG_R1_R2_readscombined.fastq.gz -NTmFastK_Table
PloidyPlot -e12 -k -v -T16 TmFastK_Table 
smudgeplot.py plot -t Tm_gIllumina_MiSeq -o Tm_gIllumina_MiSeq_run FastK_Table_text.smu 

TM species -- Smudgeplot: image image

Smudgeplot seemed to run without errors or warnings in the case of species TM

TC species -- Genomescope2.0 plot (p=2 yielded the lowest error):

transformed_log_plot transformed_linear_plot

TC species -- FASTK and smudgeplot commands

FastK -v -t4 -k31 -M16 -T16 ${Tc_gIlluminatrim_1_fq_gz} ${Tc_gIlluminatrim_2_fq_gz} -NTc_gIlluminaFastK_Table
PloidyPlot -e12 -k -v -T16 Tc_gIlluminaFastK_Table
smudgeplot.py plot -t Tc_gIllumina -o Tc_gIllumina_run Tc_gIlluminaFastK_Table_text.smu 

TC species Smudgeplot: image image

Smudgeplot gave the following warning in the case of TC:

detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to         35
detecting two smudges at the same positions, not enough data for this number of bins lowering number of bins to         30
!! Warning, your coverage filter on the lower end (L =  12      ) is higher than half of the 1n coverage estimate ( 1n / 2 =    10.55
If the real 1n coverage is half of your estimate you would not picked it up due to the filtering.
If you have sufficient coverage, consider reruning the analysis with lower L (something like (1n / 2) - 5)
One good way for verificaiton would be to compare it to GenomeScope estimate of haploid coverage
1 peaks of kmer pairs detected with coverage < (1n_coverage * 2) = 42.2
  kmers_in_peak[#] kmers_in_peak[proportion] summit B / (A + B) summit A + B
1           202830                      0.35               0.48        26.12
This might be due to kmers with sequencing errors present in the kmer dump.
Increasing L might help remove erroneous kmers.

First, how do I resolve this warning/error? Reading it, the first part says to "rerun analysis with lower L" but then the second part advises to "increasing L might help remove erroneous kmers". Should I go lower or higher? :P How do I vary L in smudgeplot plot and how do I determine what is an appropriate L? (i.e what level would you suggest given the kmer coverage in the Genomescope plot?)

Second, do you think that these data are consistent with the smudgeplot prediction that these organisms are diploid? One of the questions we also had was whether Genomescope2.0 and Smudgeplot give enough information to distinguish diploid vs. haploid organisms. In the next comment, I will also show data from trying to run Genomescope2.0 and smudgeplot on a known haploid organism for comparison.

Thank you!

margaretc-ho avatar Sep 13 '23 22:09 margaretc-ho