msmc-tools
msmc-tools copied to clipboard
Two subspecies populations don't coalesce further back in the past
Hello, I am facing a very unexpected result from MSMC2 and am very confused about what happened. Any insights would be helpful! Lingonberry_cross_coalescence_refgenome_minus.pdf Effective population size looked like this, which was also unexpectedly low on the recent years: Lingonberry_effective_population_size_refgenome_minus.pdf
I have a diploid plant species which has two identified subspecies based on their geographical origin: European and North American. My inputs are one individual per subspecies, and am treating these as two populations (paternal and maternal haplotype from each individual but unphased genome assembly). I want to know when the two subspecies/populations split in time, and so I performed the cross-coalescence analysis, assuming four haplotypes (2 from each subspecies) in total.
I've mapped short-read whole genome sequencing data of ~14X and ~32X coverage, respectively, on the North American genome assembly as the reference genome, called variants, made input mask files for genomic repeats and mappable regions based on sequencing coverage, and performed MSMC2 as follows:
#4. Single population - effective population size estimate with MSMC2.
for chr in `cat Lingonberry_ragtag.scaffold.chrnames.txt`;
do
generate_multihetsep.py --chr $chr \
--mask $INDIR/Lingonberry_minus_F122991_paired_reads_${chr}.mask.bed.gz \
--mask $INDIR/Lingonberry_minus_F122990_paired_reads_RedCandy_${chr}.mask.bed.gz \
--negative_mask $INDIR/Lingonberry_minus_excluded_regions.mask.bed \
$INDIR/Lingonberry_minus_F122991_paired_reads_${chr}.vcf.gz \
$INDIR/Lingonberry_minus_F122990_paired_reads_RedCandy_${chr}.vcf.gz \
> $OUTDIR/Lingonberry_minus_RedCandy_chr*.multihetsep.txt
done
msmc2 -t 1 -p 1*2+16*1+1*2 -o $OUTDIR/Lingonberry_minus.msmc2 -I 0,1 \
$OUTDIR/Lingonberry_minus_RedCandy_chr*.multihetsep.txt
msmc2 -t 1 -p 1*2+16*1+1*2 -o $OUTDIR/Lingonberry_RedCandy.msmc2 -I 2,3 \
$OUTDIR/Lingonberry_minus_RedCandy_chr*.multihetsep.txt
#5. Population separation history - estimate of split between two populations
msmc2 -t 1 -I 0-2,0-3,1-2,1-3 -s -p 1*2+16*1+1*2 -o $OUTDIR/Lingonberry_minus_RedCandy.msmc2 \
$OUTDIR/Lingonberry_minus_RedCandy_chr*.multihetsep.txt
#6. Combine results for two population coalescence estimate
combineCrossCoal.py $OUTDIR/Lingonberry_minus_RedCandy.msmc2.final.txt $OUTDIR/Lingonberry_minus.msmc2.final.txt \
$OUTDIR/Lingonberry_RedCandy.msmc2.final.txt > Lingonberry_minus_RedCandy.msmc2.combined.msmc2.final.txt
Do you spot a critical mistake on the code? Why does my cross-coalescence plot look like this?
Thank you, Kaede
Is the plant selfing? Could the extremely low population size be due to long blocks of homozygosity? That would then also screw up the cross coalescence rate
Hmm the plant can self, but the main reproductive strategy is known to be outcrossing in nature... and that's why the results were unexpected. Our (draft) heterozygosity genome-wide plot looks like this: RedCandy_percent_het_masked_vcf.pdf minus_percent_het_masked_vcf.pdf There could be some runs of homozygosity where the heterozygote calls are reduced but I am not convinced yet... The gaps with no points are missing data, when there is no alleles called within that 100kb window.
If the two subspecies are both showing enough sign of inbreeding/homozygosity, could you please give me a brief explanation of why that could screw up the cross coalescence analysis? (I am very new to popgen models, sorry for taking a step-back!)
Thank you for your help.