tskit icon indicating copy to clipboard operation
tskit copied to clipboard

document how to compute "per-variant-site" instead of "per-site" statistics

Open petrelharp opened this issue 4 years ago • 4 comments

It'd been pointed out that other tools like

  • ADMIXTOOLS (or, the frontend admixr)
  • treemix

report statistics (eg f3) calculated from only polymorphic sites instead of the whole genome as we do. We should clarify this in the docs (particularly for f2/f3/f4 maybe) and provide an example somewhere of doing the conversion.

petrelharp avatar Nov 15 '21 18:11 petrelharp

Also we might want to point out that even with mode="site" this would (I believe) compute the stats at each site defined in the TS, regardless of whether that site was actually polymorphic or not. I wonder if it is worth having a mode="variable_site" setting, or whether this would just be confusing?

hyanwong avatar Nov 17 '21 11:11 hyanwong

We need to make a distinction between "site" (entry in the site table, possibly with mutations, usually variable but might not be) and "position on the genome" (in the gaps between sites). Which one are you talking about @hyanwong? We can't call them "variable sites" because they're not necessarily variable and that would be even more confusing.

jeromekelleher avatar Nov 17 '21 13:11 jeromekelleher

I mean that our current stats, with mode="site", report stats for all sites defined in the sites table (I think). If other tools report statistics only at polymorphic sites, then they will not necessarily give the same answer, because even if a site is in the sites table, it is not guaranteed to be polymorphic. So perhaps we additionally want a mode="polymorphic/variable site" which returns stats only at explicitly variable sites? It presumably makes no difference if we are averaging along the genome, but maybe there are circumstances where it does?

hyanwong avatar Nov 17 '21 13:11 hyanwong

This is (a) pretty straightforward and (b) rather confusing because of polyallelic sites. To do a "per variant" statistic we just need to do:

 ts.f4(sample_sets, windows) / ts.segregating_sites(union_of_sample_sets, windows)

If all sites are biallelic then this does exactly what you're saying, Yan: it doesn't count sites that are not polymorphic in the sample sets. The only confusing thing is for more-than-two-allelic sites: if a site has 3 alleles present in the sample sets then the site counts twice. But, this is actually correct, because that's what ts.f4( ) does also: it computes f4 for each allele (i.e., for presence/absence of each allele). And, to my knowledge no other software does these statistics for polyallelic sites at all, so there's no worries about agreeing with their definitions.

petrelharp avatar Nov 17 '21 15:11 petrelharp