Restructure statistics.html
After a brief chat with @petrelharp we thought it might be time to revisit the general structure of https://tskit.dev/tskit/docs/stable/stats.html, because (for instance) it doesn't mention the LD calculator stuff, or have obvious room for @gtsambos's IBD stuff, both of which can be thought of as "multi-site stats". We should also add GNN, which is a single site stat, and probably leave room somewhere for talking about time windows.
The current structure can be roughly summarised as follows:
# Statistics
intro
----------------
## Available statistics
### Single site stats (just lists them out below)
Some basic stats (e.g. diversity, AFS)
#### Patterson’s f statistics
f4, f3, f2
#### Y statistics
Y3, Y2
#### Trait correlations
#### Derived statistics
#### General methods
### AFS
complex stuff about gotchas when calculating AFS for multi-allelic sites
----------------
## Interface
intro
### Mode
### Sample sets and indexes
#### One-way methods
#### Multi-way methods
### Windows
### Span normalize
### Output format
### Output dimensions
----------------
## General API
stuff here about general API for single site stats, e.g. polarization
I'll follow up below with a suggestion for an updated structure.
I've worked up one rearrangement into a rough PR at https://github.com/tskit-dev/tskit/pull/1499, which puts the main list of "Available stats" into a table rather than a set of lists with headers, as I found the long-list style difficult to take in at a glance.
Links from the titles in the table take you to a "notes" section below, so the AFS details fit there nicely. The doc structure then allows for a place for the multi-site stats docs (assuming they don't fit into the "General API"):
# Statistics
intro - introduce single-site vs multi-site stats
## Available stats
*new table*
### Notes
#### Patterson’s f statistics
#### Y statistics
#### Trait correlations
#### Derived statistics
#### AFS
## Single site stats
### Interface
#### Mode
...
### General API
#### Polarization
...
## Multi site stats
todo
Hey @hyanwong, it's about time these statistics got a top-down restructure -- thanks for taking this on. I have two comments:
- Technically the IBD methods in tskit at the moment aren't returning statistics on the tree sequences, they're returning lists of IBD segments -- which can then be processed/aggregated in various ways to make statistics, but this isn't yet implemented. There are certainly lots of useful stats that might be of interest to tskit users, and most of them will be straightforward to code up -- perhaps it would be worth chatting about this with others at some point (@petrelharp)?
- What is meant by 'multi-way' statistics, exactly? Does this mean the user has to supply several sets of sample nodes, instead of just one? Is this the most meaningful way to be categorising the statistics?
Thanks @gtsambos: I don't know much about the IBD interface. I was hoping you might write something about it from an analysis point of view. Peter and I have been discussing the "multi-way" thing in https://github.com/tskit-dev/tskit/pull/1499. See if you prefer the list-type layout there.
> Technically the ibd methods in tskit at the moment aren't returning statistics on the tree sequences, they're returning lists of IBD segments -- which can then be processed/aggregated in various ways to make statistics, but this isn't yet implemented.
Good point. I think we can still refer to them here, though. And, yes, let's think about summary stats!
> What is meant by 'multi-way' statistics, exactly? Does this mean the user has to supply several sets of sample nodes, instead of just one? Is this the most meaningful way to be categorising the statistics?
Right - like Fst, for instance. It's a meaningful distinction because the interface is different - for multi-way stats you pass in an `indexes` argument to say which sample sets you're comparing.
> > ...this isn't yet implemented.
>
> Good point. I think we can still refer to them here, though. And, yes, let's think about summary stats!
Please do!
> like Fst, for instance. It's a meaningful distinction because the interface is different - for multi-way stats you pass in an `indexes` argument to say which sample sets you're comparing.
I also think it's a meaningful philosophical distinction: are you comparing between groups or genomes, or within them?
Also to note here - we should document the use of `keep_intervals` to calculate a single statistic for a given region (and note that this won't work for mode=site until https://github.com/tskit-dev/tskit/issues/287 is fixed).
After discussion with @jeromekelleher we think that people coming in to the page might want to see stuff about modes, windows etc at a high level. That's especially the case if we implement windows etc in the multi-site stats, which I guess is the plan. So we should plan to move e.g.
### Interface
to
## Interface
and
#### Mode
to
### Mode