Restructure statistics.html

Open hyanwong opened this issue 4 years ago • 7 comments

After a brief chat with @petrelharp we thought it might be time to revisit the general structure of https://tskit.dev/tskit/docs/stable/stats.html, because (for instance) it doesn't mention the LDcalc stuff, or have obvious room for @gtsambos's IBD stuff, both of which can be thought of as "multi-site stats". We should also add GNN, which is a single site stat, and probably leave room somewhere for talking about time windows.

The current structure can be roughly summarised as follows:

# Statistics
intro
----------------
## Available statistics

### Single site stats (just lists them out below)
Some basic stats (e.g. diversity, AFS)
#### Patterson’s f statistics
f4, f3, f2
#### Y statistics
Y3, Y2
#### Trait correlations
#### Derived statistics
#### General methods

### AFS
complex stuff about gotchas when calculating AFS for multi-allelic sites

----------------
## Interface
intro
### Mode
### Sample sets and indexes
#### One-way methods
#### Multi-way methods
### Windows
### Span normalize
### Output format
### Output dimensions

----------------
## General API
stuff here about general API for single site stats, e.g. polarization

I'll follow up below with a suggestion for an updated structure.

hyanwong, Jun 17 '21 09:06

I've worked up one rearrangement into a rough PR at https://github.com/tskit-dev/tskit/pull/1499, which puts the main list of "Available stats" into a table rather than a set of lists with headers, as I found the long-list style was difficult to see at-a-glance:

[Image: Screenshot 2021-06-17 at 12 26 36]

Links from the titles in the table take you to a "notes" section below, so the AFS details fit there nicely. The doc structure then allows for a place for the multi-site stats docs (assuming they don't fit into the "General API"):

# Statistics
intro - introduce single-site vs multi-site stats

## Available stats
*new table*
### Notes
#### Patterson’s f statistics
#### Y statistics
#### Trait correlations
#### Derived statistics
#### AFS

## Single site stats
### Interface
#### Mode
...
### General API
#### Polarization
...

## Multi site stats
todo

hyanwong, Jun 17 '21 11:06

Hey @hyanwong, it's about time these statistics got a top-down restructure -- thanks for taking this on. I have two comments:

  • Technically the IBD methods in tskit at the moment aren't returning statistics on the tree sequences, they're returning lists of IBD segments -- which can then be processed/aggregated in various ways to make statistics, but this isn't yet implemented. There are certainly lots of useful stats that might be of interest to tskit users, and most of them will be straightforward to code up -- perhaps it would be worth chatting about this with others at some point (@petrelharp)?

  • What is meant by 'multi-way' statistics, exactly? Does this mean the user has to supply several sets of sample nodes, instead of just one? Is this the most meaningful way to categorise the statistics?

gtsambos, Jun 18 '21 03:06

Thanks @gtsambos: I don't know much about the IBD interface. I was hoping you might write something about it from an analysis point of view. Peter and I have been discussing the "multi-way" thing in https://github.com/tskit-dev/tskit/pull/1499. See if you prefer the list-type layout there.

hyanwong, Jun 18 '21 08:06

> Technically the ibd methods in tskit at the moment aren't returning statistics on the tree sequences, they're returning lists of IBD segments -- which can then be processed/aggregated in various ways to make statistics, but this isn't yet implemented.

Good point. I think we can still refer to them here, though. And, yes, let's think about summary stats!

> What is meant by 'multi-way' statistics, exactly? Does this mean the user has to supply several sets of sample nodes, instead of just one? Is this the most meaningful way to be categorising the statistics?

Right - like Fst, for instance. It's a meaningful distinction because the interface is different - for multi-way stats you pass in an indexes argument to say which sample sets you're comparing.

petrelharp, Jun 18 '21 14:06

> ...this isn't yet implemented.

Good point. I think we can still refer to them here, though. And, yes, let's think about summary stats!

Please do!

> like Fst, for instance. It's a meaningful distinction because the interface is different - for multi-way stats you pass in an indexes argument to say which sample sets you're comparing.

I also think it's a meaningful philosophical distinction: are you comparing between groups or genomes, or within them?

hyanwong, Jun 18 '21 14:06

Also to note here - we should document the use of keep_intervals to calculate a single statistic for a given region (and note that this won't work for mode=site until https://github.com/tskit-dev/tskit/issues/287 is fixed).

hyanwong, Jul 08 '21 08:07

After discussion with @jeromekelleher we think that people coming in to the page might want to see stuff about modes, windows etc at a high level. That's especially the case if we implement windows etc in the multi-site stats, which I guess is the plan. So we should plan to move e.g.

### Interface

to

## Interface

and

#### Mode

to

### Mode

hyanwong, Dec 10 '21 12:12