sgkit icon indicating copy to clipboard operation
sgkit copied to clipboard

Ancestral vs. derived allele

Open hammer opened this issue 4 years ago • 5 comments

Motivated by @hyanwong at https://github.com/pystatgen/sgkit/discussions/580

Some popgen methods need to know which allele at each site is the ancestral allele (cf. https://biology.stackexchange.com/questions/19159/ancestral-allele-explanation).

We should augment our data model to optionally store allele state.

A verbose approach would be to add a Boolean data variable named variant_allele_ancestral_state along the (variants, alleles) dimensions that would be True when the allele is ancestral and False otherwise.

Given that the alleles dimension has a fixed length which is often > 2, I suppose it’s a design decision to determine how to distinguish a derived allele from a null allele at a site with fewer alleles than the length of the alleles dimension.

This also makes me think we should rename the variants dimension to sites, but that can be addressed in a separate issue.

hammer avatar May 27 '21 03:05 hammer

I'd also suggest variant_allele_ancestral_index as a 1D (variants,) array since that would be a very convenient form for fancy indexing into variant_alleles and would also be directly usable for computations on call_genotype (e.g. ancestral allele counts), and it eschews the ragged alleles problem assuming there is only ever 1 ancestral allele.

eric-czech avatar May 27 '21 10:05 eric-czech

it eschews the ragged alleles problem assuming there is only ever 1 ancestral allele.

I think you can safely assume that for a given dataset, there is only ever 1 ancestral allele for a site (or it is unknown).

edit - A more sophisticated version could have a probability of each allele being the ancestral one, but I think that's needless complexity for most people.

hyanwong avatar May 27 '21 10:05 hyanwong

I'd vote for a variant of @eric-czech's suggestion above: variant_allele_ancestral_state which is an index into the variant_alleles array. There's +/- for explicitly adding the _index suffix, but since users will need to be comfortable with working with indexes in the first place for call_genotype, I think we can leave it out.

jeromekelleher avatar Jun 01 '21 18:06 jeromekelleher

I think this is addressed now in SGkit, isn't it @benjeffery ? So I can close this?

hyanwong avatar Jul 03 '23 16:07 hyanwong

There is no explicit support, and nothing has been added to the data model as far as I'm aware. I think this stays open for now.

benjeffery avatar Jul 04 '23 10:07 benjeffery