sgkit
sgkit copied to clipboard
Ancestral vs. derived allele
Motivated by @hyanwong at https://github.com/pystatgen/sgkit/discussions/580
Some popgen methods need to know which allele at each site is the ancestral allele (cf. https://biology.stackexchange.com/questions/19159/ancestral-allele-explanation).
We should augment our data model to optionally store allele state.
A verbose approach would be to add a Boolean data variable named variant_allele_ancestral_state along the (variants, alleles) dimensions that would be True when the allele is ancestral and False otherwise.
Given that the alleles dimension has a fixed length which is often > 2, I suppose it’s a design decision to determine how to distinguish a derived allele from a null allele at a site with fewer alleles than the length of the alleles dimension.
This also makes me think we should rename the variants dimension to sites, but that can be addressed in a separate issue.
I'd also suggest variant_allele_ancestral_index as a 1D (variants,) array since that would be a very convenient form for fancy indexing into variant_alleles and would also be directly usable for computations on call_genotype (e.g. ancestral allele counts), and it eschews the ragged alleles problem assuming there is only ever 1 ancestral allele.
it eschews the ragged alleles problem assuming there is only ever 1 ancestral allele.
I think you can safely assume that for a given dataset, there is only ever 1 ancestral allele for a site (or it is unknown).
edit - A more sophisticated version could have a probability of each allele being the ancestral one, but I think that's needless complexity for most people.
I'd vote for a variant of @eric-czech's suggestion above: variant_allele_ancestral_state which is an index into the variant_alleles array. There's +/- for explicitly adding the _index suffix, but since users will need to be comfortable with working with indexes in the first place for call_genotype, I think we can leave it out.
I think this is addressed now in SGkit, isn't it @benjeffery ? So I can close this?
There is no explicit support, and nothing has been added to the data model as far as I'm aware. I think this stays open for now.