
Add two-bit encoding for generate ancestors

Open jeromekelleher opened this issue 2 years ago • 4 comments

Following up on #809, add a two-bit genotype encoding that'll support missing data and three alleles.

Probably not a priority for the moment since most (phased) datasets that are sufficiently large to need this seem to be strictly biallelic.
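To make the idea concrete, here is a minimal sketch of what a two-bit packing could look like: four genotypes per byte, with codes 0 to 2 for alleles and code 3 reserved for missing data. This is an illustration only, not tsinfer's implementation; the function names, the `-1` missing sentinel, and the bit layout (lowest two bits first) are all assumptions made for the example.

```python
import numpy as np

MISSING = -1        # assumed sentinel for missing data in the unpacked array
_MISSING_CODE = 3   # assumed 2-bit code reserved for missing data


def pack_two_bit(genotypes):
    """Pack genotypes in {0, 1, 2, MISSING} into 2 bits each (4 per byte)."""
    g = np.asarray(genotypes, dtype=np.int8)
    if np.any((g < -1) | (g > 2)):
        raise ValueError("genotypes must be 0, 1, 2 or MISSING (-1)")
    codes = np.where(g == MISSING, _MISSING_CODE, g).astype(np.uint8)
    # Pad with zeros so the length is a multiple of 4.
    pad = (-len(codes)) % 4
    codes = np.concatenate([codes, np.zeros(pad, dtype=np.uint8)])
    quads = codes.reshape(-1, 4)
    # First genotype of each group goes in the lowest two bits.
    packed = (
        quads[:, 0]
        | (quads[:, 1] << 2)
        | (quads[:, 2] << 4)
        | (quads[:, 3] << 6)
    )
    return packed.astype(np.uint8)


def unpack_two_bit(packed, num_samples):
    """Inverse of pack_two_bit; num_samples trims the zero padding."""
    p = np.asarray(packed, dtype=np.uint8)
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (p[:, None] >> shifts) & 3
    g = codes.reshape(-1)[:num_samples].astype(np.int8)
    g[g == _MISSING_CODE] = MISSING
    return g
```

With this layout, `pack_two_bit([0, 1, 2, -1, 1, 0])` uses two bytes for six genotypes, and `unpack_two_bit` needs the original sample count because the padding is indistinguishable from allele 0.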

jeromekelleher avatar Mar 28 '23 14:03 jeromekelleher

Great. Note that UKB has a significant fraction of sites with three or more alleles. Our current plan was just to filter them.

benjeffery avatar Mar 29 '23 12:03 benjeffery

Also, there was some talk about re-imposing the missing sites over the top of the phased datasets. But that's way down the line.

hyanwong avatar Mar 29 '23 12:03 hyanwong

I'm not sure the ancestor generator supports multiallelic sites anyway, so I guess it's just missing data we need to consider for now.

jeromekelleher avatar Mar 29 '23 12:03 jeromekelleher

Yep, we have zero missingness in both datasets.

benjeffery avatar Mar 29 '23 12:03 benjeffery