sgkit icon indicating copy to clipboard operation
sgkit copied to clipboard

PC-Relate and Bi-allelic sites

Open tristanpwdennis opened this issue 1 year ago • 3 comments

Hi there, I'm playing around with PC-Relate estimation.

I'm using the malariagen_data package to load some data, then using sgkit to remove multiallelic/singleton/invariant/nonsegregating sites, perform a PCA, and run PC-Relate.

Everything up to and including PCA works fine. When I attempt to call PC-Relate (as per the docs), I get the following error:

ValueError: PC Relate only works for biallelic genotypes

In the traceback this is a result of the data failing the check:

"alleles" in ds.dims and ds.dims["alleles"] != 2:

This makes sense - the malariagen_data are called across all 4 possible alleles. So, even though I have filtered the data to only include biallelic sites, the value in the 'alleles' dimension for the datastore is still '4'. Is there a way to reset the 'alleles' attribute, or to disable this check so that PC-Relate can run?

In this I suppose there is a more general question around how to handle data that are multiallelic (or have the option to be multiallelic). I suppose it could be possible to write a function to check that the filtering has worked (so for each site it checks to make sure nalleles = 2 in each or something), and then reset the 'alleles' dimension to two if the check is passed. This seems quite cumbersome and perhaps there is a more elegant way of implementing this. But I am quite inexperienced in development myself and relatively new to sgkit/scikit allel/malariagen_data.

Apologies if I am missing something! Thanks for your assistance and the library.

Tristan

tristanpwdennis avatar Jul 26 '23 10:07 tristanpwdennis