sgkit
Consider adding mean imputation function
For association testing and PCA (at least), it would be useful to have a function that mean-imputes dosages or allele counts. With floating-point values (e.g. from bgen), this is already simple for a user: ds.call_genotype_probability.fillna(ds.call_genotype_probability.mean(dim="samples")). With alternate allele counts that use a sentinel integer for missing values, it is a little more complicated, because the sentinel has to be masked out before the mean is taken. The best way to do this would be to expose a function like this one publicly: https://github.com/pystatgen/sgkit/blob/20f4992bf1e4c1e09152ad930afd859cd012281d/sgkit/stats/pc_relate.py#L16-L22
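To make the difference between the two cases concrete, here is a minimal sketch using plain xarray on toy arrays. The array names and the choice of -1 as the missing-value sentinel are assumptions for illustration, not taken from the linked code.

```python
import numpy as np
import xarray as xr

# Floating-point case: missing values are NaN, so fillna with the
# per-variant mean (mean over samples, NaN skipped) is enough.
probs = xr.DataArray(
    np.array([[0.0, 1.0, np.nan], [2.0, np.nan, 2.0]]),
    dims=("variants", "samples"),
)
probs_imputed = probs.fillna(probs.mean(dim="samples"))

# Integer case: missing values are a sentinel (assumed to be -1 here).
counts = xr.DataArray(
    np.array([[0, 1, -1], [2, -1, 2]]),
    dims=("variants", "samples"),
)

# A naive mean includes the sentinel and is biased.
naive_mean = counts.mean(dim="samples")

# Masking first turns sentinels into NaN (and casts to float),
# after which the floating-point recipe applies.
observed = counts.where(counts >= 0)
counts_imputed = observed.fillna(observed.mean(dim="samples"))
```

A public function would mainly wrap this masking step so users do not have to handle the sentinel themselves.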
I think a good signature for this would be:
def mean_impute(ds: Dataset, variable: str, dim: str, merge: bool) -> Dataset
This is an unusual use case since the resulting variable should probably be something like f"{variable}_imputed" by default. Setting merge=False and doing the rename manually would be one escape hatch for working with a different naming convention.
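As a rough sketch of how the merge flag and the default naming could behave, assuming a boolean f"{variable}_mask" variable that is True where values are missing. The default merge=True, the toy variable name call_dosage, and the rename target are hypothetical choices for illustration, not a settled API.

```python
import numpy as np
import xarray as xr


def mean_impute(
    ds: xr.Dataset, variable: str, dim: str, merge: bool = True
) -> xr.Dataset:
    # Assumption: f"{variable}_mask" is True where the value is missing.
    unmasked = ~ds[f"{variable}_mask"]
    # Masked entries become NaN so they are skipped by the mean.
    observed = ds[variable].where(unmasked)
    imputed = observed.fillna(observed.mean(dim=dim))
    new_ds = xr.Dataset({f"{variable}_imputed": imputed})
    # merge=True returns the input plus the new variable;
    # merge=False returns only the new variable.
    return ds.merge(new_ds) if merge else new_ds


# Toy dataset with a hypothetical variable name and -1 as the sentinel.
values = np.array([[0, 1, -1], [2, -1, 2]])
toy = xr.Dataset(
    {
        "call_dosage": (("variants", "samples"), values),
        "call_dosage_mask": (("variants", "samples"), values < 0),
    }
)

merged = mean_impute(toy, "call_dosage", "samples")

# Escape hatch: skip the merge, rename, then merge manually.
renamed = mean_impute(toy, "call_dosage", "samples", merge=False).rename(
    {"call_dosage_imputed": "dosage_filled"}
)
custom = toy.merge(renamed)
```

With merge=False the caller gets only the imputed variable back, so a different naming convention costs one rename and one merge.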
For reference, this is a workaround for any users that need this in the meantime:
def mean_impute(ds, variable, dim):
    unmasked = ~ds[f"{variable}_mask"]
    return ds.assign(**{
        f"{variable}_imputed": ds[variable].where(
            unmasked,
            ds[variable].where(unmasked).mean(dim=dim)
        )})
# To mean impute values for variants:
ds = mean_impute(ds, 'call_genotype_probability', 'samples')