
Consider adding mean imputation function

Open eric-czech opened this issue 4 years ago • 0 comments

For association testing and PCA (at least), it may be useful to have a function that imputes dosages/allele counts. With floating point values (e.g. from bgen), this is simple for a user to do directly, e.g. ds.call_genotype_probability.fillna(ds.call_genotype_probability.mean(dim="samples")). With alternate allele counts that use a sentinel integer for missing values, it is a little more complicated. The best way to support this would be to publicly expose a function like this one: https://github.com/pystatgen/sgkit/blob/20f4992bf1e4c1e09152ad930afd859cd012281d/sgkit/stats/pc_relate.py#L16-L22
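To make the sentinel case concrete, here is a minimal sketch using plain xarray. The toy array, the name alt_count, and the choice of -1 as the missing-value sentinel are assumptions for illustration only; this is not the linked pc_relate helper.

```python
import numpy as np
import xarray as xr

# Toy alternate allele counts with dims (variants, samples).
# Assumption for this sketch: -1 is the sentinel for a missing call.
alt_count = xr.DataArray(
    np.array([[0, 1, -1, 2], [2, -1, -1, 0]]),
    dims=("variants", "samples"),
)

# Turn sentinels into NaN; this also promotes the integers to float.
observed = alt_count.where(alt_count >= 0)

# Per-variant mean over the observed samples (NaNs are skipped by default).
variant_mean = observed.mean(dim="samples")

# Fill each missing call with the mean of its own variant.
imputed = observed.fillna(variant_mean)
```

The extra step compared to the floating point case is the conversion from sentinel to NaN, which is also why the result is necessarily a float array rather than an integer one.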

I think a good signature for this would be:

def mean_impute(ds: Dataset, variable: str, dim: str, merge: bool) -> Dataset

This is an unusual case because the output is a transformed copy of an existing variable, so the resulting variable should probably be named something like f"{variable}_imputed" by default. Setting merge=False and doing the rename manually would be one escape hatch for working with a different naming convention.
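A rough sketch of how that signature and the merge escape hatch might behave is below. Nothing here is existing sgkit API: the body, the merge=True default, the call_dosage variable, and the dosage_filled name are all assumptions, and the sketch relies on a f"{variable}_mask" variable being present in the dataset.

```python
import numpy as np
import xarray as xr


def mean_impute(ds: xr.Dataset, variable: str, dim: str, merge: bool = True) -> xr.Dataset:
    """Hypothetical sketch of the proposed function (not part of sgkit)."""
    # True where a value is present; assumes f"{variable}_mask" exists in ds.
    unmasked = ~ds[f"{variable}_mask"]
    observed = ds[variable].where(unmasked)
    imputed = observed.fillna(observed.mean(dim=dim))
    new_ds = xr.Dataset({f"{variable}_imputed": imputed})
    # merge=True returns the input plus the new variable, otherwise only the new variable.
    return ds.merge(new_ds) if merge else new_ds


# Toy dataset: two variants, three samples, one missing dosage per variant.
ds = xr.Dataset(
    {
        "call_dosage": (("variants", "samples"), np.array([[0.0, 2.0, np.nan], [1.0, np.nan, 0.0]])),
        "call_dosage_mask": (("variants", "samples"), np.array([[False, False, True], [False, True, False]])),
    }
)

# Default: the imputed variable is added alongside the originals.
merged = mean_impute(ds, "call_dosage", "samples")

# Escape hatch: take only the new variable and rename it by hand.
only = mean_impute(ds, "call_dosage", "samples", merge=False).rename(
    {"call_dosage_imputed": "dosage_filled"}
)
```

Returning only the new variable when merge=False keeps the rename unambiguous, since the caller never has to pick the imputed variable out of a larger dataset.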


For reference, this is a workaround for any users that need this in the meantime:

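def mean_impute(ds, variable, dim):
    # True where a value is present; requires a f"{variable}_mask" variable in ds
    unmasked = ~ds[f"{variable}_mask"]
    # Mean over `dim`, computed from unmasked values only
    mean = ds[variable].where(unmasked).mean(dim=dim)
    # Keep unmasked values and substitute the mean everywhere else
    return ds.assign(**{f"{variable}_imputed": ds[variable].where(unmasked, mean)})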
    
# To mean impute values for variants:
ds = mean_impute(ds, 'call_genotype_probability', 'samples')

eric-czech, Jun 15 '21 12:06