sgkit icon indicating copy to clipboard operation
sgkit copied to clipboard

Add function to compute regenie association statistics

Open eric-czech opened this issue 3 years ago • 1 comments

The current regenie implementation produces LOCO predictions with the shape: (contigs, samples, outcomes). It isn't immediately obvious in the documentation how to use this then since applying gwas_linear_regression to the results requires pairing variants to run an association test on with the appropriate subarray of shape (samples, outcomes) (i.e. for the held-out contig).

Thanks for pointing this out @EpiSlim.

eric-czech avatar Jun 02 '21 18:06 eric-czech

This is effectively what the function needs to do in order to rebuild a dataset with association test statistics computed on a per-contig basis (cc: @EpiSlim):

import xarray as xr
import sgkit as sg
import numpy as np

ds = (
    # Simulate a dataset with at least 2 contigs for LOCO
    sg.simulate_genotype_call_dataset(1000, 500, n_contig=2, missing_pct=None)
    # Count alternate alleles
    .assign(call_alternate_allele_count=lambda ds: ds.call_genotype.sum(dim='ploidy'))
    # Add fake covariates/traits
    .assign(sample_traits=lambda ds: (("samples", "outcomes"), np.random.rand(ds.dims['samples'], 10)))
    .assign(sample_covariates=lambda ds: (("samples", "covariates"), np.random.rand(ds.dims['samples'], 3)))
    # Get LOCO predictions for traits
    .pipe(sg.regenie, dosage='call_alternate_allele_count', covariates='sample_covariates', traits='sample_traits')
    .rename_dims(outcomes='traits')
)
ds.loco_prediction
# <xarray.DataArray 'loco_prediction' (contigs: 2, samples: 500, traits: 10)>
# dask.array<stack, shape=(2, 500, 10), dtype=float64, chunksize=(1, 50, 10), chunktype=numpy.ndarray>
# Dimensions without coordinates: contigs, samples, traits
# Attributes:
#     description:  Predictions from best meta ridge model omitting coefficients for variant blocks within individual contigs (LOCO approximation)
#     comment:      REGENIE's loco_prediction (contigs, samples, outcomes). LOCO predictions\nresulting from Stage 2 predictions ignoring effects for variant blocks on\nheld out contigs. This will be absent if the data provided does not contain\nat least 2 contigs.

# Loop through the contigs and get association statistics for the combination of
# variants within that contig and the LOCO predictions that apply to it, and then
# concatenate everything back together along the variants dimension
xr.concat([
        sg.gwas_linear_regression(
            ds
            # Subset to only the variants for this contig
            .pipe(lambda ds: ds.sel(variants=ds.variant_contig == contig))
            # Subtract the LOCO predictions for this contig from the original trait values and regress on those instead
            .assign(sample_traits=lambda ds: ds.sample_traits - ds.loco_prediction.isel(contigs=contig))
            , 
            dosage='call_alternate_allele_count', 
            covariates='sample_covariates', 
            traits='sample_traits'
        )
        for contig in ds.variant_contig.to_series().sort_values().drop_duplicates()
    ], 
    dim='variants'
).compute()
# <xarray.Dataset>
# Dimensions:                      (alleles: 2, alphas: 5, blocks: 2, contigs: 2, covariates: 3, ploidy: 2, samples: 500, traits: 10, variants: 1000)
# Dimensions without coordinates: alleles, alphas, blocks, contigs, covariates, ploidy, samples, traits, variants
# Data variables:
#     variant_beta                 (variants, traits) float64 -0.005385 -0.03918 ... -0.015 -0.007268
#     variant_t_value              (variants, traits) float64 -0.2813 -2.038 ... -0.8205 -0.4038
#     variant_p_value              (variants, traits) float64 0.7786 0.0421 0.3322 ... 0.4123 0.6865
#     base_prediction              (variants, blocks, alphas, samples, traits) float64 0.5564 ... 0...
#     meta_prediction              (variants, samples, traits) float64 0.5419 0.4376 ... 0.4648 0.4819
#     loco_prediction              (variants, contigs, samples, traits) float64 0.5268 ... 0.2042
#     variant_contig               (variants) int64 0 0 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1 1 1 1
#     variant_position             (variants) int64 0 1 2 3 4 5 6 7 ... 493 494 495 496 497 498 499
#     variant_allele               (variants, alleles) |S1 b'A' b'G' b'G' b'G' ... b'G' b'T' b'G' b'T'
#     sample_id                    (variants, samples) <U4 'S0' 'S1' 'S2' ... 'S497' 'S498' 'S499'
#     call_genotype                (variants, samples, ploidy) int8 0 0 1 0 1 0 0 1 ... 0 1 1 1 0 1 0
#     call_genotype_mask           (variants, samples, ploidy) bool False False False ... False False
#     call_alternate_allele_count  (variants, samples) int64 0 1 1 1 1 1 0 1 2 0 ... 1 1 1 1 1 0 2 1 1
#     sample_traits                (variants, samples, traits) float64 0.1144 -0.2051 ... 0.4461
#     sample_covariates            (variants, samples, covariates) float64 0.273 0.1751 ... 0.5335
# Attributes:
#     contigs:  [0, 1]

There are two pretty big warts here:

  1. We need a function to count alternate alleles: https://github.com/pystatgen/sgkit/issues/282
  2. The dimension name for phenotypes needs to be unified: https://github.com/pystatgen/sgkit/issues/597

eric-czech avatar Jun 03 '21 11:06 eric-czech