sgkit
sgkit copied to clipboard
Add function to compute regenie association statistics
The current regenie implementation produces LOCO predictions with the shape: (contigs, samples, outcomes)
. It isn't immediately obvious in the documentation how to use this then since applying gwas_linear_regression
to the results requires pairing variants to run an association test on with the appropriate subarray of shape (samples, outcomes)
(i.e. for the held-out contig).
Thanks for pointing this out @EpiSlim.
This is effectively what the function needs to do in order to rebuild a dataset with association test statistics computed on a per-contig basis (cc: @EpiSlim):
import xarray as xr
import sgkit as sg
import numpy as np
ds = (
# Simulate a dataset with at least 2 contigs for LOCO
sg.simulate_genotype_call_dataset(1000, 500, n_contig=2, missing_pct=None)
# Count alternate alleles
.assign(call_alternate_allele_count=lambda ds: ds.call_genotype.sum(dim='ploidy'))
# Add fake covariates/traits
.assign(sample_traits=lambda ds: (("samples", "outcomes"), np.random.rand(ds.dims['samples'], 10)))
.assign(sample_covariates=lambda ds: (("samples", "covariates"), np.random.rand(ds.dims['samples'], 3)))
# Get LOCO predictions for traits
.pipe(sg.regenie, dosage='call_alternate_allele_count', covariates='sample_covariates', traits='sample_traits')
.rename_dims(outcomes='traits')
)
ds.loco_prediction
# <xarray.DataArray 'loco_prediction' (contigs: 2, samples: 500, traits: 10)>
# dask.array<stack, shape=(2, 500, 10), dtype=float64, chunksize=(1, 50, 10), chunktype=numpy.ndarray>
# Dimensions without coordinates: contigs, samples, traits
# Attributes:
# description: Predictions from best meta ridge model omitting coefficients for variant blocks within individual contigs (LOCO approximation)
# comment: REGENIE's loco_prediction (contigs, samples, outcomes). LOCO predictions\nresulting from Stage 2 predictions ignoring effects for variant blocks on\nheld out contigs. This will be absent if the data provided does not contain\nat least 2 contigs.
# Loop through the contigs and get association statistics for the combination of
# variants within that contig and the LOCO predictions that apply to it, and then
# concatenate everything back together along the variants dimension
xr.concat([
sg.gwas_linear_regression(
ds
# Subset to only the variants for this contig
.pipe(lambda ds: ds.sel(variants=ds.variant_contig == contig))
# Subtract the LOCO predictions for this contig from the original trait values and regress on those instead
.assign(sample_traits=lambda ds: ds.sample_traits - ds.loco_prediction.isel(contigs=contig))
,
dosage='call_alternate_allele_count',
covariates='sample_covariates',
traits='sample_traits'
)
for contig in ds.variant_contig.to_series().sort_values().drop_duplicates()
],
dim='variants'
).compute()
# <xarray.Dataset>
# Dimensions: (alleles: 2, alphas: 5, blocks: 2, contigs: 2, covariates: 3, ploidy: 2, samples: 500, traits: 10, variants: 1000)
# Dimensions without coordinates: alleles, alphas, blocks, contigs, covariates, ploidy, samples, traits, variants
# Data variables:
# variant_beta (variants, traits) float64 -0.005385 -0.03918 ... -0.015 -0.007268
# variant_t_value (variants, traits) float64 -0.2813 -2.038 ... -0.8205 -0.4038
# variant_p_value (variants, traits) float64 0.7786 0.0421 0.3322 ... 0.4123 0.6865
# base_prediction (variants, blocks, alphas, samples, traits) float64 0.5564 ... 0...
# meta_prediction (variants, samples, traits) float64 0.5419 0.4376 ... 0.4648 0.4819
# loco_prediction (variants, contigs, samples, traits) float64 0.5268 ... 0.2042
# variant_contig (variants) int64 0 0 0 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1 1 1 1
# variant_position (variants) int64 0 1 2 3 4 5 6 7 ... 493 494 495 496 497 498 499
# variant_allele (variants, alleles) |S1 b'A' b'G' b'G' b'G' ... b'G' b'T' b'G' b'T'
# sample_id (variants, samples) <U4 'S0' 'S1' 'S2' ... 'S497' 'S498' 'S499'
# call_genotype (variants, samples, ploidy) int8 0 0 1 0 1 0 0 1 ... 0 1 1 1 0 1 0
# call_genotype_mask (variants, samples, ploidy) bool False False False ... False False
# call_alternate_allele_count (variants, samples) int64 0 1 1 1 1 1 0 1 2 0 ... 1 1 1 1 1 0 2 1 1
# sample_traits (variants, samples, traits) float64 0.1144 -0.2051 ... 0.4461
# sample_covariates (variants, samples, covariates) float64 0.273 0.1751 ... 0.5335
# Attributes:
# contigs: [0, 1]
There are two pretty big warts here:
- We need a function to count alternate alleles: https://github.com/pystatgen/sgkit/issues/282
- The dimension name for phenotypes needs to be unified: https://github.com/pystatgen/sgkit/issues/597