sgkit
`gwas_linear_regression` is sensitive to Dataset transposition
Hi, I am encountering an error when calling `gwas_linear_regression` on a transposed Dataset. Given that all the dimensions are labelled in a standardized way, it seems to me that this could be handled more gracefully. Below I reproduce the error using the data from the GWAS tutorial.

This works fine (just the tutorial, skipping QC for brevity):
```python
import sgkit as sg
from sgkit.io.vcf import vcf_to_zarr
import pandas as pd
from pathlib import Path
import requests

if not Path("1kg.vcf.bgz").exists():
    response = requests.get("https://storage.googleapis.com/sgkit-gwas-tutorial/1kg.vcf.bgz")
    with open("1kg.vcf.bgz", "wb") as f:
        f.write(response.content)

vcf_to_zarr("1kg.vcf.bgz", "1kg.zarr", max_alt_alleles=1,
            fields=["FORMAT/GT", "FORMAT/DP", "FORMAT/GQ", "FORMAT/AD"],
            field_defs={"FORMAT/AD": {"Number": "R"}})
ds = sg.load_dataset("1kg.zarr")

ANNOTATIONS_FILE = "https://storage.googleapis.com/sgkit-gwas-tutorial/1kg_annotations.txt"
df = pd.read_csv(ANNOTATIONS_FILE, sep="\t", index_col="Sample")
ds_annotations = pd.DataFrame.to_xarray(df).rename({"Sample": "samples"})
ds = ds.set_index({"samples": "sample_id"})
ds = ds.merge(ds_annotations, join="left")

ds["call_dosage"] = ds.call_genotype.sum(dim="ploidy")

ds_lr = sg.gwas_linear_regression(
    ds, dosage="call_dosage",
    add_intercept=True, covariates=[], traits=["CaffeineConsumption"])
```
This throws an error:
```python
import sgkit as sg
from sgkit.io.vcf import vcf_to_zarr
import pandas as pd
from pathlib import Path
import requests

if not Path("1kg.vcf.bgz").exists():
    response = requests.get("https://storage.googleapis.com/sgkit-gwas-tutorial/1kg.vcf.bgz")
    with open("1kg.vcf.bgz", "wb") as f:
        f.write(response.content)

vcf_to_zarr("1kg.vcf.bgz", "1kg.zarr", max_alt_alleles=1,
            fields=["FORMAT/GT", "FORMAT/DP", "FORMAT/GQ", "FORMAT/AD"],
            field_defs={"FORMAT/AD": {"Number": "R"}})
ds = sg.load_dataset("1kg.zarr")

ANNOTATIONS_FILE = "https://storage.googleapis.com/sgkit-gwas-tutorial/1kg_annotations.txt"
df = pd.read_csv(ANNOTATIONS_FILE, sep="\t", index_col="Sample")
ds_annotations = pd.DataFrame.to_xarray(df).rename({"Sample": "samples"})
ds = ds.set_index({"samples": "sample_id"})
ds = ds.merge(ds_annotations, join="left")

# ONLY ADDITION OVER BASELINE:
ds = ds.transpose("samples", "variants", ...)

ds["call_dosage"] = ds.call_genotype.sum(dim="ploidy")

ds_lr = sg.gwas_linear_regression(
    ds, dosage="call_dosage",
    add_intercept=True, covariates=[], traits=["CaffeineConsumption"])
```
Specifically, the error I encounter is:

```
ValueError: Chunks do not add up to shape. Got chunks=((1,), (10000, 879)), shape=(1, 284)
```

The error persists even if I call an intermediate `ds.load()`, so the problem is probably hit somewhere in the dask'ified linear regression calculation itself.
If I call `gwas_linear_regression` with `ds.transpose("variants", "samples", ...)`, everything works fine again.
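To illustrate why dimension order matters here, the core of the problem can be reproduced with plain xarray, independent of sgkit: transposing a Dataset reorders the underlying arrays, so any downstream code that consumes `.values` positionally (rather than by dimension name) sees the axes swapped. This is a minimal toy sketch, not sgkit's actual data:

```python
import numpy as np
import xarray as xr

# Toy dataset mimicking sgkit's (variants, samples) layout:
# 5 variants x 3 samples of dosage values.
ds = xr.Dataset(
    {"call_dosage": (("variants", "samples"), np.arange(15).reshape(5, 3))}
)

# Transposing changes the dimension order; the underlying array
# is now (samples, variants) = (3, 5).
ds_t = ds.transpose("samples", "variants")

# Code that grabs .values positionally now sees a (3, 5) array
# where it expects (5, 3), which is what trips up the chunked
# regression downstream.
assert ds.call_dosage.values.shape == (5, 3)
assert ds_t.call_dosage.values.shape == (3, 5)
```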
That is a somewhat obscure error all right, @clwgg - do you have suggestions on how this could be handled better?
Hmm... so the `Q`, `Y`, and `G` matrices go into the linear regression here, and it looks to me as though the only one that is not "dimension-safe" is the `G` matrix generated in `_get_loop_covariates` here. So I think just popping a `.transpose("variants", "samples", ...)` into `_get_loop_covariates` before returning `G` should do the trick, and everything should be transpose-safe.
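In rough outline, the proposed fix would normalize dimension order before handing the array to the positional regression code. This is a hypothetical sketch (the helper name and signature follow the discussion above, not sgkit's actual internals, which also extract covariates from genotype calls rather than a single variable):

```python
import numpy as np
import xarray as xr

def get_loop_covariates(ds: xr.Dataset, dosage: str) -> xr.DataArray:
    """Hypothetical sketch of the proposed fix: force (variants, samples)
    order regardless of how the caller transposed the Dataset."""
    G = ds[dosage]
    # .transpose is a no-op if the dimensions are already in order,
    # so the common path pays nothing.
    return G.transpose("variants", "samples")

# Works the same whether or not the Dataset was transposed:
ds = xr.Dataset(
    {"call_dosage": (("variants", "samples"), np.arange(6).reshape(2, 3))}
)
G1 = get_loop_covariates(ds, "call_dosage")
G2 = get_loop_covariates(ds.transpose("samples", "variants"), "call_dosage")
assert G1.dims == G2.dims == ("variants", "samples")
```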
I'd be happy to test this though and push a PR if you think this fix would be worthwhile.
It may not be worth having a full fix in there (it might get complicated in corner cases and lead to unexpected slow-downs, since I assume the transpose will be a big operation), but detecting the problem and throwing a useful error message would definitely be worthwhile.
I vaguely remember discussing trying to enforce dimension order at one point, and I do think it's hard to address universally, so I believe we decided against it for the time being (or at least that's what I came away with). I can almost see having the default behavior be to do the transpositions if necessary and throw something like a `PerformanceWarning`, but the need to move dimensions around has come up very rarely, so I don't see much value in that vs. throwing an error.
It's great to have this logged though, thanks @clwgg! There should definitely be a better error if nothing else, with a description of how to fix the problem. That might slot in nicely with the current Dataset validation functions (e.g. `association.py#L202`).
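As a rough idea of what such a validation check could look like, the sketch below raises a descriptive error instead of letting the chunk/shape mismatch surface deep inside dask. The function name and message wording are assumptions for illustration, not sgkit's actual validation API:

```python
import numpy as np
import xarray as xr

def check_dim_order(da: xr.DataArray, expected: tuple) -> None:
    """Hypothetical validator: fail fast with an actionable message
    if a variable's leading dimensions are not in the expected order."""
    if da.dims[: len(expected)] != expected:
        dims_str = ", ".join(repr(d) for d in expected)
        raise ValueError(
            f"Variable {da.name!r} has dimensions {da.dims}, expected "
            f"leading dimensions {expected}. Call "
            f"ds.transpose({dims_str}, ...) before running the "
            "association test."
        )

ds = xr.Dataset(
    {"call_dosage": (("variants", "samples"), np.zeros((2, 3)))}
)
# Passes silently for the expected layout:
check_dim_order(ds.call_dosage, ("variants", "samples"))
# Raises a ValueError with a fix suggestion for the transposed layout:
try:
    bad = ds.transpose("samples", "variants").call_dosage
    check_dim_order(bad, ("variants", "samples"))
    raised = False
except ValueError:
    raised = True
assert raised
```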
Did you see something advantageous in working with the transposed data @clwgg, as opposed to just doing operations by dimension name and leaving the underlying arrays in their required order?
Alright, that all makes sense, thanks for the replies! And yeah, I think including dimension order checks in the dataset validation would be great! @eric-czech no, no particular reason to work with transposed data; I stumbled across this pretty much by accident.
Great, sounds like we have a plan. PR very welcome @clwgg!