sgkit icon indicating copy to clipboard operation
sgkit copied to clipboard

`gwas_linear_regression` is sensitive to Dataset transposition

Open clwgg opened this issue 2 years ago • 6 comments

Hi, I am encountering an error when calling gwas_linear_regression on a transposed Dataset. Given that all the dimensions are labelled in a standardized way, it seems to me that this could be handled more gracefully. Below I reproduce the error using the data from the GWAS tutorial:

This works fine (just the tutorial, skipping QC for brevity):

import sgkit as sg
from sgkit.io.vcf import vcf_to_zarr
import pandas as pd
from pathlib import Path
import requests

if not Path("1kg.vcf.bgz").exists():
    response = requests.get("https://storage.googleapis.com/sgkit-gwas-tutorial/1kg.vcf.bgz")
    with open("1kg.vcf.bgz", "wb") as f:
        f.write(response.content)

vcf_to_zarr("1kg.vcf.bgz", "1kg.zarr", max_alt_alleles=1,
            fields=["FORMAT/GT", "FORMAT/DP", "FORMAT/GQ", "FORMAT/AD"],
            field_defs={"FORMAT/AD": {"Number": "R"}})

ds = sg.load_dataset("1kg.zarr")

ANNOTATIONS_FILE = "https://storage.googleapis.com/sgkit-gwas-tutorial/1kg_annotations.txt"
df = pd.read_csv(ANNOTATIONS_FILE, sep="\t", index_col="Sample")
ds_annotations = pd.DataFrame.to_xarray(df).rename({"Sample":"samples"})
ds = ds.set_index({"samples": "sample_id"})
ds = ds.merge(ds_annotations, join="left")

ds["call_dosage"] = ds.call_genotype.sum(dim="ploidy")
ds_lr = sg.gwas_linear_regression(
    ds, dosage="call_dosage",
    add_intercept=True, covariates=[], traits=["CaffeineConsumption"])

This throws an error:

import sgkit as sg
from sgkit.io.vcf import vcf_to_zarr
import pandas as pd
from pathlib import Path
import requests

if not Path("1kg.vcf.bgz").exists():
    response = requests.get("https://storage.googleapis.com/sgkit-gwas-tutorial/1kg.vcf.bgz")
    with open("1kg.vcf.bgz", "wb") as f:
        f.write(response.content)

vcf_to_zarr("1kg.vcf.bgz", "1kg.zarr", max_alt_alleles=1,
            fields=["FORMAT/GT", "FORMAT/DP", "FORMAT/GQ", "FORMAT/AD"],
            field_defs={"FORMAT/AD": {"Number": "R"}})

ds = sg.load_dataset("1kg.zarr")

ANNOTATIONS_FILE = "https://storage.googleapis.com/sgkit-gwas-tutorial/1kg_annotations.txt"
df = pd.read_csv(ANNOTATIONS_FILE, sep="\t", index_col="Sample")
ds_annotations = pd.DataFrame.to_xarray(df).rename({"Sample":"samples"})
ds = ds.set_index({"samples": "sample_id"})
ds = ds.merge(ds_annotations, join="left")

# ONLY ADDITION OVER BASELINE:
ds = ds.transpose("samples", "variants", ...)

ds["call_dosage"] = ds.call_genotype.sum(dim="ploidy")
ds_lr = sg.gwas_linear_regression(
    ds, dosage="call_dosage",
    add_intercept=True, covariates=[], traits=["CaffeineConsumption"])

Specifically, the error I encounter is:

ValueError: Chunks do not add up to shape. Got chunks=((1,), (10000, 879)), shape=(1, 284)

This error persists also if I call an intermediate ds.load(), so the problem is encountered probably somewhere in the dask'ified linear regression calculation itself. If I call gwas_linear_regression with ds.transpose("variants", "samples", ...) everything works fine again.

clwgg avatar Jun 29 '22 18:06 clwgg

That is a somewhat obscure error all right @clwgg - do you have suggestions on how this could be handled better?

jeromekelleher avatar Jun 30 '22 15:06 jeromekelleher

hmm... so the Q, Y and G matrices go into the linear regression here, and it looks to me as though the only one not "dimension-safe" is the G matrix generated in _get_loop_covariates here. So I think just popping a .transpose("variants", "samples", ...) into _get_loop_covariates before returning G should do the trick and everything should be transpose-safe. I'd be happy to test this though and push a PR if you think this fix would be worthwhile.

clwgg avatar Jun 30 '22 17:06 clwgg

It may not be worth having a full fix in there (might get complicated in corner cases and lead to unexpected slow-downs, as I assume transpose will be a big operation), but detecting the problem and throwing a useful error message would definitely be useful.

jeromekelleher avatar Jul 01 '22 09:07 jeromekelleher

I vaguely remember discussing trying to enforce dimension order at one point and I do think it's hard to address universally, so I believe we decided against it for the time being (or at least that's what I came away with). I can almost see having the default behavior be to do the transpositions if necessary and throw something like a PerformanceWarning, but a need to move dimensions around has come up very rarely so I don't see much value in that vs. throwing an error.

It's great to have this logged though, thanks @clwgg! There should definitely be a better error if nothing else, with a description of how to fix the problem. That might slot in nicely with the current Dataset validation functions (e.g. association.py#L202).

Did you see something advantageous in working with the transposed data @clwgg, as opposed to just doing operations by dimension name and leaving the underlying arrays in their required order?

eric-czech avatar Jul 01 '22 12:07 eric-czech

alright, that all makes sense! thanks for the replies! And yeah, I think including dimension order checks in the dataset validation would be great! @eric-czech no, no particular reason to work with transposed data! I stumbled across this pretty much by accident.

clwgg avatar Jul 03 '22 15:07 clwgg

Great, sounds like we have a plan. PR very welcome @clwgg!

jeromekelleher avatar Jul 04 '22 08:07 jeromekelleher