
sgkit.io.vcf.vcf_to_zarr() fails to convert VCFs with INFO/CSQ and other unbounded annotations

Open tnguyengel opened this issue 2 years ago • 11 comments

sgkit.io.vcf.vcf_to_zarr() fails to convert VCFs with INFO/CSQ annotations with error:

ValueError: INFO field 'CSQ' is defined as Number '.', which is not supported.

as tested on sgkit v0.6.0.

Presumably, the method will also fail for any VCFs containing annotations with unbounded size. INFO/CSQ contains variant effect predictions from VEP. There can be multiple predictions for each allele, one for every transcript that an allele overlaps. Each prediction is separated by a comma. The number of predictions per allele is not known in advance, and so the INFO/CSQ field is defined with unbounded size in the header, or "Number=.":

For example:

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|...>
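To make the structure concrete, a raw CSQ value can be split into per-transcript records like this (a minimal sketch; the field names below are an illustrative subset, the real order comes from the `Format: ...` part of your CSQ header line):

```python
# Minimal sketch: split one INFO/CSQ value into per-transcript records.
# The field names here are assumptions; read the actual order from the
# "Format: ..." portion of the CSQ header in your VCF.
csq_fields = ["Allele", "Consequence", "Feature"]

def parse_csq(raw):
    """Return one dict per transcript prediction in a CSQ value."""
    records = []
    for entry in raw.split(","):   # one entry per overlapping transcript
        values = entry.split("|")  # pipe-separated fields within an entry
        records.append(dict(zip(csq_fields, values)))
    return records

records = parse_csq("A|missense_variant|ENST0001,A|intron_variant|ENST0002")
# two transcript predictions for the same allele
```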

It would be very useful to be able to filter a zarr for variants that are deemed clinically relevant according to annotation, such as loss of function variants.

Do you suggest any workarounds in the meantime?

tnguyengel avatar Apr 05 '23 20:04 tnguyengel

Hi @tnguyengel, I'm not familiar with INFO/CSQ annotations, but you can specify a maximum bound for a VCF field using the field_defs parameter.

from sgkit.io.vcf import vcf_to_zarr

vcf_to_zarr(
    "example.vcf.gz",
    "example.zarr",
    fields=["INFO/CSQ"],
    field_defs={"INFO/CSQ": {"Number": 100}},  # or whatever a reasonable maximum is
)

Essentially, the vcf_to_zarr function needs to know the maximum size of each field so that it can allocate space. If the field is unbounded ('.'), then you will need to manually provide an upper bound.

timothymillar avatar Apr 05 '23 22:04 timothymillar

Thanks! That sounds like a quick workaround!

From brief googling, a human gene has ~4 transcripts on average, but can have up to ~1000 transcripts. If this is correct, in order to ensure that the INFO/CSQ field is not truncated in the zarr, I would need to set:

field_defs={"INFO/CSQ": {"Number": 1000}}

Assuming that a typical variant allele does indeed have ~4 INFO/CSQ entries, but can have up to ~1000 INFO/CSQ entries, would I run the risk of bloating the zarr file size due to the INFO/CSQ field? Or does sgkit use sparse representation to handle fields with many zero/null/empty elements?

tnguyengel avatar Apr 06 '23 08:04 tnguyengel

Good questions @tnguyengel! Any thoughts @tomwhite?

jeromekelleher avatar Apr 06 '23 08:04 jeromekelleher

This is a case where fixed-size arrays struggle, I'm afraid. The problem is that in Zarr you can have ragged arrays, or variable-length strings, but not both.

There's more discussion about this in #634 and the linked xarray issue at https://github.com/pydata/xarray/issues/4285 (which seems to have some newer discussion). The good news is that there are lots of people who want to solve this issue.

In the meantime, you can either store just the first n entries (which is lossy), or store all of them padded to the maximum (which wastes space).
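To see why padding to the maximum can hurt, here is a rough back-of-envelope calculation using the numbers mentioned in this thread (the variant count and 50-byte average entry length are assumptions, and compression would reduce the on-disk gap):

```python
# Rough back-of-envelope: space for a fixed-size array padded to the
# maximum Number versus the typical entry count. Illustrative numbers only.
n_variants = 1_000_000
typical_entries = 4     # ~4 transcripts per gene on average
max_entries = 1000      # worst case from the thread
bytes_per_entry = 50    # assumed average CSQ entry length in bytes

padded = n_variants * max_entries * bytes_per_entry
actual = n_variants * typical_entries * bytes_per_entry
print(f"padded ~ {padded / 1e9:.0f} GB, actual ~ {actual / 1e9:.1f} GB")
```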

Another trick is to use the undocumented zarr_array_sizes function to find the maximum number of entries in a VCF. This means the VCF is parsed twice, but that may be an acceptable cost for finding the exact maximum number of transcripts in a particular VCF:

from sgkit.io.vcf import vcf_to_zarr
from sgkit.io.vcf.vcf_reader import zarr_array_sizes

kwargs = zarr_array_sizes(path)   # first pass: measure maximum field sizes
vcf_to_zarr(path, ..., **kwargs)  # second pass: convert using those sizes

tomwhite avatar Apr 06 '23 10:04 tomwhite

Looks like we could improve the error message here?

benjeffery avatar Apr 06 '23 11:04 benjeffery

> Looks like we could improve the error message here?

I've opened #1064 for this.

Also, @benjeffery reminded me that the zarr_array_sizes function does not run in parallel, so it will be slow on large datasets. There is an open issue to remedy that here: #734

tomwhite avatar Apr 18 '23 09:04 tomwhite

@tnguyengel I noticed that VEP allows you to export annotations as a JSON file, so as another way to approach the problem I wondered if you could use that to generate a list of variant IDs (using Pandas, or DuckDB, or ...), then use that list to subset the sgkit dataset?
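A sketch of that JSON route, assuming VEP's JSON output format (one JSON object per line, with "id" and "transcript_consequences" keys; check the key names against your actual output):

```python
# Sketch: collect the IDs of variants whose VEP JSON record contains a
# consequence of interest. Key names are assumptions about VEP's JSON
# output format; adjust to match your file.
import json

def ids_with_consequence(lines, wanted):
    """Return IDs of variants with at least one matching consequence term."""
    ids = []
    for line in lines:
        record = json.loads(line)
        for tc in record.get("transcript_consequences", []):
            if wanted in tc.get("consequence_terms", []):
                ids.append(record["id"])
                break  # one match is enough for this variant
    return ids

# toy input standing in for two lines of VEP JSON output
lines = [
    '{"id": "rs6054257", "transcript_consequences": [{"consequence_terms": ["stop_gained"]}]}',
    '{"id": "rs6040355", "transcript_consequences": [{"consequence_terms": ["intron_variant"]}]}',
]
lof_ids = ids_with_consequence(lines, "stop_gained")
```

The resulting ID list can then be used to subset the sgkit dataset as shown in the filtering snippet later in the thread.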

tomwhite avatar Apr 24 '23 16:04 tomwhite

BTW to filter the dataset, you can do something like:

variant_ids = ["rs6054257", "rs6040355"]
# boolean mask over the variants dimension
ds_filtered = ds.isel(variants=ds.variant_id.isin(variant_ids))

tomwhite avatar Apr 25 '23 08:04 tomwhite

> @tnguyengel I noticed that VEP allows you to export annotations as a JSON file, so as another way to approach the problem I wondered if you could use that to generate a list of variant IDs (using Pandas, or DuckDB, or ...), then use that list to subset the sgkit dataset?

Thanks! The annotations were run a while ago and only output to VCF. JSON is admittedly a much easier format to query than VCF. At the moment, we're playing around with split-vep output. If it's not too computationally intensive, reannotating to JSON might be more attractive.

We'll give the zarr_array_sizes approach a shot too!

tnguyengel avatar Apr 25 '23 11:04 tnguyengel

> We'll give the zarr_array_sizes approach a shot too!

The parallel version of zarr_array_sizes has just been merged so you might like to try that by installing from the GitHub main branch.

tomwhite avatar Apr 25 '23 12:04 tomwhite

@tnguyengel Adding a username ping here in case the last message didn't reach your notifications.

benjeffery avatar Apr 28 '23 13:04 benjeffery