
Windowing fails for datasets where contigs lack variant sites

Open percyfal opened this issue 1 year ago • 5 comments

The function sgkit.window_by_position fails if the input dataset has contigs that lack variant sites, raising an IndexError.

Traceback
IndexError                                Traceback (most recent call last)
Cell In[36], line 1
----> 1 ds = sgkit.window_by_position(ds, size=10)

File ~/dev/sgkit-dev/sgkit/sgkit/window.py:218, in window_by_position(ds, size, step, offset, variant_contig, variant_position, window_start_position, merge)
    214 positions = ds[variant_position].values
    215 window_start_positions = (
    216     ds[window_start_position].values if window_start_position is not None else None
    217 )
--> 218 return _window_per_contig(
    219     ds,
    220     variant_contig,
    221     merge,
    222     _get_windows_by_position,
    223     size,
    224     step,
    225     offset,
    226     positions,
    227     window_start_positions,
    228 )

File ~/dev/sgkit-dev/sgkit/sgkit/window.py:320, in _window_per_contig(ds, variant_contig, merge, windowing_fn, *args, **kwargs)
    318 contig_window_stops = []
    319 for i, contig in enumerate(get_contigs(ds)):
--> 320     starts, stops = windowing_fn(
    321         contig, contig_bounds[i], contig_bounds[i + 1], *args, **kwargs
    322     )
    323     contig_window_starts.append(starts)
    324     contig_window_stops.append(stops)

File ~/dev/sgkit-dev/sgkit/sgkit/window.py:429, in _get_windows_by_position(contig, start, stop, size, step, offset, positions, window_start_positions)
    426 contig_pos = positions[start:stop]
    427 if window_start_positions is None:
    428     # TODO: position is 1-based (or use offset?)
--> 429     pos_starts = np.arange(offset, contig_pos[-1] + offset, step=step)
    430     window_starts = np.searchsorted(contig_pos, pos_starts) + start
    431     window_stops = np.searchsorted(contig_pos, pos_starts + size) + start

IndexError: index -1 is out of bounds for axis 0 with size 0
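
The immediate cause is that _get_windows_by_position slices the global positions array per contig; for a contig with no variant sites the slice is empty, so contig_pos[-1] has nothing to index. Below is a minimal sketch of the failing expression with illustrative values, plus one possible guard (not sgkit's actual fix):

import numpy as np

positions = np.array([111, 112, 4370, 7330])  # all variant positions in the dataset
start, stop = 4, 4                            # bounds for a contig with no variants
contig_pos = positions[start:stop]            # empty array

offset, step, size = 1, 10, 10
if contig_pos.size == 0:
    # emit zero windows for an empty contig instead of indexing contig_pos[-1]
    window_starts = np.empty(0, dtype=np.int64)
    window_stops = np.empty(0, dtype=np.int64)
else:
    pos_starts = np.arange(offset, contig_pos[-1] + offset, step=step)
    window_starts = np.searchsorted(contig_pos, pos_starts) + start
    window_stops = np.searchsorted(contig_pos, pos_starts + size) + start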

This will often be the case for fragmented reference assemblies with many contigs. I provide a minimum working example below.

Minimum working example

The VCF below is a modified version of the file sgkit/tests/data/sample.vcf where I added the contig key to the header:

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##contig=<ID=19,length=1000>
##contig=<ID=20,length=40000>
##contig=<ID=21,length=10000>
##contig=<ID=X,length=1000>
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FILTER=<ID=q10,Description="Quality below 10">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##ALT=<ID=DEL:ME:ALU,Description="Deletion of ALU element">
##ALT=<ID=CNV,Description="Copy number variable region">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
19	111	.	A	C	9.6	.	.	GT:HQ	0|0:10,15	0|0:10,10	0/1:3,3
19	112	.	A	G	10	.	.	GT:HQ	0|0:10,10	0|0:10,10	0/1:3,3
20	4370	rs6054257	G	A	29	PASS	NS=3;DP=14;AF=0.5;DB;H2	GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.
20	7330	.	T	A	3	q10	NS=3;DP=11;AF=0.017	GT:GQ:DP:HQ	0|0:49:3:58,50	0|1:3:5:65,3	0/0:41:3:.,.
20	10696	rs6040355	A	G,T	67	PASS	NS=2;DP=10;AF=0.333,0.667;AA=T;DB	GT:GQ:DP:HQ	1|2:21:6:23,27	2|1:2:0:18,2	2/2:35:4:.,.
20	30237	.	T	.	47	PASS	NS=3;DP=13;AA=T	GT:GQ:DP:HQ	0|0:54:.:56,60	0|0:48:4:51,51	0/0:61:2:.,.
20	34567	microsat1	G	GA,GAC	50	PASS	NS=3;DP=9;AA=G;AN=6;AC=3,1	GT:GQ:DP	0/1:.:4	0/2:17:2	./.:40:3
20	35237	.	T	.	.	.	.	GT	0/0	0|0	./.
X	10	rsTest	AC	A,ATG,C	10	PASS	.	GT	0	0/1	0|2

Convert to Zarr

bgzip sample.extrachrom.vcf 
tabix sample.extrachrom.vcf.gz
vcf2zarr convert sample.extrachrom.vcf.gz sample.extrachrom.vcz

and run the code below to reproduce:

import sgkit
import numpy as np
import xarray as xr

ds = sgkit.load_dataset("sample.extrachrom.vcz")
ds["sample_cohort"] = xr.DataArray(np.full(ds.dims['samples'], 0), dims="samples")
ds = sgkit.window_by_position(ds, size=10)

percyfal avatar Oct 10 '24 10:10 percyfal

Thanks for the bug report @percyfal

jeromekelleher avatar Oct 11 '24 08:10 jeromekelleher

The obvious workaround here is to exclude contigs without variants. I tried to do so on an existing Zarr datastore, but it wasn't evident how to update a coordinate and write it back to the store, since that would require restructuring chunks. I ended up loading the VCF files with an updated VCF header that excludes contigs without variants.

percyfal avatar Dec 05 '24 09:12 percyfal
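
For reference, one way to drop variant-less contigs in memory and write the result to a new store (a rough sketch with illustrative names, assuming the usual sgkit dimension names contigs/variants; not an in-place update of the existing Zarr store):

import numpy as np
import sgkit
import xarray as xr

ds = sgkit.load_dataset("sample.extrachrom.vcz")

# contig indices referenced by at least one variant
used = np.unique(ds["variant_contig"].values)

# map old contig indices to new, dense indices
remap = np.full(ds.sizes["contigs"], -1, dtype=ds["variant_contig"].dtype)
remap[used] = np.arange(used.size, dtype=remap.dtype)

# subset the contigs dimension and remap variant_contig to match
ds2 = ds.isel(contigs=used)
ds2["variant_contig"] = xr.DataArray(remap[ds["variant_contig"].values], dims="variants")

# write to a new store; a rechunk may be needed if Zarr complains about
# chunk encodings carried over from the original store
sgkit.save_dataset(ds2, "sample.extrachrom.filtered.vcz")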

Should that be an option in vcf2zarr @percyfal?

jeromekelleher avatar Dec 05 '24 10:12 jeromekelleher

Yes, the thought has occurred to me. This would presumably be done at the explode stage: keep track of the number of variants per contig, then filter out those with zero variants? As you may guess, this is the spruce dataset. I ended up recalculating the coordinates from the input VCFs as well, since I couldn't figure out how to update the contig dimension in an existing Zarr archive.

percyfal avatar Dec 05 '24 10:12 percyfal

It's a common thing to have loads of contigs in the header with no variants. We could keep track of the number of variants for each contig during explode, and then have the option of filtering out unreferenced contigs during encode.

jeromekelleher avatar Dec 05 '24 10:12 jeromekelleher
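
A sketch of that idea with a hypothetical helper (not vcf2zarr's actual implementation): tally variants per contig while exploding, then keep only contigs with a non-zero count when encoding the contig table.

import numpy as np

def contigs_with_variants(variant_contig, n_contigs):
    # hypothetical helper: boolean mask of contigs referenced by >= 1 variant
    counts = np.bincount(variant_contig, minlength=n_contigs)
    return counts > 0

# using the example VCF above: contigs 19, 20, 21, X -> indices 0, 1, 2, 3
variant_contig = np.array([0, 0, 1, 1, 1, 1, 1, 1, 3])
mask = contigs_with_variants(variant_contig, n_contigs=4)
# mask -> [ True  True False  True]; contig "21" (index 2) would be dropped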