Windowing fails for datasets where contigs lack variant sites
The function sgkit.window_by_position fails if the input dataset has contigs that lack variant sites, throwing an IndexError.
Traceback
IndexError Traceback (most recent call last)
Cell In[36], line 1
----> 1 ds = sgkit.window_by_position(ds, size=10)
File ~/dev/sgkit-dev/sgkit/sgkit/window.py:218, in window_by_position(ds, size, step, offset, variant_contig, variant_position, window_start_position, merge)
214 positions = ds[variant_position].values
215 window_start_positions = (
216 ds[window_start_position].values if window_start_position is not None else None
217 )
--> 218 return _window_per_contig(
219 ds,
220 variant_contig,
221 merge,
222 _get_windows_by_position,
223 size,
224 step,
225 offset,
226 positions,
227 window_start_positions,
228 )
File ~/dev/sgkit-dev/sgkit/sgkit/window.py:320, in _window_per_contig(ds, variant_contig, merge, windowing_fn, *args, **kwargs)
318 contig_window_stops = []
319 for i, contig in enumerate(get_contigs(ds)):
--> 320 starts, stops = windowing_fn(
321 contig, contig_bounds[i], contig_bounds[i + 1], *args, **kwargs
322 )
323 contig_window_starts.append(starts)
324 contig_window_stops.append(stops)
File ~/dev/sgkit-dev/sgkit/sgkit/window.py:429, in _get_windows_by_position(contig, start, stop, size, step, offset, positions, window_start_positions)
426 contig_pos = positions[start:stop]
427 if window_start_positions is None:
428 # TODO: position is 1-based (or use offset?)
--> 429 pos_starts = np.arange(offset, contig_pos[-1] + offset, step=step)
430 window_starts = np.searchsorted(contig_pos, pos_starts) + start
431 window_stops = np.searchsorted(contig_pos, pos_starts + size) + start
IndexError: index -1 is out of bounds for axis 0 with size 0
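The error comes from _get_windows_by_position: for a contig that has no variant sites, positions[start:stop] is empty, so contig_pos[-1] raises the IndexError. Below is a reduced sketch of the failing logic with one possible guard (simplified from the traceback above and not a tested patch; the real function also handles window_start_positions):
import numpy as np

def get_windows_by_position(start, stop, size, step, offset, positions):
    # positions[start:stop] is empty when the contig has no variant sites
    contig_pos = positions[start:stop]
    if len(contig_pos) == 0:
        # possible guard: emit no windows for a contig without variants
        return np.array([], dtype=int), np.array([], dtype=int)
    pos_starts = np.arange(offset, contig_pos[-1] + offset, step=step)
    window_starts = np.searchsorted(contig_pos, pos_starts) + start
    window_stops = np.searchsorted(contig_pos, pos_starts + size) + start
    return window_starts, window_stops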
This will often be the case for fragmented reference assemblies with many contigs. I provide a minimal working example below.
Minimal working example
The VCF below is a modified version of the file sgkit/tests/data/sample.vcf where I added contig lines to the header; note that contig 21 is declared but has no variant records, which is what triggers the failure:
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##contig=<ID=19,length=1000>
##contig=<ID=20,length=40000>
##contig=<ID=21,length=10000>
##contig=<ID=X,length=1000>
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FILTER=<ID=q10,Description="Quality below 10">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##ALT=<ID=DEL:ME:ALU,Description="Deletion of ALU element">
##ALT=<ID=CNV,Description="Copy number variable region">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
19 111 . A C 9.6 . . GT:HQ 0|0:10,15 0|0:10,10 0/1:3,3
19 112 . A G 10 . . GT:HQ 0|0:10,10 0|0:10,10 0/1:3,3
20 4370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 7330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3:.,.
20 10696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4:.,.
20 30237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:.:56,60 0|0:48:4:51,51 0/0:61:2:.,.
20 34567 microsat1 G GA,GAC 50 PASS NS=3;DP=9;AA=G;AN=6;AC=3,1 GT:GQ:DP 0/1:.:4 0/2:17:2 ./.:40:3
20 35237 . T . . . . GT 0/0 0|0 ./.
X 10 rsTest AC A,ATG,C 10 PASS . GT 0 0/1 0|2
Convert to Zarr
bgzip sample.extrachrom.vcf
tabix sample.extrachrom.vcf.gz
vcf2zarr convert sample.extrachrom.vcf.gz sample.extrachrom.vcz
and run the code below to reproduce:
import sgkit
import numpy as np
import xarray as xr  # the sample_cohort assignment below uses xr.DataArray
ds = sgkit.load_dataset("sample.extrachrom.vcz")
ds["sample_cohort"] = xr.DataArray(np.full(ds.dims['samples'], 0), dims="samples")
ds = sgkit.window_by_position(ds, size=10)
Thanks for the bug report @percyfal
The obvious workaround here is to exclude contigs without variants. I tried to do so on an existing Zarr datastore, but it wasn't evident how to update a coordinate and write it back to the store, since that would require restructuring chunks. I ended up loading the VCF files with an updated VCF header that excludes contigs without variants.
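An in-memory version of that workaround might look roughly like the sketch below (untested; it assumes the loaded dataset has a contigs dimension for contig_id/contig_length and that variant_contig can simply be renumbered against the reduced contig list):
import numpy as np
import sgkit

ds = sgkit.load_dataset("sample.extrachrom.vcz")

# contigs that are referenced by at least one variant
used = np.unique(ds["variant_contig"].values)

# renumber the old contig indices so they point into the reduced contig list
remap = np.full(ds.sizes["contigs"], -1, dtype=ds["variant_contig"].dtype)
remap[used] = np.arange(used.size, dtype=remap.dtype)

ds = ds.isel(contigs=used)
ds["variant_contig"] = (ds["variant_contig"].dims, remap[ds["variant_contig"].values])

ds = sgkit.window_by_position(ds, size=10)
This only fixes the in-memory copy; writing it back to the existing store is exactly the part that wasn't obvious.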
Should that be an option in vcf2zarr @percyfal?
Yes, the thought has occurred to me. This would presumably be done at the explode stage - basically keep track of the number of variants per contig, then filter out those with 0 variants? As you may guess, this is the spruce dataset. I ended up recalculating the coordinates from the input VCFs as well, since I couldn't figure out how to update the contig dimension in an existing Zarr archive.
It's a common thing to have loads of contigs in the header with no variants. We could keep track of the number of variants for each contig during explode, and then have the option of filtering out unreferenced contigs during encode.
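A sketch of the bookkeeping that would involve, in plain NumPy terms (illustrative names only, not existing vcf2zarr internals):
import numpy as np

def referenced_contigs(variant_contig, n_contigs):
    # count variants per contig index, then keep contigs with a nonzero count
    counts = np.bincount(variant_contig, minlength=n_contigs)
    return np.flatnonzero(counts > 0)
Encode would then drop the unreferenced contig ids and lengths and renumber variant_contig against the reduced list, much like the in-memory workaround above.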