sgkit
Error using vcf_to_zarr with HLA contigs
HLA contigs have colons and dashes in their names and I believe this isn't addressed by the vcf_to_zarr implementation, specifically the get_region_start function:
https://github.com/pystatgen/sgkit/blob/d08feba59415fa502150ad1d052e5377cdb83a94/sgkit/io/vcf/vcf_reader.py#L94-L100
This raises a "too many values to unpack" exception, because an HLA contig name such as HLA-A*01:01:01:01 itself contains several colons. I'm not very familiar with the sgkit codebase, but the fix may be as simple as changing the implementation of get_region_start to something like this:
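import re
...
def get_region_start(region: str) -> int:
    """Return the start position of the region string."""
    # Contig names may contain ":" and "-" (e.g. HLA-A*01:01:01:01), so only
    # parse coordinates when the region ends with ":<start>-<end>".
    # re.search is needed here: re.match anchors at the start of the string.
    if not re.search(r":\d+-\d+$", region):
        return 1
    contig, start_end = region.rsplit(":", 1)
    start, end = start_end.split("-")
    return int(start)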
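To make the failure mode concrete, here is a small self-contained sketch. It is not sgkit code: `naive_region_start` is a hypothetical stand-in that assumes the current function splits the region on every colon, and `region_start_sketch` is an illustrative version of the approach proposed above. The region strings are made up for the example.

```python
import re


def naive_region_start(region: str) -> int:
    # Hypothetical stand-in (assumption): split on every ":" in the region.
    if ":" not in region:
        return 1
    contig, start_end = region.split(":")
    start, end = start_end.split("-")
    return int(start)


def region_start_sketch(region: str) -> int:
    # Illustrative version: only a trailing ":<start>-<end>" counts as coordinates.
    if not re.search(r":\d+-\d+$", region):
        return 1
    contig, start_end = region.rsplit(":", 1)
    start, end = start_end.split("-")
    return int(start)


hla = "HLA-A*01:01:01:01"

# The naive split yields four parts for an HLA contig, so unpacking fails.
try:
    naive_region_start(hla)
    naive_error = None
except ValueError as exc:
    naive_error = str(exc)

# The sketch handles bare contigs and contigs followed by coordinates.
bare_hla_start = region_start_sketch(hla)
hla_with_coords_start = region_start_sketch(hla + ":100-200")
plain_start = region_start_sketch("chr1:5-10")
plain_bare_start = region_start_sketch("chr1")
```

Splitting from the right with `rsplit(":", 1)` keeps any colons in the contig name intact, and the trailing-coordinates check avoids treating the last HLA field as a position.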
This change fixed the errors in my case, and I have submitted PR #883.