Error using vcf_to_zarr with HLA contigs

Open d-laub opened this issue 3 years ago • 0 comments

HLA contigs have colons and dashes in their names and I believe this isn't addressed by the vcf_to_zarr implementation, specifically the get_region_start function:

https://github.com/pystatgen/sgkit/blob/d08feba59415fa502150ad1d052e5377cdb83a94/sgkit/io/vcf/vcf_reader.py#L94-L100

This causes a "too many values to unpack" exception since HLA contigs can be, for example, HLA-A*01:01:01:01. I'm not super familiar with the sgkit codebase but the fix may be as simple as changing the implementation of get_region_start to something like this:

import re
...
def get_region_start(region: str) -> int:
    """Return the start position of the region string."""
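    # A region without trailing ":start-end" coordinates names a whole contig,
    # which starts at position 1. Use re.search, not re.match, because
    # re.match only matches at the beginning of the string.
    if not re.search(r":\d+-\d+$", region):
        return 1
    # Split on the last colon only, since contig names may contain colons.
    contig, start_end = region.rsplit(":", 1)
    start, end = start_end.split("-")
    return int(start)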

This change fixed the errors in my case, and I submitted PR #883.
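
For illustration, here is a minimal, self-contained sketch of the failure and of the proposed behaviour. The naive_region_start helper is hypothetical: it stands in for a parser that splits on every colon and is not copied from sgkit. The get_region_start shown here is one way to write the check, using re.search with a leading colon in the pattern, since re.match only matches at the start of the string. The region strings are made-up examples.

```python
import re


def naive_region_start(region: str) -> int:
    """Hypothetical stand-in: split the region on every colon."""
    if ":" not in region:
        return 1
    # Raises ValueError when the contig name itself contains colons.
    contig, start_end = region.split(":")
    start, end = start_end.split("-")
    return int(start)


def get_region_start(region: str) -> int:
    """Return the start position of the region string."""
    # No trailing ":start-end" means the region is a whole contig.
    if not re.search(r":\d+-\d+$", region):
        return 1
    # Split on the last colon only, so colons in the contig name survive.
    contig, start_end = region.rsplit(":", 1)
    start, end = start_end.split("-")
    return int(start)


hla = "HLA-A*01:01:01:01"  # example contig name from the issue text

try:
    naive_region_start(f"{hla}:5-100")
except ValueError as exc:
    print(f"naive parser failed: {exc}")

print(get_region_start(hla))             # whole contig
print(get_region_start(f"{hla}:5-100"))  # contig with coordinates
print(get_region_start("chr1:200-300"))  # ordinary contig name
```

One caveat with this sketch: a contig whose name itself ends in a colon followed by digits, a dash, and digits would be indistinguishable from a contig plus coordinates, so it would be parsed as a region with coordinates.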

d-laub, Aug 03 '22 19:08