Tom White comments

Results: 506 comments of Tom White

VCF file zoo

Some previous work here: #432

Provide progress bar for `vcf_to_zarr`

We could document how to use Dask's progress bars with `vcf_to_zarr`.

Provide progress bar for `vcf_to_zarr`

> which did not bring up any progress bars. I believe Dask is being invoked since the Dataset returned by this call has multiple chunks, so I'm a bit confused...

Provide progress bar for `vcf_to_zarr`

> Maybe we should fork this out into a separate discussion, so we can make some high-level decisions about how to do logging?

Yes, this would be very useful!

Provide progress bar for `vcf_to_zarr`

Hmm, I just found https://github.com/tqdm/tqdm#dask-integration. Also, I wonder if we can use the file position within the VCF file or region as a rough proxy for progress...
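To make the file-position idea concrete, here is a minimal sketch using only the standard library. It is not sgkit code: `iter_records_with_progress` is a hypothetical helper, and it assumes an uncompressed VCF read as bytes. For a bgzip-compressed file the position would have to be taken from the compressed stream, which this sketch does not attempt.

```python
import os
import tempfile


def iter_records_with_progress(path):
    """Yield (record_line, fraction_of_file_read) for an uncompressed VCF.

    Hypothetical helper: progress is bytes consumed so far divided by the
    file size, which is only a rough proxy for the fraction of variants done.
    """
    total = os.path.getsize(path)
    consumed = 0
    with open(path, "rb") as f:
        for line in f:
            consumed += len(line)
            if line.startswith(b"#"):
                continue  # header lines count toward bytes read but are not yielded
            yield line, (consumed / total if total else 1.0)


# Tiny made-up VCF, just enough to exercise the helper.
example_vcf = (
    b"##fileformat=VCFv4.3\n"
    b"#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    b"1\t100\t.\tA\tT\t.\t.\t.\n"
    b"1\t200\t.\tG\tC\t.\t.\t.\n"
    b"1\t300\t.\tT\tA\t.\t.\t.\n"
)

with tempfile.TemporaryDirectory() as tmp:
    vcf_path = os.path.join(tmp, "example.vcf")
    with open(vcf_path, "wb") as out:
        out.write(example_vcf)
    fractions = [frac for _, frac in iter_records_with_progress(vcf_path)]
```

The fractions could then drive any progress display; the estimate is skewed when record sizes vary a lot along the file.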

LD prune test is failing due to differences with scikit-allel

Thanks @eric-czech! I think that explains the error I posted. I tried setting the range of `threshold` to exclude 0, but I get other failures. ``` diff --git a/sgkit/tests/test_ld.py b/sgkit/tests/test_ld.py...

Fix get_region_start to work with contig names that have colons and dashes.

Thanks for opening an issue and PR to fix it @d-laub! The code looks good to me. Would you be able to add a short unit test of this function,...

Fix get_region_start to work with contig names that have colons and dashes.

It looks like the latest failures are when running tests against real VCF files, not pre-commit failures. It will need some digging to see if that is a problem introduced...

Fix get_region_start to work with contig names that have colons and dashes.

+1 to using `bcftools` for the ground truth here.
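For anyone following this thread, a rough sketch of the parsing problem, not the code in the PR: `parse_region_start` is a hypothetical name, and the approach (prefer an exact match against known contig names, otherwise split on the last colon) is just one way to handle contig names that themselves contain colons and dashes. The HLA-style name below is used purely as an illustration.

```python
import re

# Coordinates after the last colon: "start", "start-" or "start-end".
_COORDS = re.compile(r"(\d+)(?:-(\d*))?")


def parse_region_start(region, contigs=()):
    """Return the 1-based start of a region string, defaulting to 1.

    Hypothetical helper. `contigs` is an optional collection of known contig
    names, used to resolve names that end in something looking like a position.
    """
    if region in contigs:
        return 1  # the whole string is a contig name, so no coordinates given
    _contig, sep, coords = region.rpartition(":")
    if not sep:
        return 1  # no colon at all: bare contig name
    match = _COORDS.fullmatch(coords)
    if match is None:
        return 1  # text after the last colon is not coordinates
    return int(match.group(1))
```

Without the `contigs` argument, a bare name such as `HLA-DRB1*10:01:02` is ambiguous and would be read as start 2, which is why comparing against an external tool on real files is a useful check.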

Add function to get max sizes of fields from VCF

This exists as `zarr_array_sizes`, but it is not a part of the public API since it runs sequentially. Leaving this issue open to cover the parallel implementation.
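One possible shape for the parallel version, sketched with the standard library only and not tied to how `zarr_array_sizes` is written: compute maxima per partition independently, then merge them with an element-wise max. The function names and the `max_alt_alleles` key are assumptions for illustration, and the partitions here are plain lists of VCF text lines rather than real file regions.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_max_sizes(lines):
    """Return the largest number of ALT alleles seen in one partition."""
    sizes = {"max_alt_alleles": 0}
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines
        alt = line.split("\t")[4]  # ALT is the fifth VCF column
        n_alt = 0 if alt == "." else len(alt.split(","))
        sizes["max_alt_alleles"] = max(sizes["max_alt_alleles"], n_alt)
    return sizes


def merge_max_sizes(all_sizes):
    """Combine per-partition results by taking the max of each key."""
    merged = {}
    for sizes in all_sizes:
        for key, value in sizes.items():
            merged[key] = max(merged.get(key, 0), value)
    return merged


# Two made-up partitions standing in for regions of a VCF.
partitions = [
    ["1\t100\t.\tA\tT\t.\t.\t.", "1\t200\t.\tG\tC,T\t.\t.\t."],
    ["2\t50\t.\tT\tA,C,G\t.\t.\t.", "2\t90\t.\tC\t.\t.\t.\t."],
]

with ThreadPoolExecutor() as pool:
    merged_sizes = merge_max_sizes(pool.map(partition_max_sizes, partitions))
```

Because max is associative and commutative, the merge gives the same answer regardless of how the file is partitioned or the order in which partitions finish.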