Ben Jeffery

Results: 116 issues by Ben Jeffery

I'm attempting to parse some large VCFs. Initial attempts failed due to dask worker memory exhaustion. Will detail results of investigation here.

Fix https://github.com/pystatgen/sgkit-publication/issues/35 by telling dask that it can release a future for part of a VCF parse when it is complete. This prevents re-parsing when a worker is restarted.

Currently, when converting large VCFs to sgkit, it is hard to predict the dask worker RAM usage and then tweak the `vcf_to_zarr` chunk length parameters to balance RAM usage...

IO

CI is currently failing (e.g. https://github.com/pystatgen/sgkit/actions/runs/3251065111/jobs/5362259020) because the GWAS tutorial notebook is timing out. (The default timeout is 30s; I've been running it locally for 5 min and it is still going)...

`tsinfer` needs to check the number of alleles at each variant. Currently, the only safe way to do this for a VCF-derived sgkit dataset is to iterate over the entire...

We often have the use case of comparing two or more datasets where the corresponding sites have different allele mappings. For example, the ordering could be different, or one dataset...
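As a rough illustration of the "different ordering" case, here is a minimal sketch in plain Python, not sgkit's API. The function names `allele_index_map` and `remap_genotypes`, the example alleles, and the use of `-1` for a missing or unmatched allele are all assumptions made for this example.

```python
from typing import Dict, List, Sequence


def allele_index_map(
    source_alleles: Sequence[str], target_alleles: Sequence[str]
) -> Dict[int, int]:
    """Map each allele index in the source ordering to its index in the target.

    Alleles absent from the target map to -1 (an assumption of this sketch).
    """
    target_index = {allele: i for i, allele in enumerate(target_alleles)}
    return {
        i: target_index.get(allele, -1)
        for i, allele in enumerate(source_alleles)
    }


def remap_genotypes(
    genotypes: List[List[int]], mapping: Dict[int, int]
) -> List[List[int]]:
    """Rewrite allele indices in each call using the given mapping.

    Negative values are treated as missing calls and passed through unchanged.
    """
    return [[mapping[a] if a >= 0 else a for a in call] for call in genotypes]


# Hypothetical site: the same alleles, listed in a different order
# in the two datasets.
mapping = allele_index_map(["A", "T", "G"], ["T", "A", "G"])
remapped = remap_genotypes([[0, 1], [2, -1], [1, 1]], mapping)
```

With the genotypes expressed against a shared allele ordering, the two datasets can be compared site by site. A real implementation would presumably operate on arrays rather than nested lists.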

Here's a screenshot of `https://pystatgen.github.io/sgkit/latest/`: ![Screenshot from 2023-01-16 14-47-44](https://user-images.githubusercontent.com/8552/212705894-cc9eb052-7a00-41c7-b43c-49f164cf8aec.png) Note the "Skip to main content" bar, which doesn't go away even if you click it.

documentation

Due to https://github.com/pydata/xarray/issues/7292 `zarr` arrays with `fill_value` set get their `dtype` changed to `float32`. This causes the dataset to fail when opened with `sgkit`. `fill_value` is used in `zarr` to...

upstream

Fixes #2838. Note that this is a fairly breaking change that we should think about, given that the default is for msprime output to require the new flag to `write_vcf`.