Ben Jeffery issues

Results: 116 issues of Ben Jeffery

VCF parsing memory usage

I'm attempting to parse some large VCFs. Initial attempts failed due to dask worker memory exhaustion. Will detail results of investigation here.

Prevent parts of VCF being reparsed

Fix https://github.com/pystatgen/sgkit-publication/issues/35 by telling dask that it can release a future for part of a VCF parse when it is complete. This prevents re-parsing when a worker is restarted.

Utilities to help with and/or automatically select VCF ingestion parameters

Currently when converting large VCFs to sgkit it is hard to predict the dask worker RAM usage and to subsequently tweak the `vcf_to_zarr` chunk length parameters to balance RAM usage...
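
As a rough illustration of the trade-off, the uncompressed size of one genotype chunk can be estimated from the chunk shape. This is a back-of-the-envelope sketch, not an sgkit utility: `estimate_chunk_bytes` and `max_chunk_length` are hypothetical helpers, and the defaults assume diploid calls stored as one byte each. Real worker memory use will be higher because of other fields and parsing overhead.

```python
def estimate_chunk_bytes(chunk_length, chunk_width, ploidy=2, itemsize=1):
    """Uncompressed bytes for one (variants, samples, ploidy) genotype chunk.

    Assumption: one `itemsize`-byte integer per allele call.
    """
    return chunk_length * chunk_width * ploidy * itemsize


def max_chunk_length(budget_bytes, chunk_width, ploidy=2, itemsize=1):
    """Largest chunk length whose genotype chunk fits within budget_bytes."""
    return budget_bytes // (chunk_width * ploidy * itemsize)


# 10,000 variants x 1,000 samples of diploid calls, one byte per call.
chunk_bytes = estimate_chunk_bytes(10_000, 1_000)

# Largest chunk length that keeps a 1,000-sample chunk under a 20 MB budget.
longest = max_chunk_length(20_000_000, 1_000)
```

Only the genotype array is modelled here; per-sample fields with wider dtypes would need their own `itemsize`.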

`gwas_tutorial.ipynb` taking too long to run.

CI is currently failing (e.g. https://github.com/pystatgen/sgkit/actions/runs/3251065111/jobs/5362259020) as the GWAS tutorial notebook is timing out. (The default timeout is 30s; I've been running it locally for 5 min and it is still going)...

Unused alleles being `""` is confusing for downstream tools - consider `None`?

`tsinfer` needs to check the number of alleles at each variant. Currently, the only safe way to do this for a VCF-derived sgkit dataset is to iterate over the entire...
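
To make the problem concrete, here is a minimal sketch of counting the alleles at each variant when unused slots are padded with a fill value. It assumes the alleles are available as a rectangular variants-by-alleles table of strings; `count_alleles` is a hypothetical helper, not part of sgkit or `tsinfer`.

```python
def count_alleles(variant_allele, fill=""):
    """Count the alleles at each variant, ignoring padding equal to `fill`."""
    return [sum(1 for allele in row if allele != fill) for row in variant_allele]


# Three variants, padded to the maximum of three alleles with "".
variant_allele = [
    ["A", "T", ""],
    ["G", "C", "T"],
    ["C", "", ""],
]
counts = count_alleles(variant_allele)
```

Every row has to be scanned to get the counts, which is the cost described above; passing `fill=None` shows how the same check would look with `None` as the padding value.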

Comparing datasets - remapping alleles.

We often have the use case of comparing two or more datasets where the corresponding sites have different allele mappings. For example, the ordering could be different, or one dataset...
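
The reordering case can be sketched for a single site as follows. This is an illustration under stated assumptions rather than an existing API: `allele_mapping` and `remap_genotypes` are hypothetical helpers, genotypes are allele indices with `-1` meaning missing, and an allele absent from the target dataset is also treated as missing.

```python
def allele_mapping(source_alleles, target_alleles):
    """Map each source allele index to its index in target_alleles.

    Alleles that do not occur in the target map to None.
    """
    target_index = {allele: i for i, allele in enumerate(target_alleles)}
    return [target_index.get(allele) for allele in source_alleles]


def remap_genotypes(genotypes, mapping, missing=-1):
    """Rewrite allele indices for one site using `mapping`.

    Negative (missing) calls and calls with no target allele become `missing`.
    """
    return [
        [missing if g < 0 or mapping[g] is None else mapping[g] for g in call]
        for call in genotypes
    ]


# The same site in two datasets, with the allele order swapped.
mapping = allele_mapping(["A", "T"], ["T", "A", "G"])

# Calls for three diploid samples in the source dataset.
remapped = remap_genotypes([[0, 1], [1, 1], [-1, 0]], mapping)
```

After remapping, the calls use the target dataset's allele indices and can be compared element by element.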

Docs have floating top bar on `latest`?

Here's a screenshot of `https://pystatgen.github.io/sgkit/latest/`: ![Screenshot from 2023-01-16 14-47-44](https://user-images.githubusercontent.com/8552/212705894-cc9eb052-7a00-41c7-b43c-49f164cf8aec.png) See the "Skip to main content" bar; it doesn't go away if you click it either.

documentation
