sgkit

Scalable genetics toolkit

Results 243 sgkit issues

I've been thinking about how we could run (parts of) sgkit on Cubed (#908). One thing that would help is using [`xarray.map_blocks`](https://docs.xarray.dev/en/stable/generated/xarray.map_blocks.html#xarray.map_blocks) (or [`xarray.apply_ufunc`](https://docs.xarray.dev/en/stable/generated/xarray.apply_ufunc.html)) instead of [`dask.array.map_blocks`](https://docs.dask.org/en/stable/generated/dask.array.map_blocks.html), since the Xarray...

dispatching
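
For the `xarray.map_blocks` issue above, a minimal sketch of what a block-wise computation looks like when it goes through Xarray rather than `dask.array.map_blocks`. The function `count_called`, the dimension names, and the toy data are assumptions made for illustration and are not taken from sgkit; the example uses Dask-backed chunks simply because that is the default chunked backend.

```python
import numpy as np
import xarray as xr


def count_called(block: xr.DataArray) -> xr.DataArray:
    # Hypothetical per-block function: count non-missing calls per variant,
    # with missing calls encoded as -1 in this toy example.
    return (block >= 0).sum(dim="samples")


# Toy genotype-like array, chunked along the variants dimension.
calls = xr.DataArray(
    np.array([[0, 1, -1], [1, 1, 0], [-1, -1, 0], [0, 0, 0]]),
    dims=("variants", "samples"),
).chunk({"variants": 2})

# xarray.map_blocks applies the function to each chunk as a DataArray,
# so the function body never touches the underlying chunked-array library.
counts = xr.map_blocks(count_called, calls).compute()
```

Because the per-block function only sees Xarray objects here, the same function body does not depend directly on the array backend.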

When Zarr variable chunking ([ZEP 3](https://zarr.dev/zeps/draft/ZEP0003.html)) is available we would be able to write partitions of a VCF directly into Zarr chunks that vary in size along the variants dimension....

IO

The docs for ``read_chunk_length`` currently say:
> Length (number of variants) of chunks to read from the VCF file at a time. Use this option to reduce memory usage by...

Currently when converting large VCFs to sgkit it is hard to predict the dask worker RAM usage and to subsequently tweak the `zarr_to_vcf` chunk length parameters to balance RAM usage...

IO

I noticed that @quasiben has been working on https://github.com/rapidsai/kvikio, which can read Zarr files directly to the GPU. I wonder if we have any workloads that might benefit from this...

IO

A tiny usability thing: when I have lots of tabs open it's hard to find the sgkit docs because we don't seem to have a favicon set on the docs...

documentation
enhancement

It would be useful if vcf_to_zarr wrote some debugging information to the Python log. It's a bit of a black box at the moment trying to figure out what's happening...

enhancement
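
To make the logging request above concrete, here is a rough sketch of the kind of output a conversion loop could send to the standard Python `logging` module. The logger name, the `convert_partitions` function, and the partition sizes are all hypothetical; this is not sgkit's implementation, only an illustration of the pattern.

```python
import logging

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("vcf_conversion_sketch")


def convert_partitions(partitions):
    """Hypothetical stand-in for a conversion loop; each item is a variant count."""
    total = 0
    for index, n_variants in enumerate(partitions):
        # Debug-level messages show progress without cluttering normal runs.
        logger.debug("partition %d: writing %d variants", index, n_variants)
        total += n_variants
    logger.info("finished: %d variants in %d partitions", total, len(partitions))
    return total


# A user opts in to the extra detail by lowering the log level.
logging.basicConfig(level=logging.DEBUG)
total_variants = convert_partitions([1000, 1000, 250])
```

With this pattern, messages stay silent at the default log level and become visible only when the user configures `DEBUG` output.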

I'm not finding any documentation on how to do something like:
```
ds = sg.load_dataset(ds_path)
ds = sg.count_variant_alleles(ds)
# Don't overwrite the whole thing, just write out the new variables so...
```

documentation

CI is currently failing (e.g. https://github.com/pystatgen/sgkit/actions/runs/3251065111/jobs/5362259020) because the GWAS tutorial notebook is timing out. (The default timeout is 30s; I've been running it locally for 5 min and it is still going)...

`tsinfer` needs to check the number of alleles at each variant. Currently, the only safe way to do this for a VCF-derived sgkit dataset is to iterate over the entire...