sgkit issues

Use Xarray's `apply_ufunc` rather than Dask's `map_blocks`

1

I've been thinking about how we could run (parts of) sgkit on Cubed (#908). One thing that would help is using [`xarray.map_blocks`](https://docs.xarray.dev/en/stable/generated/xarray.map_blocks.html#xarray.map_blocks) (or [`xarray.apply_ufunc`](https://docs.xarray.dev/en/stable/generated/xarray.apply_ufunc.html)) instead of [`dask.array.map_blocks`](https://docs.dask.org/en/stable/generated/dask.array.map_blocks.html), since the Xarray...

tomwhite

dispatching
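
To illustrate the idea in the issue above, here is a minimal sketch of expressing a per-call computation with `xarray.apply_ufunc` instead of calling `dask.array.map_blocks` directly. The function `count_alt_alleles` and the toy `call_genotype` array are hypothetical and are not taken from the sgkit code base; the point is only that the call goes through Xarray, so the choice of chunked array backend is left to Xarray rather than hard-coded to Dask.

```python
import numpy as np
import xarray as xr


def count_alt_alleles(gt):
    # apply_ufunc moves the core dimension ("ploidy") to the last axis,
    # so summing over axis=-1 collapses it.
    return (gt > 0).sum(axis=-1)


# Toy genotype calls: 2 variants x 2 samples x ploidy 2 (illustrative data only).
call_genotype = xr.DataArray(
    np.array([[[0, 1], [1, 1]], [[0, 0], [0, 1]]], dtype=np.int8),
    dims=("variants", "samples", "ploidy"),
)

alt_count = xr.apply_ufunc(
    count_alt_alleles,
    call_genotype,
    input_core_dims=[["ploidy"]],  # dimension consumed by the function
    dask="parallelized",           # only takes effect for chunked inputs
    output_dtypes=[np.int64],
)

print(alt_count.dims)
print(alt_count.values)
```

With a NumPy-backed input, as here, the function is simply applied eagerly; with a chunked input the same call is applied block by block.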

Single-step VCF to Zarr conversion

Once Zarr variable chunking ([ZEP 3](https://zarr.dev/zeps/draft/ZEP0003.html)) is available, we will be able to write partitions of a VCF directly into Zarr chunks that vary in size along the variants dimension....

tomwhite

IO

Docs: what is the cost of ``read_chunk_length`` in vcf_to_zarr

4

The docs for ``read_chunk_length`` currently say:

> Length (number of variants) of chunks to read from the VCF file at a time. Use this option to reduce memory usage by...

jeromekelleher

Utilities to help with and/or automatically select VCF ingestion parameters

21

Currently, when converting large VCFs to sgkit, it is hard to predict the Dask worker RAM usage and to subsequently tweak the `vcf_to_zarr` chunk length parameters to balance RAM usage...

benjeffery

IO

Can we use KvikIO?

I noticed that @quasiben has been working on https://github.com/rapidsai/kvikio, which can read Zarr files directly to the GPU. I wonder if we have any workloads that might benefit from this...

hammer

IO

Docs: favicon not set?

1

A tiny usability thing: when I have lots of tabs open it's hard to find the sgkit docs because we don't seem to have a favicon set on the docs...

jeromekelleher

documentation

enhancement

Logging from vcf_to_zarr

2

It would be useful if vcf_to_zarr wrote some debugging information to the Python log. It's a bit of a black box at the moment trying to figure out what's happening...

jeromekelleher

enhancement
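
As a rough sketch of what the request above could look like from the caller's side, the example below uses only the standard `logging` module. The logger name `example.vcf_to_zarr` and the function `convert_regions` are hypothetical stand-ins; they are assumptions for illustration and do not describe what sgkit actually logs.

```python
import logging

# Hypothetical logger name; a real library would normally use its module name.
logger = logging.getLogger("example.vcf_to_zarr")


def convert_regions(regions):
    """Stand-in for a conversion loop that reports progress per region."""
    for i, region in enumerate(regions, start=1):
        logger.debug("converting region %s (%d of %d)", region, i, len(regions))
    logger.info("converted %d regions", len(regions))
    return len(regions)


class ListHandler(logging.Handler):
    """Collects formatted messages in memory so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


# The caller opts in to debug output by attaching a handler and setting a level.
handler = ListHandler()
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)

n_converted = convert_regions(["chr1:1-1000", "chr1:1001-2000"])
print("\n".join(handler.messages))
```

Because the messages go through the standard logging machinery, a user could equally route them to a file or the console with `logging.basicConfig`.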

Docs: how do I update a dataset on file?

3

I'm not finding any documentation on how to do something like:

```
ds = sg.load_dataset(ds_path)
ds = sg.count_variant_alleles(ds)
# Don't overwrite the whole thing, just write out the new variables so...
```

jeromekelleher

documentation

`gwas_tutorial.ipynb` taking too long to run.

10

CI is currently failing (e.g. https://github.com/pystatgen/sgkit/actions/runs/3251065111/jobs/5362259020) because the GWAS tutorial notebook is timing out. (The default timeout is 30s; I've been running it locally for 5 min and it is still going)...

benjeffery

Unused alleles being `""` is confusing for downstream tools - consider `None`?

12

`tsinfer` needs to check the number of alleles at each variant. Currently, the only safe way to do this for a VCF-derived sgkit dataset is to iterate over the entire...

benjeffery
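
To make the problem in the issue above concrete, here is a small sketch in plain Python, assuming alleles are stored per variant in a fixed-width list padded with `""` for unused slots. The data and the helper `count_alleles` are hypothetical and for illustration only; this is not sgkit or tsinfer code.

```python
# Toy allele table: one row per variant, padded with "" up to the widest row.
variant_allele = [
    ["A", "T", ""],
    ["G", "C", "T"],
    ["C", "", ""],
]


def count_alleles(row, fill=""):
    """Count the entries in a row that are not the fill value."""
    return sum(1 for allele in row if allele != fill)


# Every row has to be scanned, because the row width alone says nothing
# about how many alleles are actually in use.
allele_counts = [count_alleles(row) for row in variant_allele]
print(allele_counts)  # [2, 3, 1]
```

The same helper would work with a different sentinel by passing `fill=None`, which is the alternative the issue title suggests considering.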
