sgkit
Scalable genetics toolkit
I've been thinking about how we could run (parts of) sgkit on Cubed (#908). One thing that would help is using [`xarray.map_blocks`](https://docs.xarray.dev/en/stable/generated/xarray.map_blocks.html#xarray.map_blocks) (or [`xarray.apply_ufunc`](https://docs.xarray.dev/en/stable/generated/xarray.apply_ufunc.html)) instead of [`dask.array.map_blocks`](https://docs.dask.org/en/stable/generated/dask.array.map_blocks.html), since the Xarray...
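A minimal sketch of the kind of change this would involve: wrapping a per-block NumPy function with `xarray.apply_ufunc` rather than calling `dask.array.map_blocks` directly. The function `count_alt` and the toy genotype array are made up for illustration and are not sgkit code; the sketch runs on plain NumPy data, and `dask="parallelized"` only takes effect when the input is chunked.

```python
import numpy as np
import xarray as xr


def count_alt(gt: np.ndarray) -> np.ndarray:
    """Count non-reference alleles over the trailing (ploidy) axis."""
    return (gt > 0).sum(axis=-1)


# Toy genotype calls: 2 variants x 2 samples x 2 ploidy (illustrative data only).
call_genotype = xr.DataArray(
    np.array([[[0, 1], [1, 1]], [[0, 0], [0, 1]]], dtype=np.int8),
    dims=("variants", "samples", "ploidy"),
)

# apply_ufunc moves the core dimension ("ploidy") to the last axis before
# calling count_alt, and drops it from the result.
alt_count = xr.apply_ufunc(
    count_alt,
    call_genotype,
    input_core_dims=[["ploidy"]],
    dask="parallelized",  # ignored for NumPy-backed inputs
    output_dtypes=[np.int64],
)
```

The wrapped function only ever sees NumPy blocks, so it contains no reference to a particular chunked-array backend.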
When Zarr variable chunking ([ZEP 3](https://zarr.dev/zeps/draft/ZEP0003.html)) is available we would be able to write partitions of a VCF directly into Zarr chunks that vary in size along the variants dimension....
The docs for ``read_chunk_length`` currently say: > Length (number of variants) of chunks to read from the VCF file at a time. Use this option to reduce memory usage by...
Currently when converting large VCFs to sgkit it is hard to predict the dask worker RAM usage and to subsequently tweak the `vcf_to_zarr` chunk length parameters to balance RAM usage...
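For a rough sense of scale, here is a back-of-envelope sketch. It assumes memory is dominated by one dense `int8` genotype chunk of shape `(chunk_length, n_samples, ploidy)`; real usage will be higher because of other fields, parser buffers and compression overhead. The helper name `genotype_chunk_bytes` is hypothetical.

```python
def genotype_chunk_bytes(chunk_length, n_samples, ploidy=2, itemsize=1):
    """Bytes needed to hold one dense genotype chunk in memory.

    Assumption: one value of `itemsize` bytes per call (1 byte for int8).
    """
    return chunk_length * n_samples * ploidy * itemsize


# Example: 10,000 variants per chunk, 100,000 diploid samples.
est = genotype_chunk_bytes(10_000, 100_000)
print(f"{est / 2**30:.2f} GiB per genotype chunk")
```

This is only a lower bound for a single in-flight chunk per worker thread, not a prediction of peak worker memory.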
I noticed that @quasiben has been working on https://github.com/rapidsai/kvikio, which can read Zarr files directly to the GPU. I wonder if we have any workloads that might benefit from this...
A tiny usability thing: when I have lots of tabs open it's hard to find the sgkit docs because we don't seem to have a favicon set on the docs...
It would be useful if vcf_to_zarr wrote some debugging information to the Python log. It's a bit of a black box at the moment trying to figure out what's happening...
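A minimal sketch of what such logging could look like, using only the standard `logging` module. The logger name, the `convert_regions` helper and the stand-in conversion function are all hypothetical; this is not how `vcf_to_zarr` is implemented.

```python
import logging

# Hypothetical logger name; a library would normally use logging.getLogger(__name__).
logger = logging.getLogger("vcf_conversion_sketch")


def convert_regions(regions, convert_one):
    """Convert each region in turn, logging progress along the way."""
    results = []
    for i, region in enumerate(regions, start=1):
        logger.debug("Converting region %s (%d of %d)", region, i, len(regions))
        results.append(convert_one(region))
    logger.info("Converted %d regions", len(results))
    return results


# The caller opts in to the debug output; the library itself stays quiet by default.
logging.basicConfig(level=logging.DEBUG)
out = convert_regions(["20:1-1000", "20:1001-2000"], lambda region: len(region))
```

Because the messages go through the standard logger hierarchy, users can raise or lower verbosity without any change to the conversion code.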
I'm not finding any documentation on how to do something like: ``` ds = sg.load_dataset(ds_path) ds = sg.count_variant_alleles(ds) # Don't overwrite the whole thing, just write out the new variables so...
CI is currently failing (e.g. https://github.com/pystatgen/sgkit/actions/runs/3251065111/jobs/5362259020) as the GWAS tutorial notebook is timing out. (The default timeout is 30s; I've been running it locally for 5 minutes and it is still going)...
`tsinfer` needs to check the number of alleles at each variant. Currently, the only safe way to do this for a VCF-derived sgkit dataset is to iterate over the entire...
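To make the cost concrete, here is a sketch of a full-scan allele count. It assumes the allele array is padded with empty strings up to a fixed number of columns; that padding convention is an assumption of this sketch, and the toy `variant_allele` array is made up.

```python
import numpy as np

# Toy allele table: one row per variant, padded with "" (assumed convention).
variant_allele = np.array(
    [
        ["A", "T", ""],
        ["G", "C", "T"],
        ["ACG", "", ""],
    ]
)

# Counting non-empty entries per row touches every element of the array,
# which is the whole-dataset scan the issue would like to avoid.
n_alleles = (variant_allele != "").sum(axis=1)
```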