The best way to append samples to a dataset?

Open rajwanir opened this issue 8 months ago • 2 comments

Hi,

I am working with genotyping array datasets and want to concatenate samples across multiple Zarr stores (>100). Because these are genotyping array datasets, they differ only in the samples dimension; everything else is identical. Is there a built-in, optimized function in sgkit for this? I see concat_zarrs in the documentation (version 0.6.0), but I cannot find its source code and it seems to be deprecated. In version 0.9.0, I see concat_zarrs_optimized, but again I cannot find its source code or documentation.

Alternatively, I am simply trying the following:

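import xarray as xr
import sgkit as sg
# dslist (defined elsewhere) holds the paths of the zarr stores to combine
variables_to_concat = ['call_GQ', 'call_IGC', 'call_LRR', 'call_NORMX', 'call_NORMY', 'call_R', 'call_THETA', 'call_X', 'call_Y', 'call_genotype', 'call_genotype_mask', 'call_genotype_phased', 'sample_id']
# Concatenate only the sample-dependent variables along the samples dimension
ds = xr.open_mfdataset(dslist, concat_dim="samples", combine="nested", data_vars=variables_to_concat)
# Rechunk so that chunks are uniform along samples before writing
ds = ds.chunk(chunks={"samples": 100})
sg.save_dataset(ds, "samples.zarr")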

Zarr expects a uniform chunk size, and the rechunking this requires seems to be expensive. I read the discussions on sgkit and see that you have already encountered this issue. I wanted to check whether there is an optimized function within sgkit for this kind of concatenation.
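To make the chunking problem concrete, here is a small pure-Python sketch of why a naive concatenation produces irregular chunks along samples. The store sizes and the helper names are hypothetical, chosen only for illustration; the chunk size of 100 matches the snippet above.

```python
def concatenated_chunks(store_sizes, chunk_size):
    """Chunk lengths along `samples` if each store keeps its own chunks."""
    chunks = []
    for n in store_sizes:
        full, rest = divmod(n, chunk_size)
        chunks.extend([chunk_size] * full)
        if rest:
            # A store whose sample count is not a multiple of chunk_size
            # leaves a short chunk in the middle of the combined array.
            chunks.append(rest)
    return chunks


def uniform_chunks(total, chunk_size):
    """Chunk lengths Zarr can store: equal sizes, only the last may be short."""
    full, rest = divmod(total, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])


store_sizes = [250, 130, 300]  # hypothetical per-store sample counts
naive = concatenated_chunks(store_sizes, 100)
regular = uniform_chunks(sum(store_sizes), 100)
print(naive)
print(regular)
```

Any mismatch between the two layouts means data has to be moved between chunks, which is where the rechunking cost comes from.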

I would highly appreciate any suggestions or pointers.

Thank you.

rajwanir, Mar 26 '25 23:03

Hi @rajwanir :wave:

I don't think we have a built-in method for doing this - the methods you refer to here were part of the old VCF conversion code, which is now deprecated in favour of vcf2zarr.

I don't know about doing this with xarray/sgkit, but this should be easy enough to do with the low-level Zarr APIs, and it is core functionality that we want to support.

Any thoughts @tomwhite?

jeromekelleher, Mar 27 '25 09:03

This should be done at the Zarr-level, but concatenating >100 stores is not something we have tried yet.

The work I've done on Cubed might be helpful here: you could try calling concat() on the arrays to concatenate them in the samples dimension (axis=1). I would try this out running on a local machine first: https://cubed-dev.github.io/cubed/user-guide/executors.html#local-single-machine-executors.

tomwhite, Mar 27 '25 09:03