The best way to append samples to a dataset?
Hi,
I am working with array datasets and wish to concatenate samples across multiple Zarr stores (>100). Since these are genotyping array datasets, they differ only in the samples dimension; everything else is identical. Is there a built-in, optimized function in sgkit to do that? I see concat_zarrs in the documentation (version 0.6.0), but I cannot find its source code and it seems to be deprecated. In version 0.9.0, I see concat_zarrs_optimized, but again I cannot find its source code or documentation.
Alternatively, I am simply trying the following:
import xarray as xr
import sgkit as sg
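# dslist is the list of paths to the input Zarr stores (defined elsewhere)
variables_to_concat = ['call_GQ', 'call_IGC', 'call_LRR', 'call_NORMX', 'call_NORMY', 'call_R', 'call_THETA', 'call_X', 'call_Y', 'call_genotype', 'call_genotype_mask', 'call_genotype_phased', 'sample_id']
# Concatenate only the listed variables along the samples dimension
ds = xr.open_mfdataset(dslist, concat_dim="samples", combine="nested", data_vars=variables_to_concat)
# Rechunk so that every chunk holds 100 samples
ds = ds.chunk(chunks={"samples": 100})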
sg.save_dataset(ds,"samples.zarr")
Zarr expects a uniform chunk size, and the rechunking seems to be expensive. I read the discussions on sgkit and see that you have already encountered this issue. I wanted to check whether there is an optimized function within sgkit for this kind of concatenation.
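To illustrate why the uniform chunk size forces rechunking, here is a minimal sketch in plain Python. The function name `sample_chunk_layout` and the sample counts are hypothetical, chosen only for illustration. It computes where each store's samples would land in the combined samples dimension and whether that range starts on a chunk boundary.

```python
def sample_chunk_layout(store_sizes, chunk_size):
    """For each input store, report its sample range in the combined output
    and the output chunks it touches. Assumes every store has at least one sample."""
    layout = []
    offset = 0
    for n in store_sizes:
        start, stop = offset, offset + n
        layout.append({
            "start": start,
            "stop": stop,
            "first_chunk": start // chunk_size,
            "last_chunk": (stop - 1) // chunk_size,
            # True when this store begins exactly on an output chunk boundary
            "aligned": start % chunk_size == 0,
        })
        offset = stop
    return layout

# Hypothetical sample counts for three stores, with a target chunk of 100 samples
layout = sample_chunk_layout([250, 100, 300], chunk_size=100)
for entry in layout:
    print(entry)
```

In this made-up case the first store ends in the middle of output chunk 2, so the second store has to share that chunk with it. If every store's sample count were a multiple of the chunk size, each store would map onto whole output chunks and no chunk would mix data from two stores.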
I would highly appreciate any suggestions or pointers.
Thank you.
Hi @rajwanir :wave:
I don't think we have a built-in method for doing this - the methods you refer to here were part of the old VCF conversion code, which is now deprecated in favour of vcf2zarr.
I don't know about doing this with xarray/sgkit, but this should be easy enough to do with the low-level Zarr APIs, and it is core functionality that we want to support.
Any thoughts @tomwhite?
This should be done at the Zarr-level, but concatenating >100 stores is not something we have tried yet.
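As a rough sketch of what a Zarr-level approach could look like: allocate the output array once with the total number of samples, then copy each input into its own slice along the samples axis. The helper `concat_along_samples` is hypothetical and not part of sgkit or Zarr. The arrays below are NumPy stand-ins so that the example is self-contained; the assumption is that arrays opened with the Zarr library accept the same slice reads and slice assignments.

```python
import numpy as np

def concat_along_samples(sources, out, axis=1):
    """Copy each source array into its slice of `out` along `axis`.
    Returns the total number of samples written."""
    offset = 0
    for src in sources:
        n = src.shape[axis]
        # Select everything on the other axes and [offset, offset + n) on `axis`
        region = [slice(None)] * len(src.shape)
        region[axis] = slice(offset, offset + n)
        out[tuple(region)] = src[...]
        offset += n
    return offset

# NumPy stand-ins for two call_genotype arrays, shape (variants, samples, ploidy)
a = np.zeros((4, 3, 2), dtype="i1")
b = np.ones((4, 5, 2), dtype="i1")

# Preallocate the output with the combined sample count, then fill it region by region
total_samples = a.shape[1] + b.shape[1]
out = np.empty((4, total_samples, 2), dtype="i1")
written = concat_along_samples([a, b], out)
print(written, out.shape)
```

Only one input needs to be in memory at a time with this pattern. With real Zarr arrays, writes that do not line up with the output chunk boundaries would likely involve reading and rewriting partially filled chunks, so choosing a chunk size that divides each store's sample count should help where that is possible.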
The work I've done on Cubed might be helpful here: you could try calling concat() on the arrays to concatenate them in the samples dimension (axis=1). I would try this out running on a local machine first: https://cubed-dev.github.io/cubed/user-guide/executors.html#local-single-machine-executors.