Tom White
You could try the following:

1. Convert each VCF into a separate Zarr store.
2. Open each dataset using `sgkit.load_dataset`.
3. Set the index on each dataset to contig and...
As far as I know, duplicates need to be removed before an index can be set for merging.
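
To illustrate why the duplicates matter, here is a minimal sketch in plain Python rather than sgkit or xarray. An index maps each key to a single row, so repeated keys have to be dropped before two sets of rows can be merged by key. The `(contig, pos)` key, the field names, and the helpers `drop_duplicates` and `merge_by_key` are all assumptions made up for this illustration; they are not part of sgkit.

```python
def drop_duplicates(rows, key_fields):
    """Keep the first row seen for each key, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def merge_by_key(left, right, key_fields):
    """Outer-merge two lists of rows on key_fields; keys must be unique per input."""
    merged = {}
    for rows in (left, right):
        keys = [tuple(r[f] for f in key_fields) for r in rows]
        if len(keys) != len(set(keys)):
            raise ValueError("duplicate keys: remove them before merging")
        for key, row in zip(keys, rows):
            # Rows sharing a key across inputs are combined into one record.
            merged.setdefault(key, {}).update(row)
    return [merged[k] for k in sorted(merged)]


KEY = ("contig", "pos")  # hypothetical index fields, for illustration only

a = [
    {"contig": "1", "pos": 100, "sample_a": 0},
    {"contig": "1", "pos": 100, "sample_a": 1},  # duplicate key
    {"contig": "1", "pos": 200, "sample_a": 2},
]
b = [{"contig": "1", "pos": 200, "sample_b": 1}]

# Merging `a` directly would raise ValueError; deduplicating first succeeds.
result = merge_by_key(drop_duplicates(a, KEY), b, KEY)
```

Keeping the first occurrence is only one possible policy; which duplicate to keep depends on the data.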
I put together a small example to do this here: https://github.com/tomwhite/sgkit/blob/vcf-merge-example/merge-vcfs.ipynb. Also, this note from @timothymillar is related: https://github.com/pystatgen/sgkit/discussions/940#discussioncomment-3957761
Cubed implements [`concat`](https://data-apis.org/array-api/latest/API_specification/generated/signatures.manipulation_functions.concat.html), but perhaps xarray needs richer concat functionality than that?
Thanks @honno and @asmeurer. It looks like array-api-compat could be extended to provide a compatibility layer for Dask too? (It could also be used by Dask to provide uniform access...
Closing old issue
Closing this for the time being, can re-open if we see scalability issues.
Closing as this has been addressed
Related: https://www.coiled.io/blog/save-money-with-spot
A different way to approach this problem is to export the fields that need ragged string arrays to a different storage backend, such as Parquet, then use tools to query...