sgkit
Consider write path for IO libraries
Over at https://github.com/malariagen/vector-data/discussions/22#discussioncomment-590949, @alimanfoo notes that Ag1000G only releases their data as VCF files, and that it might be nice to have the same data in a PLINK-accessible format. Could sgkit handle this conversion?
Hi folks, just to mention that the ability to write data from an sgkit-style xarray dataset to PLINK format has come up again as something that would be very useful to have.
The specific use case is running ADMIXTURE, for which there is no Python implementation at the moment, AFAIK.
Just to add, a convenient workflow is to have the data in Zarr format, use sgkit and xarray to select the samples and variants for the admixture analysis, then export that selection to PLINK format.
Looks like the bed-reader package has a to_bed() function.
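For anyone wondering what such a writer has to produce, here is a minimal pure-Python sketch of the select-then-export idea, assuming the genotypes have already been reduced to per-sample A1 allele counts. It is not how sgkit or bed-reader implement it; `select`, `encode_bed`, and the toy `calls` matrix are hypothetical names made up for illustration. It only shows the PLINK 1 variant-major `.bed` encoding: a 3-byte header, then two bits per sample, four samples per byte.

```python
from math import ceil

# PLINK 1 .bed magic number (0x6c, 0x1b), then 0x01 for variant-major layout.
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])

# Two-bit codes keyed by the number of A1 alleles a sample carries.
# None stands for a missing call.
A1_COUNT_TO_CODE = {2: 0b00, None: 0b01, 1: 0b10, 0: 0b11}


def select(matrix, variant_idx, sample_idx):
    """Keep only the chosen variants (rows) and samples (columns)."""
    return [[matrix[v][s] for s in sample_idx] for v in variant_idx]


def encode_bed(a1_counts):
    """Encode a variants x samples matrix of A1 counts as .bed bytes."""
    out = bytearray(BED_MAGIC)
    for row in a1_counts:
        # Each variant occupies ceil(n_samples / 4) bytes; unused bits stay 0.
        block = bytearray(ceil(len(row) / 4))
        for i, count in enumerate(row):
            # First sample of each group of four goes in the two lowest bits.
            block[i // 4] |= A1_COUNT_TO_CODE[count] << (2 * (i % 4))
        out += block
    return bytes(out)


# Hypothetical toy data: 3 variants x 5 samples of A1 counts.
calls = [
    [0, 1, 2, None, 0],
    [2, 2, 1, 0, 1],
    [0, 0, 0, 1, None],
]

# Select two variants and all five samples, then encode the selection.
subset = select(calls, variant_idx=[0, 2], sample_idx=[0, 1, 2, 3, 4])
bed_bytes = encode_bed(subset)
```

Note that a usable PLINK fileset also needs the matching `.bim` and `.fam` files, and whether the alternate-allele counts in an sgkit dataset correspond to A1 depends on the allele order written to the `.bim`, so that mapping is an assumption here.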
Closing this, as the work is being tracked by #924 and #926.