Consider write path for IO libraries

Open hammer opened this issue 4 years ago • 3 comments

Over at https://github.com/malariagen/vector-data/discussions/22#discussioncomment-590949, @alimanfoo notes that Ag1000G only releases their data as VCF files, and that it might be nice to have the same data in a PLINK-accessible format. Could sgkit handle this conversion?

hammer, Apr 24 '21 13:04

Hi folks, just to mention that the ability to write data from an sgkit-style xarray dataset to PLINK format has come up again as something that would be very useful to have.

The specific use case right now is running ADMIXTURE, for which there is no Python implementation AFAIK.

alimanfoo, May 09 '22 09:05

Just to add, a convenient workflow is to keep the data in Zarr format, use sgkit and xarray to select the samples and variants for the admixture analysis, and then export that selection to PLINK format.

alimanfoo, May 09 '22 09:05

Looks like the bed-reader package has a to_bed() function.
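
For context on what any PLINK writer has to produce, here is a minimal, standard-library-only sketch of the PLINK 1 `.bed` encoding. It is not bed-reader's or sgkit's implementation. The function names and the `a1_allele` parameter are hypothetical, and the input is assumed to be biallelic diploid calls laid out like sgkit's `call_genotype` (variants, samples, ploidy) with `-1` marking a missing allele. Which allele should be treated as A1 is also an assumption to check against the accompanying `.bim` file.

```python
# PLINK 1 .bed magic number (0x6c 0x1b) followed by mode byte 0x01 (variant-major).
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])

# Two-bit codes keyed by the number of A1 copies in a call (None = missing).
CODE_BY_A1_COUNT = {2: 0b00, None: 0b01, 1: 0b10, 0: 0b11}


def a1_count(call, a1_allele=1):
    """Count A1 copies in one diploid call; any missing allele (-1) gives None."""
    if len(call) != 2:
        raise ValueError("PLINK .bed stores diploid calls only")
    if any(allele < 0 for allele in call):
        return None
    # Assumption: biallelic data, so every non-A1 allele is treated as A2.
    return sum(1 for allele in call if allele == a1_allele)


def encode_bed(calls, a1_allele=1):
    """Pack calls shaped (variants, samples, ploidy) into .bed file bytes."""
    out = bytearray(BED_MAGIC)
    for variant_calls in calls:
        codes = [CODE_BY_A1_COUNT[a1_count(c, a1_allele)] for c in variant_calls]
        # Four samples per byte, first sample in the two lowest-order bits.
        # Each variant starts on a fresh byte; unused high bits stay zero.
        for start in range(0, len(codes), 4):
            byte = 0
            for offset, code in enumerate(codes[start:start + 4]):
                byte |= code << (2 * offset)
            out.append(byte)
    return bytes(out)


# One variant, five samples: hom-ref, het, hom-alt, missing, het.
example_calls = [[[0, 0], [0, 1], [1, 1], [-1, -1], [1, 0]]]
print(encode_bed(example_calls).hex())  # 6c1b014b02
```

A usable PLINK fileset also needs the companion `.bim` (variant) and `.fam` (sample) text files, which this sketch does not write.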

alimanfoo, May 09 '22 09:05

Closing this, as the work is now being tracked in #924 and #926.

tomwhite, Jan 04 '23 16:01