How to parse an entire VCF file into a dataframe.

Open: abalter opened this issue 3 years ago • 1 comment

Before posting this issue, I searched the entire StackExchange network and Biostars for questions about this package and found none, so I concluded I would not get any help on those forums.

Sgkit and its predecessor scikit-allel have a host of wonderful features for filtering, exploring, and annotating VCF files. That's not what I'm after. I simply want to import an entire VCF file into a dataframe, with a separate column for each field declared in the header. So if INFO has ADP, WT, HET, etc. and FORMAT has GT, GQ, SPD, DP, RD, etc., I want each of those put in its own column.

Is there a simple command that will do that?

For extra points, it would be great to build a schema table from the header, with the columns name, type, and description.

abalter, Mar 02 '22 08:03

Hi @abalter - sgkit converts VCF files to Zarr format, which can then be opened as an Xarray dataset. So it's not a Pandas dataframe, but it should be possible to convert from Xarray to Pandas if you need to (see https://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_dataframe.html).

You can convert the VCF to Zarr by calling vcf_to_zarr and passing the fields argument to name the "extra" VCF fields you want to convert. By default only the fixed VCF fields and GT are loaded. The spec has details of how VCF fields are mapped to Zarr.
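As a rough sketch of that conversion step: the snippet below builds the prefixed field names ("INFO/..." and "FORMAT/...") and passes them to vcf_to_zarr. The helper names build_fields and convert, and the file paths in the usage comment, are made up for illustration; only vcf_to_zarr and its fields argument come from sgkit.

```python
def build_fields(info_names, format_names):
    """Prefix bare VCF field names the way vcf_to_zarr expects them."""
    return [f"INFO/{n}" for n in info_names] + [f"FORMAT/{n}" for n in format_names]


def convert(vcf_path, zarr_path, info_names, format_names):
    """Convert a VCF to Zarr, including the named INFO and FORMAT fields.

    Hypothetical wrapper, not part of sgkit.
    """
    # Imported here so build_fields can be used without sgkit installed.
    from sgkit.io.vcf import vcf_to_zarr

    vcf_to_zarr(vcf_path, zarr_path, fields=build_fields(info_names, format_names))


# Hypothetical usage; the paths are placeholders. GT is loaded by default,
# so it is not listed here:
# convert("sample.vcf.gz", "sample.zarr", ["ADP", "WT", "HET"], ["GQ", "DP", "RD"])
```

If I recall the sgkit documentation correctly, fields also accepts wildcards such as "INFO/*" and "FORMAT/*", which would avoid listing every field by hand, but it is worth checking that against the version you have installed.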

Once you have a Zarr file on disk, you can open it in sgkit using load_dataset. There are a few examples in https://pystatgen.github.io/sgkit/latest/vcf.html.
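A minimal sketch of the load-and-convert step, under some assumptions: calling to_dataframe on the whole dataset would expand every variable over every dimension, so this version groups variables by their dimensions and builds two dataframes instead, one row per variant and one row per (variant, sample) pair. The helper names vars_with_dims and to_frames are hypothetical; load_dataset and Dataset.to_dataframe are the real sgkit and Xarray calls, and the dimension names "variants" and "samples" are the ones sgkit uses.

```python
def vars_with_dims(ds, dims):
    """Names of data variables whose dimensions are exactly `dims`."""
    return [name for name, var in ds.data_vars.items() if tuple(var.dims) == tuple(dims)]


def to_frames(zarr_path):
    """Load a Zarr store written by vcf_to_zarr and return two dataframes.

    Hypothetical helper, not part of sgkit.
    """
    # Imported here so vars_with_dims can be used without sgkit installed.
    import sgkit as sg

    ds = sg.load_dataset(zarr_path)
    # One row per variant: fixed fields and scalar INFO fields.
    variants = ds[vars_with_dims(ds, ("variants",))].to_dataframe()
    # One row per (variant, sample) pair: scalar FORMAT fields.
    calls = ds[vars_with_dims(ds, ("variants", "samples"))].to_dataframe()
    return variants, calls
```

Variables with extra dimensions (for example the genotype calls, which also have a ploidy dimension, or the alleles) are left out of both frames here and would need their own handling.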

Hope that helps.

tomwhite, Mar 02 '22 16:03
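On the "extra points" part of the question (a schema table with name, type, and description), the thread does not point to an sgkit function for it, so here is a standard-library sketch that reads the ##INFO and ##FORMAT lines of a VCF header directly. The function name header_schema and the extra category column are assumptions made for this example, and the parsing only covers the common ID/Number/Type/Description layout.

```python
import re

# Matches structured header lines such as
# ##INFO=<ID=ADP,Number=1,Type=Integer,Description="Average depth">
HEADER_RE = re.compile(r"^##(INFO|FORMAT)=<(.*)>\s*$")
# key=value pairs; a value is either a quoted string or runs to the next comma.
PAIR_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')


def header_schema(lines):
    """Return one dict per INFO/FORMAT header line."""
    rows = []
    for line in lines:
        if not line.startswith("##"):
            break  # meta-information lines end at the #CHROM line
        match = HEADER_RE.match(line)
        if not match:
            continue  # other meta lines, e.g. ##fileformat or ##contig
        attrs = {key: value.strip('"') for key, value in PAIR_RE.findall(match.group(2))}
        rows.append({
            "category": match.group(1),
            "name": attrs.get("ID"),
            "type": attrs.get("Type"),
            "description": attrs.get("Description"),
        })
    return rows
```

Passing an open text-mode file handle works as the lines argument (gzip.open(path, "rt") for a compressed VCF), and pandas.DataFrame(rows) turns the result into the requested table.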