sgkit icon indicating copy to clipboard operation
sgkit copied to clipboard

Concat VCF's in a sample-wise manner

Open AlksIDo opened this issue 3 years ago • 6 comments

Would it be possible to concatenate multiple VCF files into a single Zarr store, preserving sample information, with sgkit?

As an example, I got two files, vcf1, where sample1 is stored, and vcf2, where sample2 is stored.

The following code:

vcf_to_zarr(['path/to/vcf1', 'path/to/vcf2'], "path/to/storage")

Will generate a storage with only one sample instead of two.

ds = sg.load_dataset("path/to/storage")
ds['sample_id'].compute()
>>> array(['sample1'], dtype=object)

The same result would be obtained if I first convert vcf to zarrs and concatenate zarrs

While I am aware that VCF can be concatenated with bcftools merge first, but for me that is not an option. I wonder if this operation can be performed only using sgkit.

AlksIDo avatar Apr 12 '22 10:04 AlksIDo

Hi @AlksIDo - thanks for the question. In the case of multiple VCF files, they are expected to have the same samples (in fact, the same headers), so using vcf_to_zarr in the way you showed won't work.

I think the best bet would be to convert the VCFs to separate Zarr stores, then use xarray methods to combine them into a single dataset.

Do your VCF files all have the same variants in them?

tomwhite avatar Apr 14 '22 14:04 tomwhite

Hello, @tomwhite! A VCF files could contain both different and identical variants. For instance, it could be patients from any of the TCGA cohorts.

Example VCF_1:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE1
chr1    14458   .       G       .       .       PASS    AC=0;AN=2       GT      0/0
chr1    14464   .       A       T       110     PASS    AC=1;AN=2       GT      0/1
chr1    14465   .       G       .       .       PASS    AC=0;AN=2       GT      0/0

VCF_2:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE2
chr1    14464   .       A       T       110     PASS    AC=1;AN=2       GT      0/1
chr1    14465   .       G       .       .       PASS    AC=0;AN=2       GT      0/0
chr1    14676   .       G       .       .       PASS    AC=0;AN=2       GT      0/0
chr1    14677   .       G       A       85      PASS    AC=1;AN=2       GT      0/1

Which method is the best to use here? According to my knowledge, xr.merge will not work since two datasets must have the same size to use it

AlksIDo avatar Apr 14 '22 18:04 AlksIDo

You could try the following:

  1. convert each VCF into a separate Zarr store
  2. open each dataset using sgkit.load_dataset
  3. set the index on each dataset to contig and position (see code in https://pystatgen.github.io/sgkit/latest/how_do_i.html#subset-to-a-genomic-range)
  4. do an outer join using xr.merge with join=outer

tomwhite avatar Apr 15 '22 14:04 tomwhite

4. do an outer join using xr.merge with join=outer

On this step I got the following error: ValueError: cannot reindex or align along dimension 'variants' because the index has duplicate values

Is there a simple way to fix this problem or do duplicates have to be avoided?

AlksIDo avatar Apr 18 '22 11:04 AlksIDo

As far as I know duplicates need to be removed to set an index for merging.

tomwhite avatar Apr 20 '22 11:04 tomwhite

I put together a small example to do this here: https://github.com/tomwhite/sgkit/blob/vcf-merge-example/merge-vcfs.ipynb.

Also, this note from @timothymillar is related: https://github.com/pystatgen/sgkit/discussions/940#discussioncomment-3957761

tomwhite avatar Nov 01 '22 12:11 tomwhite