sgkit
sgkit copied to clipboard
Concat VCF's in a sample-wise manner
Would it be possible to concatenate multiple VCF files into a single Zarr store, preserving sample information, with sgkit?
As an example, I got two files, vcf1, where sample1 is stored, and vcf2, where sample2 is stored.
The following code:
vcf_to_zarr(['path/to/vcf1', 'path/to/vcf2'], "path/to/storage")
Will generate a storage with only one sample instead of two.
ds = sg.load_dataset("path/to/storage")
ds['sample_id'].compute()
>>> array(['sample1'], dtype=object)
The same result would be obtained if I first convert vcf to zarrs and concatenate zarrs
While I am aware that VCF can be concatenated with bcftools merge first, but for me that is not an option. I wonder if this operation can be performed only using sgkit.
Hi @AlksIDo - thanks for the question. In the case of multiple VCF files, they are expected to have the same samples (in fact, the same headers), so using vcf_to_zarr in the way you showed won't work.
I think the best bet would be to convert the VCFs to separate Zarr stores, then use xarray methods to combine them into a single dataset.
Do your VCF files all have the same variants in them?
Hello, @tomwhite! A VCF files could contain both different and identical variants. For instance, it could be patients from any of the TCGA cohorts.
Example VCF_1:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1
chr1 14458 . G . . PASS AC=0;AN=2 GT 0/0
chr1 14464 . A T 110 PASS AC=1;AN=2 GT 0/1
chr1 14465 . G . . PASS AC=0;AN=2 GT 0/0
VCF_2:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE2
chr1 14464 . A T 110 PASS AC=1;AN=2 GT 0/1
chr1 14465 . G . . PASS AC=0;AN=2 GT 0/0
chr1 14676 . G . . PASS AC=0;AN=2 GT 0/0
chr1 14677 . G A 85 PASS AC=1;AN=2 GT 0/1
Which method is the best to use here? According to my knowledge, xr.merge will not work since two datasets must have the same size to use it
You could try the following:
- convert each VCF into a separate Zarr store
- open each dataset using
sgkit.load_dataset - set the index on each dataset to contig and position (see code in https://pystatgen.github.io/sgkit/latest/how_do_i.html#subset-to-a-genomic-range)
- do an outer join using
xr.mergewithjoin=outer
4. do an outer join using
xr.mergewithjoin=outer
On this step I got the following error:
ValueError: cannot reindex or align along dimension 'variants' because the index has duplicate values
Is there a simple way to fix this problem or do duplicates have to be avoided?
As far as I know duplicates need to be removed to set an index for merging.
I put together a small example to do this here: https://github.com/tomwhite/sgkit/blob/vcf-merge-example/merge-vcfs.ipynb.
Also, this note from @timothymillar is related: https://github.com/pystatgen/sgkit/discussions/940#discussioncomment-3957761