Feature Request: +setGT plugin ability to set GT in a subset of samples in the VCF file
Use Case:
- multi sample vcf file from prior WGS
- New targeted seq data generated for a subset of the samples and determined to be higher quality (eg higher coverage) than prior genotyping. Eq a high coverage exome or targeted gene seq panel.
- We want to replace selected genotypes/samples in the original vcf file with genotypes from the second. This would be particularly useful for setting ./. in the original to called genotypes. If there are conflicting genotypes provide option to not update those or to allow file 2 to overwrite file GTs.
Easier Enhancement:
- Modify existing +setGT to allow a -s / -S sample list options. If this is set, only apply GT updates to those samples in the list options. This requires processing VCF2 outside to subset to just the GTs you want to update.
More Flexible but probably more complex enhancement:
- Extend +setGT to allow a second vcf.gz file to be input with the first original file. If a second vcf.gz is found, overwrite existing GTs with those GTs in the update file.
The current workaround for this is to convert to PLINK2 format and do with merge. This works well if you intend to use PLINK/PLINK2 and you can convert there but it feels that there should be a way to do this at the early VCF stage to permit use of VCF for other programs and analyses.
Just a correction for accuracy. PLINK2 still doesn't (yet) support general merge, only concatenation type merge. Therefore you have to downgrade to PLINK1.9 bed/bim/fam format for the more flexible merge, BUT as developers point out:
- PLINK1.9 wasn't designed to handle the long complex indel alleles in VCF files
- Doesn't preserve REF ALT without explicit vcf workaround options
- Is slower and has less efficient compressed data structures
So perhaps even more of an argument for BCF-level tool ;-)
I am wondering if bcftools annotate cannot be used for this task already?
If so it is not really clear from the documentation. The doc talks about TAGs but we thought this meant adding or removing TAGs, not updating them and specially not updating GTs. There is no specific mention about how to update GTs using this mechanism. It would be great to know if it could be used for this, and if so, IMO, it should be called out explicitly in the doc with an example since it would be significant piece of functionality to manage and update VCF genotypes that unfortunately is unintentionally buried away.
Thanks very much.
Yeah, there are so many features it's difficult to keep a track of them. Sometimes even for me :)
From the perspective of the program, genotype is just another tag, FORMAT/GT. This page with examples hints on this (http://samtools.github.io/bcftools/howtos/annotate.html, printed on the usage page)
Carry over all INFO and FORMAT annotations except FORMAT/GT
bcftools annotate -a src.bcf -c INFO,^FORMAT/GT dst.bcf
I don't mind extending it. Maybe you could help a little and suggest what example would be clear enough.
(More general comments - in the spirit of trying to be helpful, but I know resources are limited and it is always easier to critique)
The fundamental problem is that the main bcftools page is a reference manual not a user manual. As you have seen, I would not have thought to look in bcftools annotate to do this and stopped at the more obvious setGT plugin. My impression is that the bcftools team are trying to keep the ref manual as terse as possible rather than expansive, again probably because of resource limits.
Presumably the original idea was for this: https://samtools.github.io/bcftools/howtos/index.html
to be a sort of 'user manual' but there have no updates as along as I can remember.
Since GTs are so fundamental and the raison d'être for the vcf/bcf format, functions that directly manage them rather than just decorating them more tags and metadata are really important.
The best user solution for the main reference manual would be to create a new wrapper for the functions you already have written in annotate /setGT with something like:
bcftools gtmanage or getting away from the terse single verb bcftools manage GT
Crystal clear and at the top of the reference page so users can find and jump directly to what they are trying to do rather than spend time figuring exactly which buried settings in the software-code-organized manual may or may not work.
Your specific question: None of the examples in annotate section refer to using annotate for updating GT directly. I would say having an example where you use annotate to directly update genotypes in one vcf/bcf from a second annotate.vcf/bcf. it is not clear to me how to do this using -c. The annotate.bcf file obviously has to have the sample ids. Also what would happen if you also set -k (keep sites) or if you had eg 6 samples in the 1st.bcf file and there were only 3 in your annotate.bcf . Or what would happen if there were the same 6 in the annotate.bcf and if you can specify a subset of those to be updated using -c FMT/GT[sampleid] in some way . I'm also not clear on how you could accomplish the conditional updating - update with annotate.bcf only if the existing GT="./." for example. There are lot of details here that need to be addressed.
Eg:
annotate can be used to update/replace the genotypes (GT) in one file by those in another (annotate.bcf)
bcftools annotate -a src.bcf -c FMT/GT annotate.bcf