PyVCF
PyVCF copied to clipboard
Question about adding genotypes when writing a new VCF
Hi, apologies for submitting a question as an issue. I wasn't quite sure where to ask it (I couldn't find a mailing list, so please point me to it if I've just missed it!).
I'm writing VCF files from scratch using PyVCF. I've worked out how to do this (mostly). The thing I can't work out how to do from the docs, is add the genotype field.
Here's the code I'm using to add a position to the vcf.
record = vcf.model._Record(CHROM=chrom, POS=start, ID='.', REF=ref_sequence,
ALT=[vcf.model._Substitution(allele1), vcf.model._Substitution(allele2)],
QUAL='.', FILTER='PASS', INFO={'RU':repeatunit},
FORMAT='.', sample_indexes=[], samples=None)
vcf_truth.write_record(record)
And here's an example line of vcf output (note SAMPLE field is blank, as expected since I said samples=None)
##fileformat=VCFv4.1
##source=generate_stutter_vcfs.py
##reference=ucsc.hg19.fasta
##INFO=<ID=RU,Number=1,Type=String,Description="Repeat Unit">
##INFO=<ID=RL,Number=1,Type=Integer,Description="Reference Length of Repeat">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
chr3 63898360 . GGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG GGCAGCAGCAGCAGCAGCAGCAG,GGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG . PASS RU=GCA .
Where I'd like the genotype field to contain 1/2
, so it should look like this:
##fileformat=VCFv4.1
##source=generate_stutter_vcfs.py
##reference=ucsc.hg19.fasta
##INFO=<ID=RU,Number=1,Type=String,Description="Repeat Unit">
##INFO=<ID=RL,Number=1,Type=Integer,Description="Reference Length of Repeat">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
chr3 63898360 . GGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG GGCAGCAGCAGCAGCAGCAGCAG,GGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG . PASS RU=GCA GT 1/2
In case it's relevant, note that I'm writing indels, specifically microsatellites (so there could be more than two alt alleles, but I'm just trying to solve the simple case of two alt alleles at this stage).
Any advice would be appreciated!
Thanks, Harriet
Support for writing is definitely immature, but #82 may be of interest to you.
I've read through #82 and notice there is an old PR #136, the combination of which is leading me to believe that this functionality currently does not exist (at least not in master). Is that correct?
Are you suggesting I use this branch: https://github.com/jamescasbon/PyVCF/tree/wip/82-add-call-data?
At this stage, it's probably going to be easier for me to do this myself, since I really don't want to be depending on a functionality that is in such flux. Let me know if you get something merged into master.
I ran into the same issue when I wrote a script to filter individual calls(samples) within a variant, using the FT annotation. I could not figure out how to add to the FORMAT for a variant. In the end, I just used a separate script to read the vcf as a text file, and add FT to the format field that way.
I think it would be a really nice addition to pyvcf to be able to add to the FORMAT annotations, instead of everyone having their own little hacks to work around this.
@hdashnow : I am having a similar problem. Can you please share, 'how you worked out your problem of updating/writing the values in the Sample field?"
Hi @everestial. Gosh, that was a while ago. I think I just went back to creating VCFs manually.
Would it be possible for you to share your method? I am able to the INFO field and description, but cannot figure out how to add new fields in FORMAT and it's corresponding value in SAMPLE field. The discussed add_field for _Call isn't available.
I mean I literally printed lines of VCF to output. I didn't use PyVCF for the output at all. Sorry, I can't help.
No worries !