PyVCF icon indicating copy to clipboard operation
PyVCF copied to clipboard

Question about adding genotypes when writing a new VCF

Open hdashnow opened this issue 8 years ago • 8 comments

Hi, apologies for submitting a question as an issue. I wasn't quite sure where to ask it (I couldn't find a mailing list, so please point me to it if I've just missed it!).

I'm writing VCF files from scratch using PyVCF. I've worked out how to do this (mostly). The thing I can't work out how to do from the docs, is add the genotype field.

Here's the code I'm using to add a position to the vcf.

record = vcf.model._Record(CHROM=chrom, POS=start, ID='.', REF=ref_sequence,
            ALT=[vcf.model._Substitution(allele1), vcf.model._Substitution(allele2)],
            QUAL='.', FILTER='PASS', INFO={'RU':repeatunit},
            FORMAT='.', sample_indexes=[], samples=None)
vcf_truth.write_record(record)

And here's an example line of vcf output (note SAMPLE field is blank, as expected since I said samples=None)

##fileformat=VCFv4.1
##source=generate_stutter_vcfs.py
##reference=ucsc.hg19.fasta
##INFO=<ID=RU,Number=1,Type=String,Description="Repeat Unit">
##INFO=<ID=RL,Number=1,Type=Integer,Description="Reference Length of Repeat">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
chr3    63898360    .   GGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG    GGCAGCAGCAGCAGCAGCAGCAG,GGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG   .   PASS    RU=GCA  .

Where I'd like the genotype field to contain 1/2, so it should look like this:

##fileformat=VCFv4.1
##source=generate_stutter_vcfs.py
##reference=ucsc.hg19.fasta
##INFO=<ID=RU,Number=1,Type=String,Description="Repeat Unit">
##INFO=<ID=RL,Number=1,Type=Integer,Description="Reference Length of Repeat">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
chr3    63898360    .   GGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG    GGCAGCAGCAGCAGCAGCAGCAG,GGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG   .   PASS    RU=GCA  GT  1/2

In case it's relevant, note that I'm writing indels, specifically microsatellites (so there could be more than two alt alleles, but I'm just trying to solve the simple case of two alt alleles at this stage).

Any advice would be appreciated!

Thanks, Harriet

hdashnow avatar Jun 23 '16 02:06 hdashnow

Support for writing is definitely immature, but #82 may be of interest to you.

martijnvermaat avatar Jun 23 '16 20:06 martijnvermaat

I've read through #82 and notice there is an old PR #136, the combination of which is leading me to believe that this functionality currently does not exist (at least not in master). Is that correct?

Are you suggesting I use this branch: https://github.com/jamescasbon/PyVCF/tree/wip/82-add-call-data?

At this stage, it's probably going to be easier for me to do this myself, since I really don't want to be depending on a functionality that is in such flux. Let me know if you get something merged into master.

hdashnow avatar Jul 05 '16 04:07 hdashnow

I ran into the same issue when I wrote a script to filter individual calls(samples) within a variant, using the FT annotation. I could not figure out how to add to the FORMAT for a variant. In the end, I just used a separate script to read the vcf as a text file, and add FT to the format field that way.

I think it would be a really nice addition to pyvcf to be able to add to the FORMAT annotations, instead of everyone having their own little hacks to work around this.

Redmar-van-den-Berg avatar Jul 27 '16 20:07 Redmar-van-den-Berg

@hdashnow : I am having a similar problem. Can you please share, 'how you worked out your problem of updating/writing the values in the Sample field?"

everestial avatar May 11 '17 16:05 everestial

Hi @everestial. Gosh, that was a while ago. I think I just went back to creating VCFs manually.

hdashnow avatar May 12 '17 01:05 hdashnow

Would it be possible for you to share your method? I am able to the INFO field and description, but cannot figure out how to add new fields in FORMAT and it's corresponding value in SAMPLE field. The discussed add_field for _Call isn't available.

everestial avatar May 12 '17 02:05 everestial

I mean I literally printed lines of VCF to output. I didn't use PyVCF for the output at all. Sorry, I can't help.

hdashnow avatar May 12 '17 02:05 hdashnow

No worries !

everestial avatar May 16 '17 16:05 everestial