PyVCF icon indicating copy to clipboard operation
PyVCF copied to clipboard

add/edit call data

Open alimanfoo opened this issue 12 years ago • 18 comments

I'd really like a way to add and edit format fields to the sample call data when writing a VCF file. Specifically, what I'm looking to do is add the 'FT' format field, then fill it in for each call. Any suggestions for a workaround much appreciated.

alimanfoo avatar Jan 09 '13 16:01 alimanfoo

It is a bit of a hack, but currently I use something like this (this is for INFO, but I assume you could do something similar for FORMAT):

from random import random
from vcf.parser import _Info as VcfInfo, field_counts as vcf_field_counts
import vcf

reader = vcf.Reader(open('original.vcf'))

reader.infos['EXAMPLE_INFO'] = VcfInfo(
    'EXAMPLE_INFO', vcf_field_counts['A'], 'Float',
    'Example info value for each allele')

writer = vcf.Writer(open('with_example_info.vcf', 'w'), reader,
                    lineterminator='\n')

for record in reader:
    record.add_info('EXAMPLE_INFO', [random() for _ in record.ALT])
    writer.write_record(record)

It's not pretty, but I think in general modifying and writing VCF files is not covered well (yet) in PyVCF.

martijnvermaat avatar Jan 09 '13 16:01 martijnvermaat

Its not easy at the minute. For performance reasons, we store the data from each cell in a namedtuple (see https://github.com/jamescasbon/PyVCF/blob/master/vcf/model.py#L529 ). These are immutable, and hence its hard to add data to a CallData.

There are two ways I can imagine doing this off the top of my head:

  1. add a mode that allows you to return mutable CallData objects (using dict or something).
  2. add a method to CallData to add a field

Any opinions, patches on these approaches welcome ;)

jamescasbon avatar Jan 14 '13 11:01 jamescasbon

Hi Martijn,

Thanks for the reply, I've been using that pattern to add INFO fields to the header and INFO values to variants and it's working well. The problem I have is that I now want to modify the actual call data, not just add a new FORMAT field to the header, and as James says the call data is currently immutable because it's stored in a named tuple.

alimanfoo avatar Jan 15 '13 12:01 alimanfoo

Checkout this branch... https://github.com/jamescasbon/PyVCF/tree/wip/82-add-call-data

In [1]: import vcf
In [2]: r = vcf.Reader(open('vcf/test/miseq.vcf'))
In [3]: c = r.next().samples[0]
In [4]: c.add_field('FOO', 'BAR')
In [5]: c
Out[5]: Call(sample=Sample1, CallData(GT=0/1, AD=[244, 6], DP=250, GQ=99.0, PL=[374, 0, 3216], VF=0.024, FOO=BAR))

So it works, but no effort is made to maintain ordering or consistency between the fields expected from the record and in the call data. Thoughts on a suitable API?

jamescasbon avatar Jan 16 '13 12:01 jamescasbon

@alimanfoo ?? Any feedback?

jamescasbon avatar Jan 20 '13 20:01 jamescasbon

Hi James, apologies, I've had my head in another problem. Thanks for scuba quick response, I'll try out your patch ASAP and get back to you re the API.

On 20 Jan 2013, at 20:26, James Casbon [email protected] wrote:

@alimanfoo https://github.com/alimanfoo ?? Any feedback?

— Reply to this email directly or view it on GitHubhttps://github.com/jamescasbon/PyVCF/issues/82#issuecomment-12476793.

alimanfoo avatar Jan 20 '13 21:01 alimanfoo

Suggested branch changes in https://github.com/jamescasbon/PyVCF/tree/wip/82-add-call-data are not going to be included in the master branch and future releases?? Being able to add unexistent FORMAT fields is a quite interesting fact that can easily be performed with the mentioned changes in model.py. At least I'm using them in this way, for including allele frequencies in the format fields:

            c = record.samples[index]
            try:
                c.add_field('FREQ', 'bar')
            except ValueError:
                pass

The issue is still open, any plans to close it? Meanwhile, the "easiest" way to do it is to create a new record copying all information from the parsed one, create a new namedtuple (named CallData), create new values for that and add modified values to the new tuple. In the end one sample.data = newcalldata. I had to use it to modify some values, like this:

# vcf_parser is a VCFReader
# vcf_writer is a VCFWriter
import collections
import copy

f_keys = vcf_parser.formats.keys() #it's an ordered dict
record = vcf_parser.next()
new_record = copy.deepcopy(record)
for sx in range(len(record.samples)):
  new_record.samples[sx].data = collections.namedtuple('CallData', f_keys)
  f_vals = [record.samples[sx].data[vx] for vx in range(len(f_keys))]
  handy_dict = dict(zip(f_keys, f_vals))
  for f in ['EQ', 'SQ', 'NQ', 'LQ', 'RQ']: #some fields i wanted to modify
    handy_dict[f] = [record.samples[sx][f][0]]
  if handy_dict['GT'] == '2': # I also wanted to modify GT string here
    handy_dict['GT'] = '0'
  new_vals = [handy_dict[x] for x in f_keys]  
  # finally set CallData
  new_record.samples[sx].data = new_record.samples[sx].data._make(new_vals)
vcf_writer.write_record(del_record)

It's ugly, but at least it works. I'm not sure if the deepcopy is appropriate here

dawe avatar May 14 '14 15:05 dawe

@dawe : You can check the open pull request in #136 . I'm currently dealing with most of the common vcf edit tasks (at least those affecting my daily work) in the fork related to the pull request (https://github.com/gitanoqevaporelmundoentero/PyVCF). I'm actively using this fork until it gets merged with the official project.

The issue hasn't been resolved yet, so let's keep this one open. It contains some useful links and context and can still be used for general discussion.

The specific implementation in #136 can be discussed there.

martijnvermaat avatar May 14 '14 18:05 martijnvermaat

I think this is easily solvable by using recordtype as a drop-in replacement for namedtuple: https://pypi.python.org/pypi/recordtype

jdidion avatar Aug 10 '14 14:08 jdidion

I've read the discussion above and in #136 but am still unclear -- is there currently a way to add/edit genotype fields?

daniellieber-claritas avatar Mar 05 '15 03:03 daniellieber-claritas

Since I didn't get any feedback on the open pull request in https://github.com/jamescasbon/PyVCF/pull/136, I continued developing the add/edit functions on a separate and private repository, which is currently working. In next weeks I'll open a new pull request with my changes, and I hope that this time we will get this PR merged...

Hi, I am deciding if I should use PyVCF to convert my proprietary tab separated text file to a .vcf file. I was planning on using PyVCF to make VCF record objects from my data then use the VCF writer object to write the VCF records to a file. Can I do this with the current version of PyVCF? Thank you for your help.

jaebird123 avatar Feb 10 '17 22:02 jaebird123

@martijnvermaat @jamescasbon @alimanfoo @gitanoqevaporelmundoentero

Not, sure why but I am still having a problem:

c = record.samples[2] 
c.add_field('foo', 'bar')

Traceback (most recent call last): File "/home/everestial007/PycharmProjects/stitcher/pHASE-Stitcher-Markov/markov_final_test/phase_to_vcf.py", line 111, in c.add_field('foo', 'bar') AttributeError: '_Call' object has no attribute 'add_field'

I also tried another method:

In [3]: c = next(r).samples[0]
In [4]: c.add_field('FOO', 'BAR')

and still getting the same kind of AtrributeError

everestial avatar May 11 '17 15:05 everestial

@jamescasbon @martijnvermaat

I have to add that there is no add_field method in the file module.py. https://github.com/jamescasbon/PyVCF/blob/master/vcf/model.py

Any suggestions on how I can use add_field method???

everestial avatar May 11 '17 16:05 everestial

I do it like this, you can play with it a little bit.

from collections import namedtuple

for record in reader:
        new_gt = '2/2'
        new_CallData = namedtuple('CallData', samp_fmt._fields)
        calldata = [new_gt] + list(record.samples[0].data[1:])
        record.samples[0].data = new_CallData(*calldata)

tomwitz avatar Feb 21 '21 17:02 tomwitz

@tomwitz this works for me. just make sure to add samp_fmt = sample.data._fields

healthcare-mikeli avatar Sep 15 '22 02:09 healthcare-mikeli