PyVCF

remove unwanted samples for a given record

Open janedanes opened this issue 12 years ago • 2 comments

Here is what I am doing:

I'm working with a very large VCF file (816 samples). Our experiment included duplicate samples to help pick out false positives. The duplicate samples are identified as:

Check_1:245098 (the unique number is the library identifier), Check_1:245012, etc.

I've written a Python program that filters out SNP sites where 6/8 duplicates have the same SNP/genotype. I then wrote a set of functions that select the best duplicate (Check_1) for that SNP site. So for one SNP site Check_1:245098 might have the higher depth of coverage, but for another SNP site Check_1:245012 might have the best depth of coverage. I want to create a consensus Check_1.

I can successfully remove the other 7 duplicates and create a new set of record.samples (809 instead of 816). In this new set of record.samples, I've removed the library identifier and chopped the name to just Check_1.

But I can't figure out how to write this to a file. writer.write_record doesn't seem to work: it just rewrites all the old duplicates. I want to write the same record but with a different set of record.samples.

I realize there is a sample filter already written, but as I am a relative newbie to Python I am having trouble understanding how to access the methods (functions) and attributes (variables) in a class.

I would really appreciate any help.

janedanes, Feb 21 '13 19:02
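For anyone landing here with the same question, one way to sidestep the writer entirely is to collapse the duplicate columns in the VCF text itself. The following is a minimal sketch using only the Python standard library, not PyVCF. It assumes an uncompressed, tab-separated VCF, duplicate columns named `<prefix>:<library id>`, and a `DP` key in the FORMAT column. The function names (`find_duplicates`, `depth`, `collapse`) are made up for this illustration, and the consensus column is appended as the last sample column.

```python
def find_duplicates(header_fields, prefix):
    """Indices of sample columns (index 9 onward) named '<prefix>:<library id>'."""
    return [i for i, name in enumerate(header_fields)
            if i >= 9 and name.split(":")[0] == prefix]


def depth(sample_field, format_keys):
    """Read DP from one sample column; a missing or '.' value counts as 0."""
    values = dict(zip(format_keys, sample_field.split(":")))
    dp = values.get("DP", ".")
    return int(dp) if dp.isdigit() else 0


def collapse(lines, prefix="Check_1"):
    """Yield VCF lines with all '<prefix>:*' columns replaced by one column.

    For every record, the duplicate with the highest DP is kept and written
    under the bare prefix name (assumption: depth is the selection criterion).
    """
    dup_idx = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            yield line  # meta-information lines pass through unchanged
            continue
        fields = line.split("\t")
        drop = set(dup_idx)
        if line.startswith("#CHROM"):
            dup_idx = find_duplicates(fields, prefix)
            drop = set(dup_idx)
            if not dup_idx:
                yield line
                continue
            kept = [f for i, f in enumerate(fields) if i not in drop]
            yield "\t".join(kept + [prefix])
            continue
        if not dup_idx:
            yield line  # nothing to collapse in this file
            continue
        format_keys = fields[8].split(":")
        best = max(dup_idx, key=lambda i: depth(fields[i], format_keys))
        kept = [f for i, f in enumerate(fields) if i not in drop]
        yield "\t".join(kept + [fields[best]])


# Tiny in-memory example with one ordinary sample and two duplicates.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tCheck_1:245098\tCheck_1:245012",
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:12\t0/1:30\t0/1:8",
    "1\t200\t.\tC\tT\t50\tPASS\t.\tGT:DP\t0/0:9\t1/1:5\t1/1:40",
]
collapsed = list(collapse(example))
```

Because the header line and the record lines are rewritten together, the output keeps the same number of sample names and sample columns. With a real file, `collapse` can be fed an open file handle and its output written line by line to a new file.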

Hello:

I am having the same issue and would appreciate some help. I am editing sample names, i.e., resetting sample.sample after the record is read from file. When I try to use rec.genotype(sample), I get a KeyError, even though I can manually verify that the sample is present in the rec.samples list.

Any advice is much appreciated.

Thanks, Matt

MCowperthwaite, Apr 09 '14 20:04
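One plausible explanation for the KeyError, assuming the record resolves sample names through a name-to-index dictionary built once from the header (PyVCF appears to keep this in a private `_sample_indexes` attribute, which should be checked against the installed version): renaming the call object does not update that dictionary. The sketch below reproduces the effect with small stand-in classes. `Call`, `Record`, and `rename_sample` are hypothetical and are not PyVCF's own code.

```python
class Call:
    """Hypothetical stand-in for a per-sample call object."""

    def __init__(self, sample, gt):
        self.sample = sample
        self.gt = gt


class Record:
    """Hypothetical stand-in for a record that looks samples up by name
    through an index built once, when the header is read."""

    def __init__(self, calls):
        self.samples = calls
        self._sample_indexes = {c.sample: i for i, c in enumerate(calls)}

    def genotype(self, name):
        # A name missing from the index raises KeyError, even if some
        # call in self.samples already carries that name.
        return self.samples[self._sample_indexes[name]]


def rename_sample(record, old, new):
    """Rename a sample and keep the name index in sync."""
    idx = record._sample_indexes.pop(old)
    record.samples[idx].sample = new
    record._sample_indexes[new] = idx


rec = Record([Call("S1", "0/1"), Call("Check_1:245098", "1/1")])

# Renaming only the call object leaves the index stale.
rec.samples[1].sample = "Check_1"
try:
    rec.genotype("Check_1")
    stale = False
except KeyError:
    stale = True

# Undo, then rename through the helper so the index follows the change.
rec.samples[1].sample = "Check_1:245098"
rename_sample(rec, "Check_1:245098", "Check_1")
found = rec.genotype("Check_1")
```

If the real index is shared between all records produced by one reader, updating it once would affect every later record, so the rename is probably best done a single time right after the reader is created.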

Hi!

This one is quite old, but in case someone needs this from PyVCF, there is actually a SampleFilter class (sample_filter.py), and it can be used this way:

import vcf
vcf.SampleFilter(infile=<input_vcf>, filters=<comma-separated-string>, outfile=<output_vcf>)

It reads the input VCF with a slightly modified Reader and uses Writer to write out the filtered VCF. The filters parameter lists the unwanted samples.

@jamescasbon Will this class remain in the package, since it is not documented? Are there any downsides to using it like this? Thanks.

Cheers

duxan, Feb 26 '17 00:02