PyVCF
PyVCF copied to clipboard
walk_together and multiple records with the same position
Hi, thanks for putting together such a wonderful library. Could you please take a look into this issue?.
From VCF spec:
POS position: The reference position, with the 1st base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM. It is permitted to have multiple records with the same POS....
Taking into account this I believe walk_together() is not working as expected. See this example:
a.vcf:
chr1 1 . A T...
chr1 1 . A C...
chr1 2 . C G...
b.vcf:
chr1 1 . A T...
chr1 2 . C G...
emtis:
(a.record1, b.record1) # expected
(a.record2, None) # expecting (a.record2, b.record1) instead
(a.record3, b.record2) # expected
Do you agree with this logic? Would you consider a pull request implementing this behavior?
BTW, I ran into this issue while implementing set theory operations on VCFs using this library. I'll be happy to also submit this code is you think can be useful.
Thanks, Carlos
Hi Carlos,
How do you handle the case where both a.vcf and b.vcf have A->T and A->C mutations, return the product?
Does the latest version of the walk_together key function enable this?
I be interested to see the set code - what does it do?
Thanks, James
Hi James,
I haven't actually implemented this yet. I'm getting familiar with the walk_together algorithm and wanted to know if this is something you might consider for inclusion. I believe if both files have the second mutation, you should get (a.r1, b.r1) follow by (a.r2, b.r2), is this what you mean by returning the product?
Regarding the set code, again I just started, this is basically what I have so far:
def find_vcf_set(vcf_reader_a, vcf_reader_b, operation='union'):
"""
Calculated set theory operations over two VCF files.
@param vcf_reader_a: PyVCF reader object.
@param vcf_reader_b: PyVCF reader object.
@param operation: Set theory operation.
@return: List of PyVCF records.
"""
result = []
for records in walk_together(vcf_reader_a, vcf_reader_b):
if operation in ['union', 'intersection']:
raise ValueError("Operation not implemented yet: '{0}'".format(operation))
elif operation == 'difference':
if records[0] and not records[1]:
result.append(records[0])
else:
raise ValueError("Operation not supported: '{0}'".format(operation))
return result
I'm sure this can be vastly improve. For example it should support more than two sets. I'll be happy to advance it as much as possible before requesting a pull.
Thanks, Carlos
Forgot to mention, I think the key function should already work if you consider as uniques two lines for the same position with different mutations. My application does not.
No, I mean... (a.r1, b.r1), (a.r2, b.r2), (a.r1, b.r2) (a.r2, b.r1)
It follows from your first suggestion, right?
I see. It doesn't matter for the difference operation, I don't know about union and intersection. I haven't thought about it yet.
For the difference. {r1, r1, r2} \ {r1, r2} = {} {r1, r1, r2} \ {r1, r2, r3} = {r3}
Using walk_together I can find the difference by only adding a record if m - n = 1
with m being the numbers readers and n the numbers of None walk_together returns.