PyVCF icon indicating copy to clipboard operation
PyVCF copied to clipboard

walk_together and multiple records with the same position

Open CarlosBorroto opened this issue 11 years ago • 5 comments

Hi, thanks for putting together such a wonderful library. Could you please take a look into this issue?.

From VCF spec:

POS position: The reference position, with the 1st base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM. It is permitted to have multiple records with the same POS....

Taking into account this I believe walk_together() is not working as expected. See this example:

a.vcf:

chr1    1    .    A    T...
chr1    1    .    A    C...
chr1    2    .    C    G...

b.vcf:

chr1    1    .    A    T...
chr1    2    .    C    G...

emtis:

(a.record1, b.record1) # expected
(a.record2, None) # expecting (a.record2, b.record1) instead
(a.record3, b.record2) # expected

Do you agree with this logic? Would you consider a pull request implementing this behavior?

BTW, I ran into this issue while implementing set theory operations on VCFs using this library. I'll be happy to also submit this code is you think can be useful.

Thanks, Carlos

CarlosBorroto avatar Feb 19 '14 14:02 CarlosBorroto

Hi Carlos,

How do you handle the case where both a.vcf and b.vcf have A->T and A->C mutations, return the product?

Does the latest version of the walk_together key function enable this?

I be interested to see the set code - what does it do?

Thanks, James

jamescasbon avatar Feb 19 '14 20:02 jamescasbon

Hi James,

I haven't actually implemented this yet. I'm getting familiar with the walk_together algorithm and wanted to know if this is something you might consider for inclusion. I believe if both files have the second mutation, you should get (a.r1, b.r1) follow by (a.r2, b.r2), is this what you mean by returning the product?

Regarding the set code, again I just started, this is basically what I have so far:

def find_vcf_set(vcf_reader_a, vcf_reader_b, operation='union'):
    """
    Calculated set theory operations over two VCF files.

    @param vcf_reader_a: PyVCF reader object.
    @param vcf_reader_b: PyVCF reader object.
    @param operation: Set theory operation.
    @return: List of PyVCF records.
    """

    result = []

    for records in walk_together(vcf_reader_a, vcf_reader_b):
        if operation in ['union', 'intersection']:
            raise ValueError("Operation not implemented yet: '{0}'".format(operation))
        elif operation == 'difference':
            if records[0] and not records[1]:
                result.append(records[0])
        else:
            raise ValueError("Operation not supported: '{0}'".format(operation))

    return result

I'm sure this can be vastly improve. For example it should support more than two sets. I'll be happy to advance it as much as possible before requesting a pull.

Thanks, Carlos

CarlosBorroto avatar Feb 19 '14 20:02 CarlosBorroto

Forgot to mention, I think the key function should already work if you consider as uniques two lines for the same position with different mutations. My application does not.

CarlosBorroto avatar Feb 19 '14 20:02 CarlosBorroto

No, I mean... (a.r1, b.r1), (a.r2, b.r2), (a.r1, b.r2) (a.r2, b.r1)

It follows from your first suggestion, right?

jamescasbon avatar Feb 19 '14 20:02 jamescasbon

I see. It doesn't matter for the difference operation, I don't know about union and intersection. I haven't thought about it yet.

For the difference. {r1, r1, r2} \ {r1, r2} = {} {r1, r1, r2} \ {r1, r2, r3} = {r3}

Using walk_together I can find the difference by only adding a record if m - n = 1 with m being the numbers readers and n the numbers of None walk_together returns.

CarlosBorroto avatar Feb 19 '14 21:02 CarlosBorroto