
Potential error in vcf.model._Record._set_start_and_end

Open kotoroshinoto opened this issue 9 years ago • 7 comments

In vcf.model._Record._set_start_and_end, I notice that the first line initializes the affected start and end like this: self.affected_start = self.affected_end = self.POS

Then, after doing the start/end calculations in a zero-based manner, it follows up with: self.affected_start = min(self.affected_start, start); self.affected_end = max(self.affected_end, end)

Wouldn't self.affected_start and self.affected_end still be in a 1-based coordinate state (based on the value of POS) when min and max run, while the start and end variables have been computed in a zero-based coordinate system?

kotoroshinoto avatar Jan 05 '16 20:01 kotoroshinoto

Keep in mind that for multi-base variants, the first base of REF and ALT will be the same (i.e., unchanged), so the affected region should not always include the first REF base.

I think the code works as advertised. Can you think of a concrete example where it doesn't?
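One way to look for such an example is a small standalone model of the pattern quoted in the issue. This is only a sketch, not PyVCF's implementation: allele_interval and affected_region are hypothetical helpers, and the per-allele rules (drop a shared leading base, otherwise cover the whole REF) are simplifying assumptions. Under those rules every per-allele interval satisfies start <= POS <= end, so seeding both bounds with POS never changes the result.

```python
def allele_interval(pos, ref, alt):
    """Zero-based, half-open reference interval touched by one ALT allele.

    Hypothetical helper for illustration only; the rules below are
    simplifying assumptions, not PyVCF's actual code.
    """
    multi_base = len(ref) > 1 or len(alt) > 1
    if multi_base and ref[0] == alt[0]:
        # Shared leading base is unchanged: skip it, cover the rest of REF.
        # For a pure insertion (len(ref) == 1) this is the empty interval.
        return pos, pos + len(ref) - 1
    # Otherwise every REF base is affected, starting at POS (1-based),
    # which is index POS - 1 in zero-based coordinates.
    return pos - 1, pos - 1 + len(ref)


def affected_region(pos, ref, alts):
    """Combine all per-allele intervals the way the issue describes."""
    start = end = pos  # same seed as in the line quoted in the issue
    for alt in alts:
        s, e = allele_interval(pos, ref, alt)
        assert s <= pos <= e  # the seed can never win against a real interval
        start = min(start, s)
        end = max(end, e)
    return start, end


print(affected_region(10, "A", ["T"]))           # (9, 10)
print(affected_region(10, "ACG", ["A"]))         # (10, 12)
print(affected_region(10, "A", ["AT"]))          # (10, 10)
print(affected_region(10, "ACG", ["A", "TCG"]))  # (9, 12)
```

In this model, POS read as a zero-based value is both the largest possible start and the smallest possible end, which is why mixing it into min and max is harmless here.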

By the way, you can find some more discussion on handling missing ALT alleles in the corresponding pull request #161.

martijnvermaat avatar Jan 05 '16 22:01 martijnvermaat

I don't know the format specification as well as you do. I assume the specification for multi-base variants states that the first base should be the same?

Jumping between 0-based and 1-based coordinates gets tricky. I'll trust that you know what you're doing; it just seemed like it was comparing across different indexing systems. (If you intended for the effect to effectively be +1, then it's all good.)

kotoroshinoto avatar Jan 05 '16 23:01 kotoroshinoto

If I'm looking to compare the VCF to entries in a MAF file, I assume I ought to be using affected_start and affected_end?

Should I be cutting the first bases off for the MNV records, then? Is there an equivalent affected_ref, and an affected version of the ALT?

kotoroshinoto avatar Jan 05 '16 23:01 kotoroshinoto

I'm not sure what a MAF file is, but the affected_start and affected_end fields give you exactly the complete region (zero-based, open-ended) on the reference that is affected (by all alternative alleles combined). There currently is no shortcut to get the affected region by only one of the alternative alleles.
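For comparing against a format that stores 1-based, inclusive positions, the zero-based, open-ended region can be converted as sketched below. The helper name to_one_based_inclusive is hypothetical, and reporting the two flanking bases for an empty (insertion-only) region is an assumption that should be checked against the target format's own specification.

```python
def to_one_based_inclusive(start, end):
    """Convert a zero-based, half-open interval [start, end) to 1-based inclusive.

    Hypothetical helper for illustration. An empty interval (start == end,
    e.g. a pure insertion) covers no reference base, so the two flanking
    bases are returned instead; that convention is an assumption.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    if start == end:
        # The boundary at zero-based offset `start` lies between the
        # 1-based bases `start` and `start + 1`.
        return start, start + 1
    # Shift the start by one; the exclusive zero-based end equals the
    # inclusive 1-based end.
    return start + 1, end


print(to_one_based_inclusive(9, 10))   # (10, 10): a single base at position 10
print(to_one_based_inclusive(10, 12))  # (11, 12): two bases after position 10
print(to_one_based_inclusive(10, 10))  # (10, 11): bases flanking an insertion
```

With a PyVCF record, the inputs would be record.affected_start and record.affected_end, which cover all alternative alleles combined.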

martijnvermaat avatar Jan 06 '16 15:01 martijnvermaat

It's a format used by the TCGA project. It has some similarities to the other annotation data types, but it also has some sample information and columns for base calls in the reference AND matched normals.

kotoroshinoto avatar Jan 06 '16 22:01 kotoroshinoto

Luckily the VCF file from COSMIC doesn't appear to contain any multi-ALT entries, so that simplified my life a bit.

kotoroshinoto avatar Jan 06 '16 22:01 kotoroshinoto

TY for your response and assistance thus far.

kotoroshinoto avatar Jan 06 '16 22:01 kotoroshinoto