pybedtools icon indicating copy to clipboard operation
pybedtools copied to clipboard

VCF feature interval length handling

Open cfljam opened this issue 8 years ago • 3 comments

Im interested in the handling of indels in VCF files . I have created a bedtool from VCF that includes an indel with reference length 18 but the interval has length property 1. This seems inconsistent-have I missed something in the (excellent ) documentation?

targetBed[1].length

1

targetBed[1].fields

['CHR1',
 '4183',
 '.',
 'CGAAAGACAGAAAGACAA',
 'CGAAAGACAGAAAGACAGAAAGACAA',
 '333.339',
 '.',
 'AB=0.4;ABP=7.35324;AC=1;AF=0.0384615;AN=26;AO=20;CIGAR=1M8I17M;DP=710;DPB=888.611;DPRA=0.909091;EPP=3.44459;EPPR=24.6295;GTI=0;LEN=8;MEANALT=2;MQM=60;MQMR=59.9226;NS=13;NUMALT=1;ODDS=31.2879;PAIRED=1;PAIREDR=0.984825;PAO=157.5;PQA=4854.5;PQR=4854.5;PRO=157.5;QA=596;QR=23390;RO=659;RPP=6.91895;RPPR=13.7161;RUN=1;SAF=12;SAP=4.74748;SAR=8;SRF=342;SRP=5.06974;SRR=317;TYPE=ins;technology.Illumina=1',
......

Intersect is as expected

targetBed.intersect(BedTool('CHR1 4187 4188',from_string= True))[0]
Out[43]:
Interval(CHR1:4183-4184)

cfljam avatar Mar 02 '17 17:03 cfljam

The VCF parser in pybedtools is pretty bare-bones: https://github.com/daler/pybedtools/blob/master/pybedtools/cbedtools.pyx#L671

However the calls out to bedtools should not be affected by this.

I think I'd like to move to proper VCF parsing with e.g. cyvcf the same way I moved to proper SAM/BAM parsing using pysam, but this will be a long-term project since it will take a lot of refactoring.

daler avatar Mar 02 '17 18:03 daler

FWIW, the pysam folks have made a lot of improvements on their VCF parser, which may be helpful if you do decide to refactor.

kyleabeauchamp avatar Mar 02 '17 18:03 kyleabeauchamp

Ah, thanks @kyleabeauchamp, good to know.

daler avatar Mar 02 '17 18:03 daler