hap.py
hap.py copied to clipboard
Buffer overflow in hap.py’s `preprocess` program
At https://github.com/Illumina/hap.py/blob/6c907ce3b02956bc239022db6edea7c48d6ddb8b/src/c++/lib/variant/VariantReader.cpp#L735, the size allocated for the array ad
is adcount
, whereas in the function https://github.com/Illumina/hap.py/blob/6c907ce3b02956bc239022db6edea7c48d6ddb8b/src/c++/lib/tools/BCFHelpers.cpp#L532
(which gets called at VarientReader:738 with ad
) writes values.size()
elements (
where values
is a vector resembling the elements in the AD
entry of a sample), causing a buffer overflow.
This could either be solved by: exiting on an incorrect AD field, truncating the interpreted AD
fields so that they are of correct length or making the size of AD
be the size of the actualAD
fields (I haven’t actually looked though the code to see how this code path is used and what would be appropriate).
This bug can be shown to happen in the following vcf (I’ve exaggerated the number of AD fields to make problems happen). This should segfault on the next delete []
statement due to heap corruption:
##fileformat=VCFv4.1
##contig=<ID=chr1,length=10000>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1
chr1 2 . AAT T 1 PASS . GT:AD 1/1:1,1,1,1,2,3,2,4,1,3,5,5,3,2,12,1,3,3,1
Thanks for the report!
I think the temporary workaround for this would be to use bcftools annotate to remove AD (which isn't used by hap.py).
In the future, the code that uses VariantReader will be retired, but I'll include a fix for this in the next version.