
int overflow for big variant lines in vcf

Open · arnaldurgylfason opened this issue · 4 comments

We have a variant with more than 50 alleles for almost 500,000 samples. The line in question is over 3 GB in size, and neither bcftools nor our own program, which uses htslib, can read it. After debugging (at least for .vcf rather than .vcf.gz), I noticed that hts_getline appeared to read the line and returned the number of bytes read, which was over 3 GB. Because the return value was captured in a 4-byte int in _reader_fill_buffer in synced_bcf_reader.c, it overflowed to a negative number, which was interpreted as a failure. As bigger datasets become more common, this problem will occur more often, and I'm sure the same overflow exists in other places. I would consider it important to solve this so that htslib can handle the large lines we have today and the even larger ones of the near future. Changing these values to 8-byte integers would solve it; note that size_t, which handles this, is already used in many places.
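For illustration only (this is not htslib code), the sketch below shows the narrowing described above: a line length just over 3 GiB fits comfortably in a 64-bit value, but once captured in a 4-byte int it becomes negative on typical two's-complement platforms, and downstream code then reads it as a failure.

```c
#include <inttypes.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* A line of just over 3 GiB, as in the report. */
    int64_t line_len = 3LL * 1024 * 1024 * 1024 + 123;

    /* Capturing it in a 4-byte int, as _reader_fill_buffer() effectively
     * did, truncates it; on typical two's-complement platforms the result
     * is negative. */
    int as_int = (int) line_len;

    printf("actual length : %" PRId64 " bytes\n", line_len);
    printf("INT_MAX       : %d\n", INT_MAX);
    printf("stored in int : %d\n", as_int);

    if (as_int < 0)
        printf("=> a successful read of a long line looks like an error\n");
    return 0;
}
```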

arnaldurgylfason · Dec 16 '22

See also samtools/bcftools#1614 and #1448.

jmarshall · Dec 16 '22

Thanks. We will start using FORMAT/LPL and FORMAT/LAA for variants with many alleles. I still think htslib should be able to handle bigger lines, though.

arnaldurgylfason · Dec 19 '22

It does look like bgzf_getline() has the same overflow problem. It should be easy to fix in the same way as #1448 did for uncompressed reads.
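As a rough sketch of that kind of change (the actual #1448 patch and any bgzf_getline() fix may be structured differently; the reader function below is hypothetical), the pattern is to carry the length in a 64-bit type end to end, so only genuine errors produce a negative value:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical reader standing in for a getline-style call: it reports
 * the number of bytes in the line it just read. */
static int64_t read_line64(void) {
    return 3LL * 1024 * 1024 * 1024;        /* a ~3 GiB line */
}

/* Overflow-prone caller: the length is narrowed to int before the check. */
static int fill_buffer_old(void) {
    int n = (int) read_line64();            /* wraps negative above INT_MAX */
    return n < 0 ? -1 : 0;                  /* long line mistaken for error */
}

/* Fixed caller: the length stays 64-bit; int is used only for status. */
static int fill_buffer_new(int64_t *len_out) {
    int64_t n = read_line64();
    if (n < 0) return -1;                   /* only real errors are negative */
    *len_out = n;
    return 0;
}

int main(void) {
    int64_t len = 0;
    printf("old caller: %s\n", fill_buffer_old() == 0 ? "ok" : "failure");
    printf("new caller: %s (len=%" PRId64 ")\n",
           fill_buffer_new(&len) == 0 ? "ok" : "failure", len);
    return 0;
}
```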

Note that this may not buy you much due to other limitations in the library, for example bcf_fmt_t::p_len and bcf_fmt_t::p_off. As these are part of the public interface, we can't easily change them. I'm also not entirely sure we'd want to: by the time individual records reach multi-gigabyte sizes, you have to ask whether you've hit the limit of what can sensibly be done with flat files and consider more database-oriented solutions (for example, Hail).

daviesrob · Jan 03 '23

It would be nice to have htslib print a warning and skip such entries. We cannot fully support such large records, as that breaks too many things and is not worth it. The feasible solution was proposed above: generate LPL and LAA instead.

pd3 · Jan 03 '23