Rsamtools

Bug in scanTabix

Open timoast opened this issue 5 years ago • 4 comments

Hi,

I came across a bug in scanTabix where no data is returned when requesting regions on double-digit chromosomes (i.e., chr10 and above). This only appears to be an issue on Windows, and only when the tabix file is above a certain size.

Here is a tabix file and index that will reproduce the issue. Apologies for the huge file; I tried downsampling, but the bug only seems to occur with the larger file.

Reproducible example:

library(Rsamtools)
library(GenomicRanges)
library(IRanges)

tbx.file <- "fragments.tsv.gz"
range.chr14 <- GRanges(seqnames = 'chr14', ranges = IRanges(start = 99635624, end = 99737861))
tbx <- TabixFile(file = tbx.file)
scanTabix(file = tbx, param = range.chr14)

This code returns data on macOS or Linux but an empty vector on Windows (I tested on Windows 7 with R 3.6.1 and the current version of Rsamtools).
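To narrow a report like this down, it can help to record whether the requested chromosome is present in the index at all and how many records come back on each platform. The following is a minimal sketch, not part of the original report: `check_tabix_region` is a hypothetical helper name, and it only wraps the existing `TabixFile`, `seqnamesTabix`, and `scanTabix` calls.

```r
library(Rsamtools)
library(GenomicRanges)

# Hypothetical helper (not part of Rsamtools): summarise what scanTabix
# returns for a single region, together with basic platform information.
check_tabix_region <- function(path, region) {
  tbx <- TabixFile(path)
  indexed <- seqnamesTabix(tbx)          # sequence names stored in the index
  res <- scanTabix(tbx, param = region)  # list with one element per range
  list(
    platform        = R.version$platform,
    rsamtools       = as.character(packageVersion("Rsamtools")),
    region_in_index = as.character(seqnames(region)) %in% indexed,
    records         = lengths(res)       # 0 on an affected platform
  )
}

# Usage with the file and region from the example above:
# region <- GRanges("chr14", IRanges(start = 99635624, end = 99737861))
# str(check_tabix_region("fragments.tsv.gz", region))
```

If `region_in_index` is `TRUE` but `records` is 0 on Windows and non-zero elsewhere, that would be consistent with the behaviour described here.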

timoast, Jul 24 '19 20:07

Was this resolved? I'm wondering if perhaps some of the other errors I'm experiencing are related to this (will post those soon).

bschilder, Mar 19 '22 14:03

I think this is likely an integer overflow on Windows; I wonder if this occurs under the 64-bit build, especially under R-devel? This seems to be a regression introduced when we moved to using Rhtslib, but that transition is now quite old, and the right thing to do is to update Rhtslib and then Rsamtools. Unfortunately, that is likely to be a moderate-to-big project, so in the short to intermediate term the solution is likely to be using Linux or macOS, e.g., via the Windows Subsystem for Linux, your local compute cluster, or a cloud provider.
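For cases where switching platform is not practical, one stopgap is to bypass the index and filter the compressed file directly in base R. This is a minimal sketch under stated assumptions, not something proposed in the thread: `scan_region_fallback` is a hypothetical helper, it assumes a tab-separated file with chromosome, start, and end in the first three columns, it assumes `gzfile()` can read the bgzip-compressed file, and it uses a simple overlap test that may differ from tabix by one position at region boundaries. It reads the whole file, so it will be much slower than an indexed query.

```r
# Hypothetical fallback: scan a (b)gzipped TSV without using the tabix index.
# Assumes columns 1-3 hold chromosome, start and end.
scan_region_fallback <- function(path, chrom, start, end, chunk_size = 100000L) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  hits <- character(0)
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break                 # end of file
    # Drop blank lines and header/comment lines
    lines <- lines[nzchar(lines) & !startsWith(lines, "#")]
    if (length(lines) == 0L) next
    parts <- strsplit(lines, "\t", fixed = TRUE)
    chr <- vapply(parts, `[`, character(1), 1L)
    keep <- !is.na(chr) & chr == chrom
    if (any(keep)) {
      s <- as.numeric(vapply(parts[keep], `[`, character(1), 2L))
      e <- as.numeric(vapply(parts[keep], `[`, character(1), 3L))
      sel <- which(e >= start & s <= end)          # simple overlap test
      hits <- c(hits, lines[keep][sel])
    }
  }
  hits
}

# Usage with the region from the original report:
# res <- scan_region_fallback("fragments.tsv.gz", "chr14", 99635624, 99737861)
```

The result is a character vector of matching lines, similar in shape to one element of the list that `scanTabix` returns.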

mtmorgan, Mar 19 '22 14:03

Thanks for the reply @mtmorgan, that's quite understandable.

Along those lines, an intermediate solution might be to use the Bioconductor Docker container, which is Linux-based and includes an RStudio interface. We use this as a base for most of our Docker containers.

bschilder, Mar 19 '22 15:03

Reminds me of this Rhtslib Windows-specific bug from 2.5 years ago: https://support.bioconductor.org/p/124568/

Yes, Rhtslib still contains HTSlib 1.7, which lags 4 years behind the latest HTSlib (version 1.15). The right thing to do at this point would be to update Rhtslib. Hopefully that Windows-specific Tabix bug is gone in HTSlib 1.15. However, as Martin said, this is a major endeavor. Not before BioC 3.16.

H.

hpages, Mar 23 '22 07:03