
rethink caching approach for tabix files

Open alexg9010 opened this issue 4 years ago • 3 comments

The caching of uncompressed files introduced in version 1.13.2 seems to cause problems when a user works with tabix files for more than one CpG context and wants to convert them into in-memory objects. See here: https://groups.google.com/g/methylkit_discussion/c/UruFjvX89B4/m/_aMsqBC-DwAJ

The problem is caused by the fread.gzipped() function and arises whenever two tabix files with the same basename are used in the same session. The first time a tabix file is read with fread.gzipped(), it is uncompressed and stored in a session-specific temporary location; subsequent calls to fread.gzipped() reuse that cached uncompressed file. If another tabix file with the same basename is then read, the cached file from the first one is returned instead. Unfortunately this happens silently: missing rows are filled with NAs, which can cause unnoticed issues downstream.
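The failure mode described above can be reproduced with a minimal sketch. The function `read_cached_by_basename()` and the helper `write_gz()` are hypothetical stand-ins written for this illustration; they are not methylKit's actual `fread.gzipped()` code, only an assumption about how a basename-keyed cache behaves.

```r
# Hypothetical stand-in for a cache keyed only by basename (NOT methylKit code).
read_cached_by_basename <- function(gz_file, cache_dir = tempdir()) {
  cached <- file.path(cache_dir, sub("\\.(gz|bgz)$", "", basename(gz_file)))
  if (!file.exists(cached)) {
    # First call for this basename: decompress into the cache.
    con <- gzfile(gz_file, "rt")
    on.exit(close(con))
    writeLines(readLines(con), cached)
  }
  # Later calls reuse whatever is cached under this basename.
  readLines(cached)
}

# Hypothetical helper that writes a small gzip-compressed text file.
write_gz <- function(path, lines) {
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  con <- gzfile(path, "wt")
  on.exit(close(con))
  writeLines(lines, con)
  invisible(path)
}

root  <- tempfile("collision_demo_")
cpg   <- write_gz(file.path(root, "CpG", "sample1.txt.bgz"),
                  c("chr1\t100", "chr1\t200"))
chh   <- write_gz(file.path(root, "CHH", "sample1.txt.bgz"), "chr2\t300")
cache <- file.path(root, "cache")
dir.create(cache)

first  <- read_cached_by_basename(cpg, cache)
second <- read_cached_by_basename(chh, cache)

# TRUE: the CHH call silently returned the cached CpG rows.
identical(first, second)
```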

One (hopefully) simple way to mitigate this would be to calculate a hash of the compressed file and make it part of the cached file's name. However, we need to make sure to ignore the tabix file's header, if present.

  • [ ] calculate a hash of the compressed file
  • [ ] ignore the tabix file's header, if present
  • [ ] make the hash part of the cached file's name
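A minimal sketch of the first and last items, assuming an MD5 of the whole compressed file is an acceptable key. `hashed_cache_path()` and `write_gz_file()` are hypothetical names, not part of methylKit, and the sketch does not attempt the second item (skipping the tabix header), which stays open.

```r
# Hypothetical helper: build a cache path from the content hash of the
# compressed file plus its basename. The hash covers the whole file,
# header included (an assumption of this sketch).
hashed_cache_path <- function(gz_file, cache_dir = tempdir()) {
  stopifnot(file.exists(gz_file))
  hash <- unname(tools::md5sum(gz_file))
  stem <- sub("\\.(gz|bgz)$", "", basename(gz_file))
  file.path(cache_dir, paste0(hash, "_", stem))
}

# Hypothetical helper that writes a small gzip-compressed text file.
write_gz_file <- function(path, lines) {
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  con <- gzfile(path, "wt")
  on.exit(close(con))
  writeLines(lines, con)
  invisible(path)
}

root   <- tempfile("hash_demo_")
file_a <- write_gz_file(file.path(root, "CpG", "sample1.txt.bgz"),
                        c("chr1\t100", "chr1\t200"))
file_b <- write_gz_file(file.path(root, "CHH", "sample1.txt.bgz"), "chr2\t300")

path_a <- hashed_cache_path(file_a)
path_b <- hashed_cache_path(file_b)

# Same basename, different content: the cache paths no longer collide.
path_a != path_b
```

Hashing the file content rather than its full path means two copies of the same file share one cache entry, at the cost of reading the compressed file once per lookup.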

alexg9010, Jan 26 '21 15:01

This is a crazy bug :) Do you know of any alternatives, like DelayedMatrix, that would let us move away from tabix? Last time I checked, they didn't have the overlap functionality we use with tabix.


al2na, Jan 26 '21 18:01

The easiest fix for this bug would be to disable the caching of uncompressed files for now and simply overwrite the uncompressed file every time. But you are right, maybe it is time to switch to better-supported backends that are developed externally.
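As a rough sketch of the "always overwrite" option, assuming the uncompressed copy is still named after the basename: `decompress_fresh()` and `write_gz_tmp()` are hypothetical helpers for illustration and do not reflect how fread.gzipped() is actually implemented.

```r
# Hypothetical helper: decompress on every call, overwriting any earlier
# uncompressed file with the same basename (no cache lookup at all).
decompress_fresh <- function(gz_file, out_dir = tempdir()) {
  out <- file.path(out_dir, sub("\\.(gz|bgz)$", "", basename(gz_file)))
  con <- gzfile(gz_file, "rt")
  on.exit(close(con))
  writeLines(readLines(con), out)
  out
}

# Hypothetical helper that writes a small gzip-compressed text file.
write_gz_tmp <- function(path, lines) {
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  con <- gzfile(path, "wt")
  on.exit(close(con))
  writeLines(lines, con)
  invisible(path)
}

root    <- tempfile("overwrite_demo_")
cpg_gz  <- write_gz_tmp(file.path(root, "CpG", "sample1.txt.bgz"),
                        c("chr1\t100", "chr1\t200"))
chh_gz  <- write_gz_tmp(file.path(root, "CHH", "sample1.txt.bgz"), "chr2\t300")
out_dir <- file.path(root, "uncompressed")
dir.create(out_dir)

cpg_rows <- readLines(decompress_fresh(cpg_gz, out_dir))
chh_rows <- readLines(decompress_fresh(chh_gz, out_dir))
```

Each call pays the full decompression cost, but the rows returned always belong to the file that was requested.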

alexg9010, Jan 26 '21 20:01

Collecting some ideas:

matter package for Rapid prototyping with data on disk

The matter package is designed with several goals in mind. Like the bigmemory and ff packages, it seeks to make statistical methods scalable to larger-than-memory datasets by utilizing data-on-disk. Unlike those packages, it seeks to make domain-specific file formats (such as Analyze 7.5 and imzML for MS imaging experiments) accessible from disk directly without additional file conversion. It seeks to have a minimal memory footprint, and require minimal developer effort to use, while maintaining computational efficiency wherever possible.

(https://bioconductor.org/packages/3.12/bioc/vignettes/matter/inst/doc/matter.pdf) includes:

  • Principal components analysis for on-disk datasets
  • Linear regression for on-disk datasets

DelayedArray

(https://petehaitch.github.io/BioC2020_DelayedArray_workshop/articles/Effectively_using_the_DelayedArray_framework_for_users.html)

  • could be used with any future on-disk backend
  • developed by the Bioconductor team

alexg9010, Jan 26 '21 21:01