ATACseqQC
Issue: loading a BAM file with readBamFile uses a huge amount of memory, causing crashes
Hi, I've been trying to get this QC package working as part of an ATAC-seq pipeline I've been building. I have followed the instructions from the ATACseqQC vignette on Bioconductor, and I've been able to get plots etc. working for individual chromosomes (I am working with zebrafish danRer10).
However, performing QC on a whole sample (a single BAM file) across all chromosomes 1-25 results in very heavy memory use: I ran readBamFile on a ~9 GB BAM input file, and about 30 minutes later my system reported 38/64 GB (80.5%) of available memory in use!
Is this intended behaviour? I've tried running my pipeline, which uses part of ATACseqQC, only to have it crash at the last step because the R process was killed for memory overconsumption. The system isn't exactly tiny at 64 GB of RAM, yet it comes close to crashing on just one sample.
I am able to cause this on my system using the following standalone code:
library(ATACseqQC)
library(BSgenome.Drerio.UCSC.danRer10)
bamfile <- "path_to_bam_file.bam"  # indexed BAM, ~9.5 GB
fragSize <- fragSizeDist(bamfile, "test")
seqlev <- paste0("chr", 1:25)
tags <- c("AS", "XN", "XM", "XO", "XG", "NM", "MD", "YS", "YT")
which <- as(seqinfo(Drerio)[seqlev], "GRanges")
gal <- readBamFile(bamfile, tag = tags, which = which, asMates = TRUE)
My BAM file is properly indexed and is 9524 MB in size.
I am using version 1.2 of ATACseqQC on R 3.4.3. Please let me know if this has been addressed in the latest version of ATACseqQC.
Thanks, Stephen
Hi Stephen,
Thank you for trying ATACseqQC. Indeed, this is a big issue with ATACseqQC, and it is on my schedule. At the moment the function readBamFile reads the whole BAM file into memory; this will be changed in the future.
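In the meantime, one possible workaround (a minimal sketch, assuming your QC steps can be run chromosome by chromosome and that memory scales with the ranges requested in `which`) is to process one chromosome at a time:
library(ATACseqQC)
library(BSgenome.Drerio.UCSC.danRer10)
bamfile <- "path_to_bam_file.bam"  # placeholder; your indexed BAM
tags <- c("AS", "XN", "XM", "XO", "XG", "NM", "MD", "YS", "YT")
for (chr in paste0("chr", 1:25)) {
  which <- as(seqinfo(Drerio)[chr], "GRanges")  # restrict to one chromosome
  gal <- readBamFile(bamfile, tag = tags, which = which, asMates = TRUE)
  ## ... run the per-chromosome QC of interest on gal here ...
  rm(gal)
  gc()  # release memory before the next chromosome
}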
Jianhong
Thanks, Jianhong. Just wondering if you'd be able to provide an ETA on this at all? Otherwise, the package looks like exactly what we need for our purposes.
Hi Stephen,
It is hard for me to provide an ETA, but I will do it as soon as possible.
Jianhong.
Hi Stephen,
I have tried to fix the memory issue in ATACseqQC. Please try the newest development version to see the changes.
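One way to grab the development version (a sketch assuming a recent BiocManager with remotes available; R 3.4-era setups would use biocLite or devtools::install_github instead):
library(BiocManager)
install("jianhong/ATACseqQC")  # development version straight from GitHub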
Jianhong.
Hi Jianhong, thanks for that! Trying it out now.
I have also found this seemingly handy package very memory-intensive, which makes it almost inapplicable to whole-genome data! I particularly tried its splitBam function with a few BAM files ranging from 2-4 GB in size. For each file I requested 100 GB of memory, and after a few hours they all failed because the machine had reached its maximum. I just don't understand why this is the case. The version I tried was 1.6.2, which I installed from here.
Hi hkoohy, try running splitBam without providing the conservation argument to see if it still exhausts the memory. I will try to figure out this issue.
Thanks for your quick reply. This was indeed without the conservation argument.
Did you try setting bigFile to TRUE when calling readBamFile, like this?
gal <- readBamFile(bamfile, tag = tags, which = which, asMates = TRUE, bigFile = TRUE)
Thanks again for taking the time to look into this. No, this was without it; it is suggested in the manual as an option, but I was trying to do the shifting, splitting and saving in one go and therefore just called splitBam, which doesn't expose any options for bigFile or asMates. Here is the full code in case it helps:
#!/usr/bin/env Rscript
args = commandArgs(trailingOnly = TRUE)
# test if there is at least one argument: if not, return an error
if (length(args) == 0) {
  stop("At least one argument must be supplied (full path to input bam file).\n", call. = FALSE)
} else if (length(args) == 1) {
  this_bam = args[1]
  print(this_bam)
} else {
  stop("Too many arguments, don't know what to do with them\n", call. = FALSE)
}
library(ATACseqQC)
library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(phastCons100way.UCSC.hg19)
########### functions #############
get_lables <- function(fname_str) {
  features <- unlist(strsplit(fname_str, '_'))
  return(features[2])
}
##################
this_bam <- 'path_to_bam_file.bam'  # note: this overwrites the command-line argument above
base_name <- basename(this_bam)
label <- get_lables(base_name)
dir_name <- dirname(this_bam)
out_path <- paste(dir_name, '/', label, '_splitted', sep = '')
dir.create(out_path)
tags <- c("AS", "XN", "XM", "XO", "XG", "NM", "MD", "YS", "YT")
txs <- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene)
genome <- Hsapiens
objs <- splitBam(this_bam, tags = tags, outPath = out_path, txs = txs, genome = genome)
I got it. I did not expose it as an argument for splitBam. Try changing
objs <- splitBam(this_bam, tags = tags, outPath = out_path, txs = txs, genome = genome)
into
gal <- readBamFile(bamfile, tag = tags, which = which, asMates = TRUE, bigFile = TRUE)
gal1 <- shiftGAlignmentsList(gal)
shiftedBamfile <- file.path(outPath, "shifted.bam")
export(gal1, shiftedBamfile)
objs <- splitGAlignmentsByCut(gal1, txs = txs, genome = genome)
writeListOfGAlignments(objs, outPath)
I will fix this in the development version.
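For completeness, here is roughly how those lines could slot into the script above (an untested sketch: the seqlev/which definition is an assumption for standard hg19 chromosomes, and it reuses this_bam, tags, txs, genome and out_path from the script; export() comes from rtracklayer):
library(rtracklayer)  # export() writes the shifted BAM
seqlev <- paste0("chr", c(1:22, "X", "Y"))  # assumption: standard hg19 chromosomes
which <- as(seqinfo(Hsapiens)[seqlev], "GRanges")
gal <- readBamFile(this_bam, tag = tags, which = which,
                   asMates = TRUE, bigFile = TRUE)
gal1 <- shiftGAlignmentsList(gal)
export(gal1, file.path(out_path, "shifted.bam"))
objs <- splitGAlignmentsByCut(gal1, txs = txs, genome = genome)
writeListOfGAlignments(objs, out_path)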
Thanks, I will do it tomorrow or shortly after, and come back to you if there is any problem. Thanks again.
You are very welcome. Memory cost is a big issue for this package. I am trying to fix it; however, I don't have a timeline yet.
Hi jianhong, I am afraid I have to inform you that this also failed on a machine with 64 GB of memory! Perhaps I should look through your code, but what I really don't understand is this: starting with a ~4 GB BAM file, even if it keeps everything in memory, 64 GB should still be enough. In other words, how does it manage to eat this much memory? It is just reading a sorted BAM file, perhaps with Rsamtools, reading it record by record and then separating reads based on their fragment sizes.
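For a rough sanity check on where the memory goes (a back-of-envelope sketch; the ~400 bytes-per-alignment figure is an assumption, since BAM is BGZF-compressed on disk and each alignment expands into several R objects once mates and the nine tags are attached):
library(Rsamtools)
stats <- idxstatsBam("path_to_bam_file.bam")  # placeholder path; counts come from the BAM index
n_reads <- sum(stats$mapped)
## ~400 bytes per alignment is an assumption: GAlignments core fields,
## mate pairing, and nine character-tag columns all add per-read overhead
est_gb <- n_reads * 400 / 1024^3
est_gb  # can exceed the ~4 GB on-disk (compressed) size many times over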