
Out of memory error in shiftFootprints function

Open josefinaperalba opened this issue 1 year ago • 14 comments

Hi! I'm running the following code with some big bam files:

bam_file <- fimport(
    opt$bam
)


offsets <- ORFik::detectRibosomeShifts(
    footprints = bam_file,
    txdb = opt$ref_db
    #minCDS = 20,
    #top_tx = 100
)

bam_file_shifted <- ORFik::shiftFootprints(
    footprints = bam_file,
    shifts = offsets
)

export.ofst(
    bam_file_shifted,
    file = opt$out_ofst
)

export.ofst(
    convertToOneBasedRanges(
        bam_file_shifted,
        addScoreColumn = TRUE,
        addSizeColumn = TRUE
    ),
    file = opt$out_ofst_ww
)

But I'm getting the following error: 'R_Realloc' could not re-allocate memory (18446744065128005632 bytes). Is there a way to optimize this so it doesn't consume this much memory? I'm running this in order to use the shifted files with the ORFik features function.

Thank you so much!

josefinaperalba avatar May 24 '24 16:05 josefinaperalba

18446744065128005632 bytes = 18 446 744.1 terabytes (18 Million terabytes!)

What are you loading? :D And at what step does this error happen?

Roleren avatar May 27 '24 12:05 Roleren

Hi! My bam file is 39 GB and, as you mentioned, the error says it failed to reallocate 18446744065128005632 bytes. This is the step that is failing:

offsets <- ORFik::detectRibosomeShifts(
    footprints = bam_file,
    txdb = ref_db
    #minCDS = 20,
    #top_tx = 100
)

It fails at the fimport step. I have another bam file of 24 GB that worked; is there a way to optimize this?

ghost avatar May 29 '24 14:05 ghost

First thing to try: don't load the bam file before you insert it into ORFik::detectRibosomeShifts, i.e. change this:

# instead of loading the whole bam into memory first:
# bam_file <- fimport(
#     opt$bam
# )
# just keep the path and pass that on:
bam_file <- opt$bam

Since R does not pass objects by reference into functions, you can end up with 3 copies of the bam file in memory if you load it first, but only 1 if you just give the path.
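
In other words, something like this (a minimal sketch; opt$bam and opt$ref_db are just the placeholders from your script, and it assumes detectRibosomeShifts takes the file path directly as footprints):

offsets <- ORFik::detectRibosomeShifts(
    footprints = opt$bam,  # pass the path, let the function handle the loading
    txdb = opt$ref_db
)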

Roleren avatar May 30 '24 10:05 Roleren

Okay, but in that case I need to load it before the shiftFootprints function, right? That function doesn't allow a path as input, or is there a more memory-efficient way to do it? Would doing this before shiftFootprints work?

txdb <- loadTxdb(ref_db)

cds <- loadRegion(txdb, part = "cds", names.keep = txNames)

bam_file <- fimport(opt$bam, cds)

ghost avatar May 31 '24 19:05 ghost

True, I will update the function to support file path input and will do some tests on how much memory this saves :)

Roleren avatar Jun 07 '24 18:06 Roleren

Awesome! Thank you so much!! Sorry for the multiple questions, but I'm also aligning Ribo-seq reads with STAR and I was wondering what this package's strategy is for dealing with multimappers. I saw in the package's STAR.align functions that the default value for the max.multimap parameter is 10. Is there a logic behind this value? What strategy is recommended?

ghost avatar Jun 10 '24 11:06 ghost

10 is a good default for human; it depends on the species, but don't touch it unless you have a good reason :)
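
For reference, the call would look roughly like this (a hedged sketch; only max.multimap and its default of 10 come from this discussion, the other argument names and paths are illustrative assumptions):

ORFik::STAR.align.single(
    file1 = "ribo_sample.fastq.gz",
    output.dir = "aligned/",
    index.dir = "STAR_index/",
    max.multimap = 10  # the default; change only with a good reason
)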

Roleren avatar Jun 10 '24 13:06 Roleren

Pseudogenes that are translated are a very common source of multimappers, as are microRNAs mapping to 3' trailers.

Roleren avatar Jun 10 '24 13:06 Roleren

BTW, could you do one thing first? I need to see if it fails here or not:

ORFik:::convert_bam_to_ofst(df)

Then run using the ofst files you get as output. The point is that these are collapsed and compressed to fst; I always make a collapsed ofst of the bam BEFORE I input it into the detectRibosomeShifts function.

Since bam is a very old format, the bam file should only be loaded into R once, that is, when calling convert_bam_to_ofst; after that, only use the bam to load custom flags (like the multimapper flag etc.).

Especially when you have a 39GB bam file, which is unusually large.
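
Roughly like this (a hedged sketch of the workflow above; convert_bam_to_ofst is the internal function mentioned, while the experiment name, the "ofst" type for filepath() and the ref_db object are placeholder assumptions):

library(ORFik)
df <- read.experiment("my_riboseq_experiment")   # your ORFik experiment
ORFik:::convert_bam_to_ofst(df)                  # collapse + compress the bams to .ofst once
ofst_reads <- fimport(filepath(df, "ofst")[1])   # load the small .ofst instead of the bam
offsets <- detectRibosomeShifts(ofst_reads, txdb = ref_db)
shifted <- shiftFootprints(ofst_reads, shifts = offsets)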

Roleren avatar Jun 11 '24 07:06 Roleren

Hi! Thanks for all your responses! This strategy of converting to ofst before shifting worked for all the files smaller than 30GB, so I'll just work with that limit. Thank you so much again!

ghost avatar Jun 13 '24 15:06 ghost

So to handle the biggest file you can do this trick (if the error comes from loading the full 39GB bam into memory):

# Goal: read in 2 halves, collapse duplicates, save 2 files, read in both, collapse the merge of these two and save out again.
library(ORFik)
df <- ORFik.template.experiment.zf() # Use your experiment here
bam_paths <- filepath(df, "default")
index_of_biggest_file <- which.max(file.size(bam_paths))
big_bam_path <- bam_paths[index_of_biggest_file]
number_of_reads_in_bam <- length(unlist(scanBam(big_bam_path, param = ScanBamParam(what = "qname")),
                      use.names = FALSE))

bamfile <- BamFile(big_bam_path, yieldSize = ceiling(number_of_reads_in_bam/2))
all_temp_files <- c()
open(bamfile)
while (length(chunk0 <- readGAlignments(bamfile))) {
  ofst_big_path_temp <- tempfile(fileext = ".ofst")
  message(ofst_big_path_temp)
  load_half_bam <- collapseDuplicatedReads(chunk0, # Read half file
                                           addScoreColumn = TRUE)
  export.ofst(load_half_bam, ofst_big_path_temp)
  rm(load_half_bam) # remove subset
  all_temp_files <- c(all_temp_files, ofst_big_path_temp)
}
close(bamfile)

# Merge and export
# Sadly, ofst does not support appending files on disk, so we need to read in both halves and save again
subsets <- lapply(all_temp_files, function(path) {
  return(fimport(path))
})
lengths(subsets) # New lengths

ofst_basename <- paste0(ORFik:::remove.file_ext(big_bam_path, TRUE), ".ofst")
ofst_big_path <- file.path(dirname(big_bam_path), "ofst", ofst_basename)
combined <- collapseDuplicatedReads(sort(unlist(GAlignmentsList(subsets))), 
                                    addScoreColumn = TRUE)
number_of_reads_in_bam - sum(lengths(subsets)) # reads collapsed in chunks
sum(lengths(subsets)) - length(combined) # reads collapsed final vs chunks

export.ofst(combined, ofst_big_path) # Now you have the correct OFST

Roleren avatar Jun 14 '24 11:06 Roleren

Hi! Sorry to bother you, but I saw that you committed an update where you fixed the bug in shiftFootprints, and I wanted to check if that already makes it possible for the function to accept a path as input. The reason I ask is that I've tried running it again with the path, and the error I get comes from the readWidths function:

Error in readWidths(footprints, along.reference = TRUE) : reads must be either GRanges, GAlignments, GAlignmentPairs or covRleList

I'm not sure if the problem is how I'm installing the package (maybe I'm not using the latest version) or if the commit I saw is not the definitive solution.

I'm installing the package in a Docker image whose Dockerfile is the following:

# Use the latest Rocker R image
FROM r-base:latest

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    libcurl4-openssl-dev \
    libssl-dev \
    libxml2-dev \
    libz-dev \
    liblzma-dev \
    libbz2-dev \
    libpng-dev \
    libgit2-dev \
    procps

# Install CRAN packages
RUN install2.r -e \
    BiocManager \
    optparse \
    tidyr \
    parallel \
    dplyr \
    remotes

# Install Bioconductor packages
RUN Rscript -e 'BiocManager::install(c("GenomicFeatures", "GenomicAlignments"), ask=F);'

# Install the ORFik package (development version from GitHub)
RUN Rscript -e 'remotes::install_github("Roleren/ORFik")'

Thank you so much!

ghost avatar Jun 25 '24 21:06 ghost

Hey, try the newest commit; I documented it and it should now work as you intended :)
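
If you want to verify that the image really picked up the newest commit, a quick check in an R session inside the container is enough (nothing ORFik-specific assumed here beyond the package name):

packageVersion("ORFik")  # compare against the version in the DESCRIPTION on GitHub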

Roleren avatar Jun 26 '24 10:06 Roleren

Hi! Everything is working as expected, but I just wanted to ask a question about how the code works. If I downsize my bam files, let's say for example I only use certain regions of interest to filter the bam file, would the code still work? Would this make sense? What if I only do it for bigger bam files where the function crashes (bigger than 25GB)?

ghost avatar Jul 22 '24 13:07 ghost

Hello, sorry missed this message.

Do you have a 25GB bam file? I.e., you have not collapsed duplicate reads then?

What I usually do is collapse duplicate reads before alignment (use ORFik::collapse.fastq()); then you will be able to load the whole file, I believe, as it will go from 25GB to ~0.2GB, I guess.
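
Something like this before alignment (a hedged sketch; only collapse.fastq itself comes from this thread, the file name and output directory are placeholder assumptions):

ORFik::collapse.fastq("ribo_sample.fastq.gz", outdir = "collapsed_fastq/")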

Roleren avatar Sep 20 '24 16:09 Roleren

Let me know how it works, will close issue on Monday if there was nothing else :)

Roleren avatar Sep 25 '24 19:09 Roleren