
Large number of BAM files leads to Error in `vec_interleave_indices()`:

NikoLichi opened this issue 1 year ago • 5 comments

Dear Bambu team,

I am running a massive project with 480 BAM files (~4.8 TB of data in total). Following the previous suggestion for Bambu, I am first running the annotation extension (quant = FALSE), with the idea of running the quantification later in batches.

However, there is a major issue when starting to extend the annotations:

--- Start extending annotations ---
Error in `vec_interleave_indices()`:
! Long vectors are not yet supported in `vec_interleave()`. Result from interleaving would have size 8857886400, which is larger than the maximum supported size of 2^31 - 1.
Backtrace:
     ▆
  1. ├─bambu::bambu(...)
  2. │ └─bambu:::bambu.extendAnnotations(...)
  3. │   └─bambu:::isore.combineTranscriptCandidates(...)
  4. │     ├─... %>% data.table()
  5. │     └─bambu:::combineSplicedTranscriptModels(...)
  6. │       └─bambu:::updateStartEndReadCount(combinedFeatureTibble)
  7. │         └─... %>% mutate(sumReadCount = sum(readCount, na.rm = TRUE))
  8. ├─data.table::data.table(.)
  9. ├─dplyr::mutate(., sumReadCount = sum(readCount, na.rm = TRUE))
 10. ├─dplyr::group_by(., rowID)
 11. ├─tidyr::pivot_longer(...)
 12. ├─tidyr:::pivot_longer.data.frame(...)
 13. │ └─tidyr::pivot_longer_spec(...)
 14. │   └─vctrs::vec_interleave(!!!val_cols, .ptype = val_type)
 15. │     └─vctrs:::vec_interleave_indices(n, size)
 16. └─rlang::abort(message = message)
Execution halted

Is there anything I could do to run Bambu?

My code looks like:

BAMlist = BAMs_one_per_Line
fa.file <- "/refData/release46/GRCh38.primary_assembly.genome.fa"
gtf.file <-  "/refData/release46/gencode.v46.primary_assembly.annotation.gtf"
bambuAnnotations <- prepareAnnotations(gtf.file)

extendedAnnotations = bambu(reads = BAMlist, annotations = bambuAnnotations, genome = fa.file, quant = FALSE, lowMemory=T, ncore = 14, rcOutDir="MY_PATH/bambu_20241015_all")

As an additional note, I also get the same warning message that others have reported in issue #407.

This is with R 4.3.2 and Bioc 3.18 / bambu (3.4.1). Platform: x86_64-conda-linux-gnu (64-bit)

All the best, Niko

NikoLichi avatar Oct 17 '24 08:10 NikoLichi

Hello there,

I made some progress with a different approach, but bambu still fails with the same error.

I divided the data into 3 batches to obtain the extended annotations as .rds files. I set it up as below for each of the 3 batches:

extendedAnnotations = bambu(reads = BAMs_one_per_Line, annotations = bambuAnnotations, genome = fa.file, NDR = 0.1, quant = FALSE, lowMemory=T, ncore = 14, rcOutDir="MY_DIR")

As a small note, I first tested with the default NDR, and each batch had an NDR > 0.6; since I am working with human samples, I decided to fix NDR = 0.1.

After this, I put together the .rds files from each of the three bambu runs by listing the .rds files for each run in a vector and merging them like:

B01 <- Myfiles01
B02 <- Myfiles02
B03 <- Myfiles03

B_allRDs <- c(B01,B02,B03)

mergedAnno = bambu(reads = B_allRDs, genome = fa.file, annotations = bambuAnnotations, quant = FALSE)

But then I got the same error as above; the first lines are:

--- Start extending annotations ---
Error in `vec_interleave_indices()`:
! Long vectors are not yet supported in `vec_interleave()`. Result from interleaving would have size 17715772800, which is larger than the maximum supported size of 2^31 - 1.

I would appreciate any help with running this massive data set. Best, Niko

NikoLichi avatar Oct 24 '24 08:10 NikoLichi

Hi @NikoLichi ,

sorry for the late reply; we have been looking into the issue and have pushed a fix in a separate branch: https://github.com/GoekeLab/bambu/tree/patch_bigsamples

where we have rewritten the code relevant to the error.

Could you please try it out and see if it works? If not, could you post the complete error output so that we can investigate further?

To use this branch:

devtools::install_github("GoekeLab/bambu", ref = "patch_bigsamples")
library(bambu)

Thank you! Kind regards, Ying

cying111 avatar Oct 25 '24 01:10 cying111

Hi @cying111,

Thanks for your reply.

I did two tests with the new installation.

  1. Bambu annotation with all samples at once.
  2. Bambu merging RDS annotation from 3 batches.

Both had the same error, which is a little strange: it could not find a function...

--- Start extending annotations ---
Error in readCountWeightedMedian(.SD, c("start.1", "start.10", "start.100",  :
  could not find function "readCountWeightedMedian"
Calls: bambu ... combineSplicedTranscriptModels -> updateStartEndReadCount -> [ -> [.data.table
Execution halted

Kind regards, NiKo

NikoLichi avatar Oct 28 '24 08:10 NikoLichi

Hi @NikoLichi ,

sorry for the bug. I have pushed a fix adding the missing function to the patch_bigsamples branch.

Could you try again and see if it works?

Update: I pushed one more fix for a typo just now; please update your branch and try again. Thanks a lot!

Thank you! Warm regards, Ying

cying111 avatar Oct 28 '24 08:10 cying111

Hi @cying111,

Alright, after running for 1 day and 13 hours, this is the error that both of my jobs/approaches produce:

--- Start extending annotations ---
Error in `[.data.table`(startEndDt, combinedFeatureTibble[, .(intronStarts,  :
  When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.
Calls: bambu ... [ -> [.data.table -> stopf -> raise_condition -> signal
In addition: Warning message:
In `[.data.table`(startEndDt, combinedFeatureTibble[, .(intronStarts,  :
  Ignoring by/keyby because 'j' is not supplied
Execution halted

I hope this helps. Warm regards, NiKo

NikoLichi avatar Oct 30 '24 09:10 NikoLichi

Hi @NikoLichi ,

It's great to hear back. The error means that the changes I made are working, but there was a small typo in the code of the version that you used; I pushed a fix shortly after I realized it.

Could you run this again using the most recent version, which should have the typo fixed?

devtools::install_github("GoekeLab/bambu", ref = "patch_bigsamples")
library(bambu)

I think this time it should work.

Hope to hear from you soon! Thank you. Kind regards, Ying

cying111 avatar Nov 01 '24 07:11 cying111

Hi @cying111,

Sorry for not getting back to you sooner. I was doing many tests and they took quite some time.

First of all, the current patch does not show the previous error, so it is solved! Thanks for it!

However, I found that Bambu is memory-hungry even with the lowMemory=T option. I would appreciate it if you could guide/correct my steps/code here.

I did two tests, first using a reduced data set (1 million reads per sample) to get the parameters ready for the full data set, like:

library(bambu)

BAMlist <- BAM_Files_Samples_Vector  # vector of BAM file paths, one per sample

fa.file <- "GRCh38.primary_assembly.genome.fa"
gtf.file <-  "gencode.v46.primary_assembly.annotation.gtf"
bambuAnnotations <- prepareAnnotations(gtf.file)

extendedAnnotations = bambu(reads = BAMlist, annotations = bambuAnnotations, genome = fa.file, NDR = 0.3, quant = FALSE, lowMemory=T, ncore = 14, rcOutDir="MY_DIR/Reduced_1MReads")

RDsFiles <- list.files("MY_DIR/Reduced_1MReads", pattern = "*.rds", full.names = T)

se.multiSample <- bambu(reads = RDsFiles, annotations = extendedAnnotations, genome = fa.file, ncore = 14, lowMemory=T, NDR=0.3,  discovery =F, quant =T)

This worked quite well. However, I would appreciate it if you could comment on the syntax, since I don't quite get why the read classes are used as reads and what should go in the annotations argument.

For the second approach, with all the data, I ran bambu in batches (i.e., 3), writing out the read classes and then calling bambu on them similarly to the above. First:

library(bambu)

BAMlist <- BAM_Files_Samples_Vector  # vector of BAM file paths, one per sample

fa.file <- "GRCh38.primary_assembly.genome.fa"
gtf.file <-  "gencode.v46.primary_assembly.annotation.gtf"
bambuAnnotations <- prepareAnnotations(gtf.file)

extendedAnnotations = bambu(reads = BAMlist, annotations = bambuAnnotations, genome = fa.file, NDR = 0.3, quant = FALSE, lowMemory=T, rcOutDir="MY_DIR/3batchs_1")

Later, when these were finished, I ran bambu again like:

# 1-3 batch RDS list
RDsFiles <- list.files("MY_DIR/3batchs_?", pattern = "*.rds", full.names = T)

mergedAnno = bambu(reads = RDsFiles, annotations = bambuAnnotations, genome = fa.file, lowMemory=T, NDR = 0.3, discovery =TRUE, quant = FALSE)

se.multiSample <- bambu(reads = RDsFiles, annotations = mergedAnno, genome = fa.file, ncore = 14, lowMemory=T, discovery =F, quant =T)

I have one last question: given Bambu's large RAM usage, do you recommend using the .rds read classes or the BAM files for the reads= argument in each step?

Sorry for my extensive piece of code and questions, but I want to be sure we are processing the data correctly.

Thanks again! Kind regards, Niko

NikoLichi avatar Nov 12 '24 14:11 NikoLichi

Hi @NikoLichi ,

Thanks for getting back with the good news, I am happy that the patch branch works.

For your first test using the subset data, the code all looks good to me.

extendedAnnotations = bambu(reads = BAMlist, annotations = bambuAnnotations, genome = fa.file, NDR = 0.3, quant = FALSE, lowMemory=T, ncore = 14, rcOutDir="MY_DIR/Reduced_1MReads")

This step actually does two things in one command: it processes the BAM files, performs junction correction, and collapses the reads into read classes, which are then saved as .rds files in the provided rcOutDir. And because discovery is TRUE by default, this step also returns the extended annotation object, which contains both the reference bambuAnnotations and the annotations newly discovered by Bambu.

se.multiSample <- bambu(reads = RDsFiles, annotations = extendedAnnotations, genome = fa.file, ncore = 14, lowMemory=T, NDR=0.3, discovery =F, quant =T)

In your second step, you can then just use the preprocessed .rds files as input, so the BAM files do not need to be processed again, and extendedAnnotations is supplied for quantification purposes. In this step, NDR=0.3 and lowMemory=T are actually not used, so you can leave them out. The rest is all correct and needed.
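
For reference, the trimmed call (same object names as above) would then look something like:

se.multiSample <- bambu(reads = RDsFiles, annotations = extendedAnnotations, genome = fa.file, ncore = 14, discovery = FALSE, quant = TRUE)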

For running all data together, I want to clarify with you a few details first:

extendedAnnotations = bambu(reads = BAMlist, annotations = bambuAnnotations, genome = fa.file, NDR = 0.3, quant = FALSE, lowMemory=T, rcOutDir="MY_DIR/3batchs_1")

For this step, which processes the BAM files in batches, you can actually set discovery = FALSE, which will save you some time. You can also leave NDR=0.3 out, so that this step only processes the BAM files. You could even do this per sample in a bash loop, since the BAM file processing is done per sample anyway. That is, you can loop through every sample in bash, i.e., 480 batches instead of just 3; this will help reduce memory usage (see the sketch below).
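
For illustration, a minimal sketch of that per-sample bash-mode idea (the script name process_one_sample.R, the output directory, and the bash loop are hypothetical; adjust to your setup):

# called from bash, e.g.: for bam in BAM_dir/*.bam; do Rscript process_one_sample.R "$bam"; done
library(bambu)
args <- commandArgs(trailingOnly = TRUE)
bamFile <- args[1]                      # one BAM file per invocation
fa.file <- "GRCh38.primary_assembly.genome.fa"
bambuAnnotations <- prepareAnnotations("gencode.v46.primary_assembly.annotation.gtf")
# only write the read-class .rds for this sample; no discovery, no quantification
bambu(reads = bamFile, annotations = bambuAnnotations, genome = fa.file,
      discovery = FALSE, quant = FALSE, rcOutDir = "MY_DIR/readClasses_per_sample")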

For the second parts onwards,

mergedAnno = bambu(reads = RDsFiles, annotations = bambuAnnotations, genome = fa.file, lowMemory=T, NDR = 0.3, discovery =TRUE, quant = FALSE)

This is the step where you perform discovery; the syntax is all correct. Did this step run successfully?

If so, for the quantification parts,

se.eachsample <- bambu(reads = RDsFiles[i], annotations = mergedAnno, genome = fa.file, ncore = 14, lowMemory=T, discovery =F, quant =T)

Similarly, I would suggest you loop through all the .rds files one by one in bash, so that quantification is done per sample, which will save a lot of memory. Because the annotations (mergedAnno) are now the same for all samples, the results can also be easily combined and compared.
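
For illustration, a minimal per-sample quantification loop (the output directory is hypothetical; each iteration could equally be submitted as its own bash job):

dir.create("MY_DIR/quant_per_sample", showWarnings = FALSE)
for (i in seq_along(RDsFiles)) {
  se.i <- bambu(reads = RDsFiles[i], annotations = mergedAnno, genome = fa.file,
                discovery = FALSE, quant = TRUE)
  # save each per-sample SummarizedExperiment so the results can be combined later
  saveRDS(se.i, file.path("MY_DIR/quant_per_sample", paste0("se_sample_", i, ".rds")))
}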

In this way, I believe you should be able to run it through. Hope this has clarified your questions!

Thank you, and let us know if you have any doubts regarding the above or if you need any help! All the best, Ying

cying111 avatar Nov 13 '24 09:11 cying111

Hi @cying111,

Thank you very much for your input and the details in the commands. It seems I can speed up the process a bit more with your comments!

mergedAnno = bambu(reads = RDsFiles, annotations = bambuAnnotations, genome = fa.file, lowMemory=T, NDR = 0.3, discovery =TRUE, quant = FALSE)

Yes, this step was the bottleneck in R, but with your patch it seems to work well. I only got a warning: Detected Bambu derived annotations in the annotations. Set a new prefix with opt.discovery(list(prefix='newPrefix')) to prevent ambigious id assignment.

You mentioned that when running quantification for each sample,

se.eachsample <- bambu(reads = RDsFiles[i], annotations = mergedAnno, genome = fa.file, ncore = 14, lowMemory=T, discovery =F, quant =T)

the results can easily be merged. So, in this case, would the RangedSummarizedExperiments be merged with a simple call like se.Allsamples <- c(se.eachsample_1, ..., se.eachsample_480), or something else?

All the best, Niko

NikoLichi avatar Nov 13 '24 10:11 NikoLichi

Hi @NikoLichi ,

Thanks for getting back so quickly, and I am happy that the mergedAnno step runs through.

The warning you see in the mergedAnno step suggests that bambuAnnotations already contains Bambu-discovered transcripts, which should not be the case, as Bambu discovery only happens in this step. Maybe you can double-check that bambuAnnotations is directly the output of prepareAnnotations on the GTF file?
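
For illustration, one quick way to check (a sketch that assumes Bambu's novel transcripts carry the default "Bambu" id prefix, e.g. BambuTx...):

bambuAnnotations <- prepareAnnotations(gtf.file)      # rebuild directly from the reference GTF
any(grepl("^Bambu", names(bambuAnnotations)))         # TRUE would suggest Bambu-derived ids are present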

For the second question, yes, the RangedSummarizedExperiments can be easily combined; it just needs a few tweaks because of the metadata:

seList <- list(se.eachsample_1, ..., se.eachsample_480)
incompatibleCounts <- lapply(seList, function(x) metadata(x)$incompatibleCounts)
incompatibleCounts <- Reduce(bambu:::merge_wrapper, incompatibleCounts)

This will keep the incompatible counts information

seAll <- do.call(cbind, lapply(seList, function(x) { metadata(x)$incompatibleCounts <- NULL; metadata(x)$warnings <- NULL; return(x) }))

This will combine all the SEs.
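
If it is useful downstream, a small follow-up sketch (the paths are hypothetical; writeBambuOutput is bambu's standard export helper):

metadata(seAll)$incompatibleCounts <- incompatibleCounts   # re-attach the merged incompatible counts
saveRDS(seAll, "MY_DIR/se_all_samples.rds")                # keep the combined object
writeBambuOutput(seAll, path = "MY_DIR/bambu_output")      # export count tables and extended annotations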

Let me know if you have questions regarding the above! All the best. Warm regards, Ying

cying111 avatar Nov 14 '24 03:11 cying111

Hello @cying111,

I’m encountering the same problem with a large dataset. Each of my samples was pre-processed into separate RDS files using Bambu v3.5.1 (quantification=FALSE, discovery=FALSE). Before attempting to run the patch_bigsamples branch mentioned here, I wanted to ask whether you anticipate any compatibility issues with RDS files generated by Bambu v3.5.1. Specifically:

  1. Could using these pre-processed RDS files with the patch_bigsamples branch affect discovery or quantification results?

  2. Is there any risk of getting results that appear valid but are actually incorrect due to a mismatch in file formats or internal data structures?

I plan to test this setup shortly, but I’m concerned that even if it runs without errors, there might be subtle compatibility problems. Any insights or guidance you can share would be greatly appreciated.

Thank you for your help!

bernardo-heberle avatar Jan 28 '25 13:01 bernardo-heberle

Hello @bernardo-heberle ,

Thanks for reaching out, and sorry for the late reply. I understand your concern about whether the results might be affected, but I can assure you that they remain unchanged. The modifications in this branch are merely a more efficient re-implementation of one function, ensuring it performs the same task while improving efficiency and memory usage.

Hope this clarifies your question! Warm regards, Ying

cying111 avatar Jan 31 '25 02:01 cying111

Thank you for your reply, @cying111.

I tried using the patch_bigsamples branch, but Bambu still exits with an error when I run my large dataset. The error is no longer the usual vec_interleave_indices() issue—instead, I’m seeing a memory exhausted error. This is not expected since I’m running on a node with 4 TB of RAM, and monitoring shows the job only ever peaks at about 400 GB of usage.

Data & Workflow Details:

  • Samples: ~140 samples, each sequenced on a full PromethION flowcell (~60 million primary alignments per sample).
  • Pre-processing: Each sample has been processed into an RDS file with Bambu individually.
  • Current Step (where the issue happens): I’m combining all RDS files to run Bambu with both discovery and quantification.

Parameters for Current Step:

  • lowMemory=TRUE
  • ncores = 1
  • trackReads = FALSE

Additional Context:

  • I’m attaching the verbose Bambu output showing the memory exhaustion error.
  • I’m also including the R scripts used for pre-processing and for the final discovery/quantification step.
  • The job ran for about 36 hours before failing.

Any help or suggestions on resolving this issue would be greatly appreciated. Please let me know if there’s any other information I can provide to assist in debugging.

bambu_memory_exhausted_error.txt

bambu_script_for_individual_sample_pre_processing.txt

bambu_script_for_discovery_and_quantification_using_all_pre_processed_samples.txt

bernardo-heberle avatar Feb 02 '25 17:02 bernardo-heberle

Hi @bernardo-heberle ,

I'm glad the previous error has been resolved, and I appreciate your detailed description of the context—it really helps me understand the situation.

The current error is a common one that occurs when processing multiple samples together for quantification. For your next steps, I recommend splitting the process into two stages:

  1. Combine all RDS files to run transcript discovery (i.e., with quant = FALSE).
  2. Once you obtain extendedAnnotations, run quantification separately for each RDS file using the same extendedAnnotations, with discovery = FALSE and quant = TRUE.

This follows the same approach I previously described for the quantification step.

If so, for the quantification parts,

se.eachsample <- bambu(reads = RDsFiles[i], annotations = mergedAnno, genome = fa.file, ncore = 14, lowMemory=T, discovery =F, quant =T)

Similarly, I would suggest you loop through all the .rds files one by one in bash, so that quantification is done per sample, which will save a lot of memory. Because the annotations (mergedAnno) are now the same for all samples, the results can also be easily combined and compared.

In this way, I believe you should be able to run it through. Hope this has clarified your questions!

Thank you, and let us know if you have any doubts regarding the above or if you need any help! All the best, Ying

Let me know if you need additional help! Thank you. Warm regards, Ying

cying111 avatar Feb 03 '25 02:02 cying111

Hi @cying111,

Thank you for the thoughtful reply, I really appreciate your help!

I tried the approach you suggested (or at least I think I followed your instructions correctly), but unfortunately it still failed with the same error. Here is the detailed description of what I did:

I used the patch_bigsamples branch again, but Bambu still exits with the same memory exhausted error even when setting quant=FALSE. I used a subsample of "only" 64 samples to test this time. I am running it on a node with 500GB of RAM, and monitoring shows the job only ever peaks at about 320 GB of usage.

Data & Workflow Details:

  • Samples: 64 samples, each sequenced on a full PromethION flowcell (~60 million primary alignments per sample).
  • Pre-processing: Each sample has been processed into an RDS file with Bambu individually (samples were individually processed the same way as before).
  • Current Step (where the issue happens): I’m combining all RDS files to run Bambu to perform discovery only.

Parameters for Current Step:

  • quant=FALSE
  • lowMemory=TRUE
  • ncores=1
  • trackReads=FALSE

Additional Context:

  • The job ran for about 26 hours before failing.

  • I attached the verbose Bambu output showing the memory exhaustion error.

  • I also attached the R script used for the discovery step, the sessionInfo() output, and the packageDescription(bambu) output. If it is helpful, you can download the Singularity container I am using to run the analysis with this command:

singularity pull --arch amd64 library://ebbertlab/nanopore_cdna/bambu_low_memory:2025-02-06

bambu_discovery_script_generating_error.txt

bambu_error_discovery_only.txt

sessionInfo().txt

packageDescription(bambu).txt

bernardo-heberle avatar Feb 10 '25 00:02 bernardo-heberle

Hi @bernardo-heberle ,

Thanks for the updates. It looks like the memory issue is an R out-of-memory error, which may be caused by the use of trackReads during data pre-processing.

You might want to process your RC file to remove readNames and readId, so the read class object remains manageable for further processing. You can use the following code to process each RC file:

rc <- readRDS(rc_file_ith_sample)
metadata(rc)$readNames <- NULL
metadata(rc)$readId <- NULL
saveRDS(rc, file = "rc_sample_id_notrackreads_version.rds")

After this, you can use the notrackreads version of the RDS files in the discovery step.
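
For completeness, a minimal loop over all pre-processed files (the rc_dir path and the output naming are hypothetical):

library(bambu)   # attaches SummarizedExperiment/S4Vectors, which provide metadata()
rcFiles <- list.files("rc_dir", pattern = "\\.rds$", full.names = TRUE)
for (f in rcFiles) {
  rc <- readRDS(f)
  metadata(rc)$readNames <- NULL    # drop per-read names stored by trackReads
  metadata(rc)$readId <- NULL       # drop per-read ids
  saveRDS(rc, sub("\\.rds$", "_notrackreads.rds", f))
}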

If the discovery step works after this modification, you can revert to using the original RC files for quantification. However, for this step, you might want to process your samples in batches to avoid R memory issues.

Hope this helps! Please let me know if you need any clarifications or further assistance.

Thank you! Warm regards, Ying

cying111 avatar Feb 11 '25 09:02 cying111

Brilliant! The code now runs successfully. Thank you very much for the support and quick replies.

I have one small observation: it seems that setting trackReads = FALSE during the discovery step should already remove the trackReads overhead, even if the input RDS files have their reads tracked.

Perhaps you might consider adding these two lines to the RDS file pre-processing when trackReads is set to FALSE. If the final file will not be tracking reads, eliminating the overhead early on would be beneficial (unless I'm missing something):

metadata(rc)$readNames <- NULL
metadata(rc)$readId <- NULL

Once again, thank you for all your excellent work! :)

bernardo-heberle avatar Feb 12 '25 15:02 bernardo-heberle