Error with bambu for a big file
My BAM file was aligned with minimap2 using the `-ax splice` option against the C. elegans reference genome.
When trying to quantify the data with bambu, I get the following error:
```r
bambu(reads = bam.raw, annotations = bambuAnnotations, genome = fa.file, ncore = 10)
```

```
Start generating read class files
|======================================================================| 100%
Error: BiocParallel errors
1 remote errors, element index: 1
0 unevaluated and other errors
first remote error:
Error: Assigned data `vec_slice(y_out, y_slicer)` must be compatible with existing data.
✖ Existing data has 0 rows.
✖ Assigned data has 2852609386 rows.
ℹ Only vectors of size 1 are recycled.
In addition: Warning message:
In bambu.processReadsByFile(bam.file = reads[bamFileName], genomeSequence = genomeSequence, :
24 reads are mapped outside the provided genomic regions. These reads will be dropped. Check you are using the same genome used for the alignment
```
I am also not sure about the warning, as I used the same reference genome in both runs. But the error keeps breaking my run. Any ideas what this could be?
Thanks
```
> sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: SUSE Linux Enterprise Server 15 SP3

Matrix products: default
BLAS/LAPACK: /fs/gpfs41/lv07/fileset03/home/b_cox/yeroslaviz/miniconda3/envs/R/lib/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] BiocParallel_1.30.3         bambu_2.2.0
 [3] BSgenome_1.64.0             rtracklayer_1.56.1
 [5] Biostrings_2.64.0           XVector_0.36.0
 [7] SummarizedExperiment_1.26.1 Biobase_2.56.0
 [9] GenomicRanges_1.48.0        GenomeInfoDb_1.32.2
[11] IRanges_2.30.0              S4Vectors_0.34.0
[13] BiocGenerics_0.42.0         MatrixGenerics_1.8.1
[15] matrixStats_0.62.0          BiocManager_1.30.18

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9               lattice_0.20-45          tidyr_1.2.0
 [4] prettyunits_1.1.1        png_0.1-7                Rsamtools_2.12.0
 [7] assertthat_0.2.1         digest_0.6.29            utf8_1.2.2
[10] BiocFileCache_2.4.0      R6_2.5.1                 RSQLite_2.2.15
[13] httr_1.4.3               pillar_1.8.0             zlibbioc_1.42.0
[16] rlang_1.0.4              GenomicFeatures_1.48.3   progress_1.2.2
[19] curl_4.3.2               data.table_1.14.2        blob_1.2.3
[22] Matrix_1.4-1             stringr_1.4.0            RCurl_1.98-1.7
[25] bit_4.0.4                biomaRt_2.52.0           DelayedArray_0.22.0
[28] compiler_4.2.0           pkgconfig_2.0.3          tidyselect_1.1.2
[31] KEGGREST_1.36.3          tibble_3.1.8             GenomeInfoDbData_1.2.8
[34] codetools_0.2-18         XML_3.99-0.9             fansi_1.0.3
[37] crayon_1.5.1             dplyr_1.0.9              dbplyr_2.2.1
[40] rappdirs_0.3.3           GenomicAlignments_1.32.1 bitops_1.0-7
[43] grid_4.2.0               jsonlite_1.8.0           lifecycle_1.0.1
[46] DBI_1.1.3                magrittr_2.0.3           cli_3.3.0
[49] stringi_1.7.6            cachem_1.0.6             xml2_1.3.3
[52] filelock_1.0.2           ellipsis_0.3.2           vctrs_0.4.1
[55] generics_0.1.3           xgboost_1.6.0.1          rjson_0.2.21
[58] restfulr_0.0.15          tools_4.2.0              bit64_4.0.5
[61] glue_1.6.2               purrr_0.3.4              hms_1.1.1
[64] parallel_4.2.0           fastmap_1.1.0            yaml_2.3.5
[67] AnnotationDbi_1.58.0     memoise_2.0.1            BiocIO_1.6.0
```
I tested whether splitting the BAM file would help. I split it by chromosome with bamtools split, and now it works. Any ideas why this happens, or what the size limit is?
This is not the solution after all. Somehow working on specific chromosomes doesn't go through either. I split chrI into six pieces, which should be small enough to run, but I get a similar error:
```
Error: BiocParallel errors
1 remote errors, element index: 1
0 unevaluated and other errors
first remote error: Assigned data `vec_slice(y_out, y_slicer)` must be compatible with existing data.
✖ Existing data has 206946517 rows.
✖ Assigned data has 4501913813 rows.
ℹ Only vectors of size 1 are recycled.
```
I'm not even sure whether this is a bambu or a BiocParallel error.
Hi,
Sorry for the delayed response. As you suspect, I do not think this is a size issue, but rather something about the alignments on chrI. From the error message alone it is hard to say what exactly the issue is, so I will recommend a few general troubleshooting steps before delving deeper.
I know you said you used the same genome for both runs, but could I get you to please triple-check that the genome used for alignment is identical to the one passed into bambu? Can you also check for the presence of chrI in the genome file? I looked at GCF_000002985.6_WBcel235_genomic.fa (a C. elegans reference genome I found, which could be different from the one you are using) and it did not contain chrI; the chromosomes were instead named like NC_003279.8.
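If it helps, here is one way to compare the sequence names in the genome FASTA against those in the BAM header directly in R, using Biostrings and Rsamtools (both already in your session); the file names below are placeholders for your own paths:

```r
library(Biostrings)   # readDNAStringSet()
library(Rsamtools)    # scanBamHeader()

# Placeholder paths -- substitute your actual genome and BAM files
fa <- readDNAStringSet("genome.fa")
fa.names <- sub(" .*", "", names(fa))   # keep the sequence ID, drop the description

bam.names <- names(scanBamHeader("aligned.sorted.bam")[[1]]$targets)

# Contigs present in the BAM header but missing from the FASTA --
# any output here points to a genome mismatch
setdiff(bam.names, fa.names)
```

If `setdiff()` returns anything, the BAM was aligned against a genome whose sequence names differ from the FASTA you are passing to bambu, which would also explain the "reads mapped outside the provided genomic regions" warning.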
If both of these are fine, could you send me the chrI BAM file you generated along with the genome and annotation files, so that I can run it myself? That will allow me to track down the cause of the error more easily.
Hi @yeroslaviz,
following up on Andre's suggestion, you may also try the following:
- run discovery and quantification separately,
- set a specific value for yieldSize, which controls how many reads are read in from the BAM file at a time (say, yieldSize = 1000000), so that memory usage is kept under control,
- use a limited number of CPUs; if a run fails with a certain number of CPUs, you can always lower the number and try again.

Try these solutions out and let us know if the problem remains.
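Put together, these suggestions might look roughly like the sketch below. The file paths are placeholders, and the two-step discovery/quantification pattern (with the `quant` and `discovery` switches) follows the bambu 2.x interface:

```r
library(bambu)

annotations <- prepareAnnotations("annotation.gtf")  # placeholder path

# Step 1: transcript discovery only, with a reduced memory footprint
extended.annotations <- bambu(reads = "aligned.sorted.bam",
                              annotations = annotations,
                              genome = "genome.fa",
                              quant = FALSE,     # discovery only
                              ncore = 1,         # fewer CPUs
                              yieldSize = 1e6)   # read the BAM in 1M-read chunks

# Step 2: quantification against the extended annotations
se <- bambu(reads = "aligned.sorted.bam",
            annotations = extended.annotations,
            genome = "genome.fa",
            discovery = FALSE,
            ncore = 1,
            yieldSize = 1e6)
```

Running the two steps separately also narrows down which phase triggers the error.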
You can also refer to issue #278 for more details on how to process big samples with bambu.