MSnbase
CDF read seems slow
Hello, XCMS user here, finally trying to adapt an older workflow to new XCMS/MSnbase. Most of the files I work with are in .cdf format. Waters Databridge spits data out in this format.
raw_data <- readMSData(files = filedata$files, msLevel. = 1, centroided = TRUE, cache = 0)
There are 14 CDF files in this test set. This process took 7.3 hours (I had misformatted my pheno dataframe so it errored at that point, but I assume the 7.3 hours is a representative minimum). I tried using the mzR process in isolation (msdata <- mzR::openMSfile(f, backend = "netCDF")), and this was as fast as previously experienced. The time-consuming part is in the for loop (for (f in files) {}). I am running this from a solid-state local drive, so I can rule out network issues. This is my first go at this, but this seems like an abnormal amount of time to read in the data. If it helps I can post a couple of files. Thanks, Corey
session info:
R version 3.4.2 (2017-09-28) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages: [1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] xcms_1.52.0 MSnbase_2.2.0 ProtGenerics_1.8.0 mzR_2.10.0 Rcpp_0.12.16
[6] BiocParallel_1.10.1 Biobase_2.36.2 BiocGenerics_0.22.1
loaded via a namespace (and not attached):
[1] pillar_1.2.1 compiler_3.4.2 BiocInstaller_1.26.1 RColorBrewer_1.1-2
[5] plyr_1.8.4 iterators_1.0.9 tools_3.4.2 zlibbioc_1.22.0
[9] MALDIquant_1.17 digest_0.6.15 preprocessCore_1.38.1 tibble_1.4.2
[13] gtable_0.2.0 lattice_0.20-35 rlang_0.2.0 Matrix_1.2-11
[17] foreach_1.4.4 S4Vectors_0.14.7 IRanges_2.10.5 stats4_3.4.2
[21] multtest_2.32.0 grid_3.4.2 impute_1.50.1 survival_2.41-3
[25] XML_3.98-1.10 RANN_2.5.1 limma_3.32.10 ggplot2_2.2.1
[29] MASS_7.3-47 splines_3.4.2 scales_0.5.0 pcaMethods_1.68.0
[33] codetools_0.2-15 MassSpecWavelet_1.42.0 mzID_1.14.0 colorspace_1.3-2
[37] affy_1.54.0 lazyeval_0.2.1 munsell_0.4.3 doParallel_1.0.11
[41] vsn_3.44.0 affyio_1.46.0
Please use the following to read your raw data in:
raw_data <- readMSData(files = filedata$files, msLevel. = 1, centroided = TRUE, mode = "onDisk")
This won't load all the raw data into memory but only access it when necessary. For further details, see this short benchmarking.
I'll close the issue, but feel free to re-open it and ask for clarifications if needed.
Thank you for your interest in the software!
Thanks - I figured it was an easy fix.
raw_data <- readMSData(files = filedata$files, msLevel. = 1, centroided = TRUE, mode = "onDisk")
Error in readMSData(files = filedata$files, msLevel. = 1, centroided = TRUE, : unused argument (mode = "onDisk")
?readMSData help file also does not report 'mode' as an option.
Update from github that has v2.5.12?
Yes, you need to update your MSnbase installation, and possibly R. What's your sessionInfo()?
Thanks to you both!
I will update everything and let you know what happens.
Current session info:
R version 3.4.2 (2017-09-28) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252
attached base packages: [1] parallel stats graphics grDevices utils datasets methods base
other attached packages: [1] xcms_1.52.0 MSnbase_2.2.0 ProtGenerics_1.8.0 mzR_2.10.0 Rcpp_0.12.16 [6] BiocParallel_1.10.1 Biobase_2.36.2 BiocGenerics_0.22.1
loaded via a namespace (and not attached): [1] pillar_1.2.1 compiler_3.4.2 BiocInstaller_1.26.1 RColorBrewer_1.1-2 [5] plyr_1.8.4 iterators_1.0.9 tools_3.4.2 zlibbioc_1.22.0 [9] MALDIquant_1.17 digest_0.6.15 preprocessCore_1.38.1 tibble_1.4.2 [13] gtable_0.2.0 lattice_0.20-35 rlang_0.2.0 Matrix_1.2-11 [17] foreach_1.4.4 S4Vectors_0.14.7 IRanges_2.10.5 stats4_3.4.2 [21] multtest_2.32.0 grid_3.4.2 impute_1.50.1 survival_2.41-3 [25] XML_3.98-1.10 RANN_2.5.1 limma_3.32.10 ggplot2_2.2.1 [29] MASS_7.3-47 splines_3.4.2 scales_0.5.0 pcaMethods_1.68.0 [33] codetools_0.2-15 MassSpecWavelet_1.42.0 mzID_1.14.0 colorspace_1.3-2 [37] affy_1.54.0 lazyeval_0.2.1 munsell_0.4.3 doParallel_1.0.11 [41] vsn_3.44.0 affyio_1.46.0
New R, Newest BioC versions of XCMS and MSnbase, same error. I will try to update next from github.
R version 3.4.4 (2018-03-15) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages: [1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] xcms_1.52.0 MSnbase_2.2.0 ProtGenerics_1.8.0 BiocParallel_1.10.1 mzR_2.10.0
[6] Rcpp_0.12.16 Biobase_2.36.2 BiocGenerics_0.22.1 BiocInstaller_1.26.1
loaded via a namespace (and not attached):
[1] pillar_1.2.1 compiler_3.4.4 RColorBrewer_1.1-2 plyr_1.8.4
[5] iterators_1.0.9 tools_3.4.4 zlibbioc_1.22.0 MALDIquant_1.17
[9] digest_0.6.15 tibble_1.4.2 preprocessCore_1.38.1 gtable_0.2.0
[13] lattice_0.20-35 rlang_0.2.0 Matrix_1.2-12 foreach_1.4.4
[17] yaml_2.1.18 S4Vectors_0.14.7 IRanges_2.10.5 stats4_3.4.4
[21] multtest_2.32.0 grid_3.4.4 impute_1.50.1 survival_2.41-3
[25] XML_3.98-1.10 RANN_2.5.1 limma_3.32.10 ggplot2_2.2.1
[29] MASS_7.3-49 splines_3.4.4 scales_0.5.0 pcaMethods_1.68.0
[33] codetools_0.2-15 MassSpecWavelet_1.42.0 mzID_1.14.0 colorspace_1.3-2
[37] affy_1.54.0 lazyeval_0.2.1 munsell_0.4.3 doParallel_1.0.11
[41] vsn_3.44.0 affyio_1.46.0
Version incompatibility is preventing me from testing this.
install_github("lgatto/MSnbase") ... Error : package 'mzR' 2.10.0 was found, but >= 2.13.6 is required by 'MSnbase'
install_github("sneumann/mzR") .... ERROR: dependency 'Rhdf5lib' is not available for package 'mzR'
install.packages("Rhdf5lib") ... package ‘Rhdf5lib’ is not available (for R version 3.4.4)
...drops head onto keyboard....
@cbroeckl I just did the same as you over Easter with similar head banging. No worries though, it gets better. Try also installing Rhdf5lib and mzR from GitHub.
What I did was install them in the order Rhdf5lib, mzR, and finally MSnbase. Be aware that R might re-compile the dependent packages when installing the next one, which is also a bit frustrating, but it “should” work in the end.
Sorry for the repeat comments, I’m on my iPhone... One more warning: you might also need to install the development version of R for this to work. I basically re-installed several versions of R, and although that was not long ago, the intricacies of installing MSnbase and xcms were a little different every time, and my mind is refusing to remember each individual one.
I'll also chime in here - seems to be a busy issue :) R version 3.4.x should be OK, but you have an outdated version of MSnbase. Please run:
library(BiocInstaller)
biocLite(c("MSnbase", "mzR"))
this should bring you up to version 2.4.2 of MSnbase. You shouldn't have to go through the struggles of installing from github.
Some additional clarification.
Any changes to Bioconductor packages are first released in the development branch of Bioconductor. As developers we also use github, although that should generally not be needed (and isn't recommended). The current versions in the stable release are
- mzR: 2.12.0
- MSnbase: 2.4.2
- xcms: 3.0.2
These require R 3.4.z, and can all be installed using BiocInstaller::biocLite().
Note that with the setup above, you will be able to use onDisk mode, as described earlier, as this mode has been around for quite some time (it is what led to MSnbase version 2).
If for some reason you are looking for mzR 2.13.z, MSnbase 2.5.z or xcms 3.1.z, then you'll need to switch to the development version of Bioconductor (using the same version of R for now). This can be done with BiocInstaller::useDevel(). I don't recommend this, especially now as we are close to a new release, and these development versions will become the stable releases very soon.
As for Rhdf5lib, it is a new dependency of mzR, i.e. you don't need it with the official release, but you do need it if you use the development version. This explains why you encountered the annoyances that led to your head hitting the keyboard (sorry about that).
Summary: I would suggest you stick with all official release versions, as they should offer the functionality you need, and we should help you sort these issues out, as switching between releases and github will lead to other annoyances that will become irrelevant in a couple of weeks.
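A quick way to confirm which version is actually installed, and whether readMSData accepts the mode argument, is sketched below. The helper name has_argument is hypothetical and only for illustration; it uses base R introspection and assumes nothing about MSnbase beyond the readMSData function discussed in this thread. Run it in a fresh R session so that an already loaded older package does not mask the update.

```r
# Hypothetical helper: TRUE if function `fun_name` exported by package `pkg`
# has a formal argument called `arg`.
has_argument <- function(fun_name, arg, pkg) {
  fun <- getExportedValue(pkg, fun_name)
  arg %in% names(formals(fun))
}

# Usage in a fresh R session (requires MSnbase to be installed):
# packageVersion("MSnbase")                      # version found on the library path
# has_argument("readMSData", "mode", "MSnbase")  # TRUE when mode = "onDisk" is supported
```

If the reported version does not change after biocLite(), the old package is probably still loaded or installed in a different library path (see .libPaths()).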
Hello all - thanks for the tips, assistance, and sympathy.....
Unless I am really botching things (which is certainly possible), there is still a problem.
See session info below, after having removed mzR and MSnbase libraries from my local computer, reinstalling fresh:
library(BiocInstaller)
biocLite(c("MSnbase", "mzR"))
I still do not have an option for onDisk in the readMSData function, and am not being updated to the versions I should be.
R version 3.4.4 (2018-03-15) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages: [1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] BiocInstaller_1.28.0 MSnbase_2.2.0 ProtGenerics_1.8.0 BiocParallel_1.10.1 Biobase_2.36.2
[6] BiocGenerics_0.22.1 mzR_2.10.0 Rcpp_0.12.16
loaded via a namespace (and not attached):
[1] IRanges_2.10.5 zlibbioc_1.22.0 doParallel_1.0.11 munsell_0.4.3 colorspace_1.3-2
[6] impute_1.50.1 lattice_0.20-35 rlang_0.2.0 foreach_1.4.4 plyr_1.8.4
[11] tools_3.4.4 mzID_1.14.0 grid_3.4.4 gtable_0.2.0 affy_1.54.0
[16] iterators_1.0.9 digest_0.6.15 yaml_2.1.18 lazyeval_0.2.1 preprocessCore_1.38.1
[21] tibble_1.4.2 affyio_1.46.0 ggplot2_2.2.1 S4Vectors_0.14.7 codetools_0.2-15
[26] MALDIquant_1.17 limma_3.32.10 compiler_3.4.4 pillar_1.2.1 pcaMethods_1.68.0
[31] scales_0.5.0 stats4_3.4.4 XML_3.98-1.10 vsn_3.44.0
I think that a loaded R package was preventing updates from occurring properly. I think I have it working now. Thanks again for all the guidance. R continues to humble me....
Corey
@lgatto I realize this is a closed thread, but I actually think this is the best place to deposit some benchmarking data. I was again playing with speeds for onDisk vs inMemory, and CDF inMemory is remarkably slow. I tried this more than once (it is so slow that it is pretty hard to justify testing this more than a few times), and the results are consistent. mzML is closer to what I would expect, but a small CDF file takes forever to read into memory.
Clearly, the short answer is still 'use onDisk!', but this is such anomalous behavior that I thought it worth bringing to your attention. GC-MS files are still primarily exported by vendor software in .cdf format. Pwiz could probably be used instead, but I haven't tried it.
library("MSnbase")
## small GC-MS CDF file
f <- "C:/Temp/alkane mix C8-C20_08.cdf"
utils:::format.object_size(file.size(f), "auto")
[1] "61.1 Mb"
system.time(ondisk <- readMSData(f, msLevel = 1, mode = "onDisk", centroided = TRUE))
user system elapsed
0.36 0.00 0.38
system.time(inmem <- readMSData(f, msLevel = 1, mode = "inMemory", centroided = TRUE))
user system elapsed
7383.36 1560.45 8950.26
## SWITCH TO mzML file
f <- "C:/Temp/20191212_028.mzML"
utils:::format.object_size(file.size(f), "auto")
[1] "543.1 Mb"
system.time(ondisk <- readMSData(f, msLevel = 1, mode = "onDisk", centroided = TRUE))
user system elapsed
0.42 0.04 1.36
system.time(inmem <- readMSData(f, msLevel = 1, mode = "inMemory", centroided = TRUE))
user system elapsed
25.09 2.81 27.87
sessionInfo()
R version 4.0.3 (2020-10-10) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 17134)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] pryr_0.1.4 MSnbase_2.14.2 ProtGenerics_1.20.0 S4Vectors_0.26.1 mzR_2.22.0
[6] Rcpp_1.0.5 Biobase_2.48.0 BiocGenerics_0.34.0
loaded via a namespace (and not attached):
[1] BiocManager_1.30.10 compiler_4.0.3 pillar_1.4.6 plyr_1.8.6 iterators_1.0.13
[6] zlibbioc_1.34.0 tools_4.0.3 digest_0.6.26 ncdf4_1.17 MALDIquant_1.19.3
[11] evaluate_0.14 lifecycle_0.2.0 tibble_3.0.4 preprocessCore_1.50.0 gtable_0.3.0
[16] lattice_0.20-41 pkgconfig_2.0.3 rlang_0.4.8 foreach_1.5.1 rstudioapi_0.11
[21] yaml_2.2.1 xfun_0.18 gridExtra_2.3 stringr_1.4.0 knitr_1.30
[26] dplyr_1.0.2 IRanges_2.22.2 generics_0.0.2 vctrs_0.3.4 grid_4.0.3
[31] tidyselect_1.1.0 glue_1.4.2 impute_1.62.0 R6_2.4.1 XML_3.99-0.5
[36] BiocParallel_1.22.0 rmarkdown_2.5 limma_3.44.3 ggplot2_3.3.2 purrr_0.3.4
[41] magrittr_1.5 htmltools_0.5.0 scales_1.1.1 pcaMethods_1.80.0 codetools_0.2-16
[46] ellipsis_0.3.1 MASS_7.3-53 mzID_1.26.0 colorspace_1.4-1 stringi_1.5.3
[51] affy_1.66.0 doParallel_1.0.16 munsell_0.5.0 vsn_3.56.0 crayon_1.3.4
[56] affyio_1.58.0
Thanks @cbroeckl for the benchmark. Interesting, I was not expecting netCDF to be that slow. Just to explain: the initial import of the data with the onDisk backend is usually fast, but remember that any operation on m/z or intensity values will have to read/import the respective data from the original CDF file. You would also have to compare the timing of a call like intensity or mz on the inMemory and onDisk mode objects.
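The comparison suggested here could be sketched roughly as follows. The helper name time_accessors is hypothetical; it simply wraps system.time() around whatever accessor functions are passed in, and the usage lines assume the ondisk and inmem objects from the benchmark earlier in this thread.

```r
# Hypothetical helper: apply each accessor function to `x` once and return the
# elapsed time in seconds, named after the accessor.
time_accessors <- function(x, accessors) {
  vapply(accessors,
         function(f) system.time(f(x))[["elapsed"]],
         numeric(1))
}

# Usage with the objects created in the benchmark above (requires MSnbase):
# library(MSnbase)
# acc <- list(mz = mz, intensity = intensity)
# rbind(onDisk = time_accessors(ondisk, acc),
#       inMemory = time_accessors(inmem, acc))
```

A single run per accessor is only a rough indication; for repeated measurements a package such as microbenchmark would give more stable numbers.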
Generally, I think we cannot do much to improve or tune the performance of the netCDF import in MSnbase: this is all limited by mzR, which in turn uses the ncdf4 package. Could you perhaps share one of your CDF files for me to check?
A workaround for you could also be to convert your CDF files to mzML (by reading them with readMSData and exporting them again with writeMSData); you could use this in e.g. a batch script to automatically convert all CDF files.
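Such a batch conversion might look like the sketch below. The function names cdf_to_mzml_path and convert_cdf_dir, and the choice to write each mzML file next to its CDF source, are assumptions made for illustration; only readMSData (with mode = "onDisk") and writeMSData come from this thread.

```r
# Hypothetical helper: map a CDF file path to the mzML path next to it.
cdf_to_mzml_path <- function(path) {
  sub("\\.cdf$", ".mzML", path, ignore.case = TRUE)
}

# Hypothetical helper: convert every CDF file found in `dir` to mzML and
# return the output paths. Requires MSnbase when there is anything to convert.
convert_cdf_dir <- function(dir) {
  cdf_files <- list.files(dir, pattern = "\\.cdf$", full.names = TRUE,
                          ignore.case = TRUE)
  for (cdf in cdf_files) {
    mzml <- cdf_to_mzml_path(cdf)
    if (file.exists(mzml)) next  # already converted in an earlier run
    x <- MSnbase::readMSData(cdf, mode = "onDisk")
    MSnbase::writeMSData(x, file = mzml)
  }
  cdf_to_mzml_path(cdf_files)
}

# Usage (the directory is a placeholder):
# mzml_files <- convert_cdf_dir("C:/Temp")
```

Reading with mode = "onDisk" keeps the conversion away from the slow inMemory import reported above; the converted files can then be passed to readMSData in place of the CDF files.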
@jorainer - happy to share the specific file i used.
I do understand that this may be beyond fixing; I just wanted to post this for posterity's sake in case others see similar delays.
Thanks!
Thanks @cbroeckl ! Some benchmarks on the file you provided:
library(xcms)
library(microbenchmark)
cdf <- "alkane mix C8-C20_08.cdf"
mzml <- sub("cdf", "mzML", cdf)
data <- readMSData(cdf, mode = "onDisk")
writeMSData(data, file = mzml)
Speed of importing just the header data:
microbenchmark(readMSData(cdf, mode = "onDisk"), readMSData(mzml, mode = "onDisk"), times = 10)
Unit: milliseconds
                              expr       min        lq      mean    median        uq      max neval cld
  readMSData(cdf, mode = "onDisk")  512.9263  536.2836  547.8303  550.5631  567.7189  570.768    10   a
 readMSData(mzml, mode = "onDisk") 1002.6475 1006.5494 1164.1963 1096.7649 1256.0380 1534.180    10   b
here the CDF file import is actually faster. Next checking the speed of extracting all m/z values:
data_cdf <- readMSData(cdf, mode = "onDisk")
data_mzml <- readMSData(mzml, mode = "onDisk")
microbenchmark(mz(data_cdf), mz(data_mzml), times = 10)
Unit: seconds
expr min lq mean median uq max neval cld
mz(data_cdf) 3.020813 3.268124 3.391805 3.377163 3.576331 3.741072 10 b
mz(data_mzml) 2.677920 2.796839 2.985670 2.866433 3.049272 3.666313 10 a
there seems to be only a very small difference if loading the full data. Next we compare what happens on a single spectrum.
microbenchmark(data_cdf[[1234]], data_mzml[[1234]], times = 20)
Unit: milliseconds
              expr       min        lq      mean    median        uq        max neval cld
  data_cdf[[1234]] 530.65449 575.26221 662.43060 644.65432 719.64347 1057.76206    20   b
 data_mzml[[1234]]  40.54484  42.01771  43.54326  43.24457  44.07587   51.28742    20   a
that's indeed interesting. Thus, mzML allows a much faster access to individual spectra within the file - most likely because the mzML files are indexed. Could it be that the CDF files need to be indexed first (no idea with what software that could be done)?
Again, a workaround could be to convert all CDF files to mzML (e.g. like above with readMSData and writeMSData).