MSnbase CDF read seems slow

Hello, XCMS user here, finally trying to adapt an older workflow to new XCMS/MSnbase. Most of the files I work with are in .cdf format. Waters Databridge spits data out in this format.

raw_data <- readMSData(files = filedata$files, msLevel. = 1, centroided = TRUE, cache = 0)

There are 14 CDF files in this test set. This process took 7.3 hours (I had misformatted my pheno data frame, so it errored at that point, but I assume 7.3 hours is a representative minimum). I tried the mzR step in isolation (msdata <- mzR::openMSfile(f, backend = "netCDF")), and this was as fast as previously experienced. The time-consuming part is in the for loop (for (f in files) {}). I am running this from a local solid-state drive, so I can rule out network issues. This is my first go at this, but this seems like an abnormal amount of time to read in the data. If it helps, I can post a couple of files. Thanks, Corey

session info:

R version 3.4.2 (2017-09-28) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages: [1] parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] xcms_1.52.0 MSnbase_2.2.0 ProtGenerics_1.8.0 mzR_2.10.0 Rcpp_0.12.16
[6] BiocParallel_1.10.1 Biobase_2.36.2 BiocGenerics_0.22.1

loaded via a namespace (and not attached): [1] pillar_1.2.1 compiler_3.4.2 BiocInstaller_1.26.1 RColorBrewer_1.1-2
[5] plyr_1.8.4 iterators_1.0.9 tools_3.4.2 zlibbioc_1.22.0
[9] MALDIquant_1.17 digest_0.6.15 preprocessCore_1.38.1 tibble_1.4.2
[13] gtable_0.2.0 lattice_0.20-35 rlang_0.2.0 Matrix_1.2-11
[17] foreach_1.4.4 S4Vectors_0.14.7 IRanges_2.10.5 stats4_3.4.2
[21] multtest_2.32.0 grid_3.4.2 impute_1.50.1 survival_2.41-3
[25] XML_3.98-1.10 RANN_2.5.1 limma_3.32.10 ggplot2_2.2.1
[29] MASS_7.3-47 splines_3.4.2 scales_0.5.0 pcaMethods_1.68.0
[33] codetools_0.2-15 MassSpecWavelet_1.42.0 mzID_1.14.0 colorspace_1.3-2
[37] affy_1.54.0 lazyeval_0.2.1 munsell_0.4.3 doParallel_1.0.11
[41] vsn_3.44.0 affyio_1.46.0

Apr 12 '18 15:04 cbroeckl

Please use the following to read your raw data in:

raw_data <- readMSData(files = filedata$files, msLevel. = 1, centroided = TRUE, mode = "onDisk")

This won't load all the raw data into memory but only access it when necessary. For further details, see this short benchmarking.

I'll close the issue, but feel free to re-open it and ask for clarifications if needed.

Thank you for your interest in the software!

Apr 12 '18 18:04 lgatto

Thanks - I figured it was an easy fix.

Apr 12 '18 19:04 cbroeckl

raw_data <- readMSData(files = filedata$files, msLevel. = 1, centroided = TRUE, mode = "onDisk") Error in readMSData(files = filedata$files, msLevel. = 1, centroided = TRUE, : unused argument (mode = "onDisk")

?readMSData help file also does not report 'mode' as an option.
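One quick way to confirm whether an installed function accepts a given argument is to inspect its formals. The sketch below uses a hypothetical helper, `supports_arg` (not part of MSnbase), and demonstrates it on a base R function so that it runs without MSnbase installed. With MSnbase loaded, the analogous check would be `supports_arg(readMSData, "mode")`.

```r
# Hypothetical helper (not part of MSnbase): does a function's signature
# include an argument with the given name?
supports_arg <- function(fun, arg) {
  arg %in% names(formals(fun))
}

# Demonstrated on a base R function so the example runs anywhere.
supports_arg(utils::read.csv, "header")
supports_arg(utils::read.csv, "mode")
```

If the check on `readMSData` comes back `FALSE`, the installed MSnbase predates the `mode` argument and needs updating.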

Apr 12 '18 19:04 cbroeckl

Update from github that has v2.5.12?

Apr 12 '18 19:04 stanstrup

Yes, you need to update your MSnbase installation, and possibly R. What's your sessionInfo()?

Apr 12 '18 19:04 lgatto

Thanks to you both!
I will update everything and let you know what happens.

Current session info:

R version 3.4.2 (2017-09-28) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252

attached base packages: [1] parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] xcms_1.52.0 MSnbase_2.2.0 ProtGenerics_1.8.0 mzR_2.10.0 Rcpp_0.12.16 [6] BiocParallel_1.10.1 Biobase_2.36.2 BiocGenerics_0.22.1

loaded via a namespace (and not attached): [1] pillar_1.2.1 compiler_3.4.2 BiocInstaller_1.26.1 RColorBrewer_1.1-2 [5] plyr_1.8.4 iterators_1.0.9 tools_3.4.2 zlibbioc_1.22.0 [9] MALDIquant_1.17 digest_0.6.15 preprocessCore_1.38.1 tibble_1.4.2 [13] gtable_0.2.0 lattice_0.20-35 rlang_0.2.0 Matrix_1.2-11 [17] foreach_1.4.4 S4Vectors_0.14.7 IRanges_2.10.5 stats4_3.4.2 [21] multtest_2.32.0 grid_3.4.2 impute_1.50.1 survival_2.41-3 [25] XML_3.98-1.10 RANN_2.5.1 limma_3.32.10 ggplot2_2.2.1 [29] MASS_7.3-47 splines_3.4.2 scales_0.5.0 pcaMethods_1.68.0 [33] codetools_0.2-15 MassSpecWavelet_1.42.0 mzID_1.14.0 colorspace_1.3-2 [37] affy_1.54.0 lazyeval_0.2.1 munsell_0.4.3 doParallel_1.0.11 [41] vsn_3.44.0 affyio_1.46.0

Apr 12 '18 19:04 cbroeckl

New R, Newest BioC versions of XCMS and MSnbase, same error. I will try to update next from github.

R version 3.4.4 (2018-03-15) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages: [1] parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] xcms_1.52.0 MSnbase_2.2.0 ProtGenerics_1.8.0 BiocParallel_1.10.1 mzR_2.10.0
[6] Rcpp_0.12.16 Biobase_2.36.2 BiocGenerics_0.22.1 BiocInstaller_1.26.1

loaded via a namespace (and not attached): [1] pillar_1.2.1 compiler_3.4.4 RColorBrewer_1.1-2 plyr_1.8.4
[5] iterators_1.0.9 tools_3.4.4 zlibbioc_1.22.0 MALDIquant_1.17
[9] digest_0.6.15 tibble_1.4.2 preprocessCore_1.38.1 gtable_0.2.0
[13] lattice_0.20-35 rlang_0.2.0 Matrix_1.2-12 foreach_1.4.4
[17] yaml_2.1.18 S4Vectors_0.14.7 IRanges_2.10.5 stats4_3.4.4
[21] multtest_2.32.0 grid_3.4.4 impute_1.50.1 survival_2.41-3
[25] XML_3.98-1.10 RANN_2.5.1 limma_3.32.10 ggplot2_2.2.1
[29] MASS_7.3-49 splines_3.4.4 scales_0.5.0 pcaMethods_1.68.0
[33] codetools_0.2-15 MassSpecWavelet_1.42.0 mzID_1.14.0 colorspace_1.3-2
[37] affy_1.54.0 lazyeval_0.2.1 munsell_0.4.3 doParallel_1.0.11
[41] vsn_3.44.0 affyio_1.46.0

Apr 12 '18 21:04 cbroeckl

Version incompatibility is preventing me from testing this.

install_github("lgatto/MSnbase") ... Error : package 'mzR' 2.10.0 was found, but >= 2.13.6 is required by 'MSnbase'

install_github("sneumann/mzR") .... ERROR: dependency 'Rhdf5lib' is not available for package 'mzR'

install.packages("Rhdf5lib") ... package ‘Rhdf5lib’ is not available (for R version 3.4.4)

...drops head onto keyboard....

Apr 12 '18 21:04 cbroeckl

@cbroeckl I just did the same as you over Easter, with similar head banging. No worries though, it gets better. Try also installing Rhdf5lib and mzR from github.

Apr 13 '18 04:04 SiggiSmara

What I did was install them in the order Rhdf5lib, mzR, and finally MSnbase. Be aware that R might re-compile dependent packages when installing the next one, which is also a bit frustrating, but it “should” work in the end.

Apr 13 '18 04:04 SiggiSmara

Sorry for the repeated comments, I’m on my iPhone... One more warning: you might also need to install the development version of R for this to work. I re-installed several versions of R, and although that was not long ago, the intricacies of installing MSnbase and xcms were a little different every time, and my mind refuses to remember each one.

Apr 13 '18 04:04 SiggiSmara

I'll also chime in here - seems to be a busy issue :) R version 3.4.x should be OK, but you have an outdated version of MSnbase. Please run:

library(BiocInstaller)
biocLite(c("MSnbase", "mzR"))

this should bring you up to version 2.4.2 of MSnbase. You shouldn't have to go through the struggles of installing from github.

Apr 13 '18 05:04 jorainer

Some additional clarification.

Any changes to Bioconductor packages are first released in the development branch of Bioconductor. As developers we also use github, although that should generally not be needed (and isn't recommended). The current versions in the stable release are

mzR: 2.12.0
MSnbase: 2.4.2
xcms: 3.0.2

These require R 3.4.z, and can all be installed using BiocInstaller::biocLite()
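As a rough illustration, the release versions listed above can be compared against the versions reported in the session info earlier in this thread. This is only a sketch: `outdated` is a hypothetical helper, and the installed versions are hard-coded from the posted session info rather than queried from a live system (where `utils::packageVersion()` could be used instead).

```r
# Release versions quoted above.
required <- c(mzR = "2.12.0", MSnbase = "2.4.2", xcms = "3.0.2")

# Versions copied from the session info posted earlier in this thread.
installed <- c(mzR = "2.10.0", MSnbase = "2.2.0", xcms = "1.52.0")

# Hypothetical helper: names of packages older than the required version.
outdated <- function(installed, required) {
  pkgs <- intersect(names(installed), names(required))
  old <- vapply(pkgs, function(p) {
    package_version(installed[[p]]) < package_version(required[[p]])
  }, logical(1))
  pkgs[old]
}

outdated(installed, required)
```

Comparing with `package_version()` rather than as plain strings matters here, because a string comparison would rank "2.4.2" above "2.12.0".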

Note that with the setup above, you will be able to use onDisk mode, as described earlier, since this mode has been around for quite some time (it is what led to MSnbase version 2).

If for some reason you are looking for mzR 2.13.z, MSnbase 2.5.z or xcms 3.1.z, then you'll need to switch to the development version of Bioconductor (using the same version of R for now). This can be done with BiocInstaller::useDevel(). I don't recommend this, especially now that we are close to a new release, as these development versions will become the stable releases very soon.

As for Rhdf5lib, it is a new dependency of mzR, i.e. you don't need it with the official release, but you do need it if you use the development version. This explains why you encountered the annoyances that led to your head hitting the keyboard (sorry about that).

Summary: I would suggest you stick with all official release versions, as they should offer the functionality you need, and we should help you sort these issues out, as switching between releases and github will lead to other annoyances that will become irrelevant in a couple of weeks.

Apr 13 '18 06:04 lgatto

Hello all - thanks for the tips, assistance, and sympathy.....

Unless I am really botching things (which is certainly possible), there is still a problem.

See session info below, after having removed mzR and MSnbase libraries from my local computer, reinstalling fresh:

library(BiocInstaller)
biocLite(c("MSnbase", "mzR"))

I still do not have an option for onDisk in the readMSData function, and am not being updated to the versions I should be.

R version 3.4.4 (2018-03-15) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages: [1] parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] BiocInstaller_1.28.0 MSnbase_2.2.0 ProtGenerics_1.8.0 BiocParallel_1.10.1 Biobase_2.36.2
[6] BiocGenerics_0.22.1 mzR_2.10.0 Rcpp_0.12.16

loaded via a namespace (and not attached): [1] IRanges_2.10.5 zlibbioc_1.22.0 doParallel_1.0.11 munsell_0.4.3 colorspace_1.3-2
[6] impute_1.50.1 lattice_0.20-35 rlang_0.2.0 foreach_1.4.4 plyr_1.8.4
[11] tools_3.4.4 mzID_1.14.0 grid_3.4.4 gtable_0.2.0 affy_1.54.0
[16] iterators_1.0.9 digest_0.6.15 yaml_2.1.18 lazyeval_0.2.1 preprocessCore_1.38.1
[21] tibble_1.4.2 affyio_1.46.0 ggplot2_2.2.1 S4Vectors_0.14.7 codetools_0.2-15
[26] MALDIquant_1.17 limma_3.32.10 compiler_3.4.4 pillar_1.2.1 pcaMethods_1.68.0
[31] scales_0.5.0 stats4_3.4.4 XML_3.98-1.10 vsn_3.44.0

Apr 13 '18 14:04 cbroeckl

I think that a loaded R package was preventing updates from occurring properly. I think I have it working now. Thanks again for all the guidance. R continues to humble me....
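For anyone hitting the same problem, a minimal sketch of how one might check for loaded packages before updating. `loaded_pkgs` is a hypothetical helper built on base R's `loadedNamespaces()`. If it reports any of the packages to be updated, restarting R and updating in a fresh session is probably the safest route.

```r
# Hypothetical helper: which of the given packages currently have a loaded
# namespace in this R session?
loaded_pkgs <- function(pkgs) {
  intersect(pkgs, loadedNamespaces())
}

# Packages discussed in this thread; ideally this returns character(0)
# in a fresh session before running the update.
loaded_pkgs(c("MSnbase", "mzR", "xcms"))
```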

Corey

Apr 13 '18 16:04 cbroeckl

@lgatto I realize this is a closed thread, but I actually think this is the best place to deposit some benchmarking data. I was again playing with speeds for onDisk vs inMemory, and CDF inMemory is remarkably slow. I tried this more than once (it is so slow that it is hard to justify testing more than a few times), and the results are consistent. mzML is closer to what I would expect, but a small CDF file takes forever to read into memory.

Clearly, the short answer is still 'use onDisk!', but this is such anomalous behavior that I thought it worth bringing to your attention. GC-MS files are still primarily exported by vendor software in .cdf format. Pwiz could probably be used instead, but I haven't tried it.

library("MSnbase")

# small GC-MS CDF file
########################
f <- "C:/Temp/alkane mix C8-C20_08.cdf"
utils:::format.object_size(file.size(f), "auto")
[1] "61.1 Mb"

system.time(ondisk <- readMSData(f, msLevel = 1, mode = "onDisk", centroided = TRUE))
   user  system elapsed
   0.36    0.00    0.38

system.time(inmem <- readMSData(f, msLevel = 1, mode = "inMemory", centroided = TRUE))
   user  system elapsed
7383.36 1560.45 8950.26

# SWITCH TO mzML File
########################
f <- "C:/Temp/20191212_028.mzML"
utils:::format.object_size(file.size(f), "auto")
[1] "543.1 Mb"

system.time(ondisk <- readMSData(f, msLevel = 1, mode = "onDisk", centroided = TRUE))
   user  system elapsed
   0.42    0.04    1.36

system.time(inmem <- readMSData(f, msLevel = 1, mode = "inMemory", centroided = TRUE))
   user  system elapsed
  25.09    2.81   27.87

sessionInfo()

R version 4.0.3 (2020-10-10) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] pryr_0.1.4 MSnbase_2.14.2 ProtGenerics_1.20.0 S4Vectors_0.26.1 mzR_2.22.0
[6] Rcpp_1.0.5 Biobase_2.48.0 BiocGenerics_0.34.0

loaded via a namespace (and not attached): [1] BiocManager_1.30.10 compiler_4.0.3 pillar_1.4.6 plyr_1.8.6 iterators_1.0.13
[6] zlibbioc_1.34.0 tools_4.0.3 digest_0.6.26 ncdf4_1.17 MALDIquant_1.19.3
[11] evaluate_0.14 lifecycle_0.2.0 tibble_3.0.4 preprocessCore_1.50.0 gtable_0.3.0
[16] lattice_0.20-41 pkgconfig_2.0.3 rlang_0.4.8 foreach_1.5.1 rstudioapi_0.11
[21] yaml_2.2.1 xfun_0.18 gridExtra_2.3 stringr_1.4.0 knitr_1.30
[26] dplyr_1.0.2 IRanges_2.22.2 generics_0.0.2 vctrs_0.3.4 grid_4.0.3
[31] tidyselect_1.1.0 glue_1.4.2 impute_1.62.0 R6_2.4.1 XML_3.99-0.5
[36] BiocParallel_1.22.0 rmarkdown_2.5 limma_3.44.3 ggplot2_3.3.2 purrr_0.3.4
[41] magrittr_1.5 htmltools_0.5.0 scales_1.1.1 pcaMethods_1.80.0 codetools_0.2-16
[46] ellipsis_0.3.1 MASS_7.3-53 mzID_1.26.0 colorspace_1.4-1 stringi_1.5.3
[51] affy_1.66.0 doParallel_1.0.16 munsell_0.5.0 vsn_3.56.0 crayon_1.3.4
[56] affyio_1.58.0

Feb 03 '21 14:02 cbroeckl

Thanks @cbroeckl for the benchmark. Interesting, I was not expecting that netCDF is that slow. Just to explain: the initial import of the data with the onDisk backend usually is fast, but remember that any operation on m/z or intensity values will have to read/import the respective data from the original CDF file. You would also have to compare the timing of a call like intensity or mz on the inMemory and onDisk mode objects.

Generally, I think we cannot do much to improve or tune the performance of the netCDF import in MSnbase: this is all limited by mzR, which in turn uses the ncdf4 package. Could you possibly share one of your CDF files for me to check?

A workaround solution for you could also be to convert your CDF files to mzML (by reading them with readMSData and exporting them again with writeMSData) - you could use this in e.g. a batch script to automatically convert all CDF files.
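A minimal sketch of what such a batch conversion might look like, assuming MSnbase is installed and all CDF files sit in one directory. `cdf_to_mzml_path` and `convert_cdf_dir` are hypothetical helper names, and skipping files that already have an mzML counterpart is a design choice of this sketch, not something prescribed by MSnbase.

```r
# Hypothetical helper: derive the mzML file name from a CDF file name by
# replacing only the trailing extension.
cdf_to_mzml_path <- function(cdf) {
  sub("\\.cdf$", ".mzML", cdf, ignore.case = TRUE)
}

# Sketch of a batch conversion; MSnbase is only needed when files are found.
convert_cdf_dir <- function(dir) {
  cdf_files <- list.files(dir, pattern = "\\.cdf$", ignore.case = TRUE,
                          full.names = TRUE)
  for (cdf in cdf_files) {
    mzml <- cdf_to_mzml_path(cdf)
    if (file.exists(mzml)) next  # skip files that were already converted
    data <- MSnbase::readMSData(cdf, mode = "onDisk")
    MSnbase::writeMSData(data, file = mzml)
  }
  # Return the mzML paths corresponding to all CDF files found.
  invisible(cdf_to_mzml_path(cdf_files))
}
```

It could then be called as `convert_cdf_dir("C:/Temp")`, which writes each mzML file next to its source CDF file.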

Feb 04 '21 06:02 jorainer

@jorainer - happy to share the specific file i used.

I do understand that this may be beyond fixing; I just wanted to post this for posterity's sake in case others see similar delays.

Thanks!

Feb 04 '21 15:02 cbroeckl

Thanks @cbroeckl ! Some benchmarks on the file you provided:

library(xcms)
library(microbenchmark)
cdf <- "alkane mix C8-C20_08.cdf"
mzml <- sub("cdf", "mzML", cdf)
data <- readMSData(cdf, mode = "onDisk")
writeMSData(data, file = mzml)

Speed of importing just the header data:

microbenchmark(readMSData(cdf, mode = "onDisk"), readMSData(mzml, mode = "onDisk"), times = 10)
Unit: milliseconds
                              expr       min        lq      mean    median
  readMSData(cdf, mode = "onDisk")  512.9263  536.2836  547.8303  550.5631
 readMSData(mzml, mode = "onDisk") 1002.6475 1006.5494 1164.1963 1096.7649
        uq      max neval cld
  567.7189  570.768    10  a 
 1256.0380 1534.180    10   b

here the CDF file import is actually faster. Next checking the speed of extracting all m/z values:

data_cdf <- readMSData(cdf, mode = "onDisk")
data_mzml <- readMSData(mzml, mode = "onDisk")
microbenchmark(mz(data_cdf), mz(data_mzml), times = 10)
Unit: seconds
          expr      min       lq     mean   median       uq      max neval cld
  mz(data_cdf) 3.020813 3.268124 3.391805 3.377163 3.576331 3.741072    10   b
 mz(data_mzml) 2.677920 2.796839 2.985670 2.866433 3.049272 3.666313    10  a

there seems to be only a very small difference if loading the full data. Next we compare what happens on a single spectrum.

microbenchmark(data_cdf[[1234]], data_mzml[[1234]], times = 20)
Unit: milliseconds
              expr       min        lq      mean    median        uq        max
  data_cdf[[1234]] 530.65449 575.26221 662.43060 644.65432 719.64347 1057.76206
 data_mzml[[1234]]  40.54484  42.01771  43.54326  43.24457  44.07587   51.28742
 neval cld
    20   b
    20  a

that's indeed interesting. Thus, mzML allows a much faster access to individual spectra within the file - most likely because the mzML files are indexed. Could it be that the CDF files need to be indexed first (no idea with what software that could be done)?
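To put the three benchmarks side by side, the median timings reported above can be turned into CDF-to-mzML ratios. This is just arithmetic on the posted numbers. The data frame layout and column names are choices made for this sketch.

```r
# Median timings copied from the microbenchmark output above.
medians <- data.frame(
  task = c("header import", "all m/z values", "single spectrum"),
  unit = c("ms", "s", "ms"),
  cdf  = c(550.5631, 3.377163, 644.65432),
  mzml = c(1096.7649, 2.866433, 43.24457)
)

# Ratio > 1 means the CDF file was slower than the mzML file.
medians$cdf_vs_mzml <- round(medians$cdf / medians$mzml, 1)
medians
```

By these medians, the CDF file is about twice as fast for the header import, roughly on par for bulk m/z extraction, and about 15 times slower for single-spectrum access.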

Again, a workaround could be to convert all CDF files to mzML (e.g. like above with readMSData and writeMSData).

Feb 22 '21 14:02 jorainer
