
XCMS crash when processing thousands of files

Open adelabriere opened this issue 4 years ago • 6 comments

Dear XCMS team,

I ran into a recurring issue while processing thousands of files with XCMS. I have 2500 centroided files of 10 MB each; with my peak picking parameters, each contains 7000-8000 peaks. I process them using a MulticoreParam with 16 cores, as it is a Linux machine with 20 cores.

Even on a machine with 32 GB RAM and 20 cores the process does not finish: it crashes at the retention time correction step with the following message.


Detecting chromatographic peaks in 16610 regions of interest ... OK: 8409 found.
Detecting mass traces at 10.3648152902968 ppm ... OK
Detecting chromatographic peaks in 15906 regions of interest ... OK: 7796 found.
Detecting mass traces at 10.3648152902968 ppm ... OK
Detecting chromatographic peaks in 15624 regions of interest ... OK: 7638 found.
wpeakpicking: 2.818694 
Sample number 1219 used as center sample.
Error in result[[njob]] <- value : 
  attempt to select less than one element in OneIndex
Calls: adjustRtime ... bplapply -> bplapply -> bplapply -> bploop -> bploop.lapply
In addition: Warning message:
In parallel::mccollect(wait = FALSE, timeout = 1) :
  1 parallel job did not deliver a result
Execution halted

It works with SerialParam but it is very slow.

Is there something I can do to allow parallel processing? I don't think memory should be an issue here. I have encountered this bug many times, and I have the feeling it is a strange parallel bug that may be impossible to fix on your side, but if you have a fix I'll be very happy.

Best, Alexis Delabriere

adelabriere avatar Jun 18 '20 12:06 adelabriere

Hi, thanks for reporting. The success with SerialParam tells us that all files and the parameters are OK, so it is the 16 parallel processes that break things. I wonder if the BioC people have a place to discuss this (bioc-devel?), since (I hope) xcms/mzR/MSnbase are not the only culprits. Yours, Steffen

sneumann avatar Jun 18 '20 13:06 sneumann

I got the same errors recently while processing large experiments, and this was related to the system running out of memory (64 GB memory, 6 cores; in fact I ran R in a Docker container that ran out of memory). So the error message is unfortunately not very helpful.

32 GB does not sound like much if you run 16 processes in parallel (it will be less than 2 GB per process, as Linux will also need some memory). Also, remember that for peak detection you will have 16 processes reading from 16 different files in parallel! So disk I/O can become a real bottleneck here. For that reason I usually don't run analyses on more than 10 cores, even if I process ~5000 files.

What I would suggest is the following:

  • Peak detection can be done in parallel; it does not require a lot of memory.
  • For alignment and correspondence analysis, I would either reduce the number of parallel processes or use SerialParam() instead.

Note that you can change the default parallel processing setup at any stage in your script, e.g. (cwp, pdp and pgp below are placeholders for your own parameter objects):

library(xcms)          # findChromPeaks(), groupChromPeaks(), adjustRtime()
library(BiocParallel)  # register(), MulticoreParam(), SerialParam()

## peak detection
register(MulticoreParam(10))
xdata <- findChromPeaks(raw_data, param = cwp)  # cwp: e.g. a CentWaveParam()

## alignment
register(MulticoreParam(2))
xdata <- groupChromPeaks(xdata, param = pdp)    # pdp: e.g. a PeakDensityParam()
xdata <- adjustRtime(xdata, param = pgp)        # pgp: e.g. a PeakGroupsParam()

## correspondence
register(SerialParam())
xdata <- groupChromPeaks(xdata, param = pdp)

jorainer avatar Jun 22 '20 17:06 jorainer

Hi, thanks for your suggestion,

I'll try that.

adelabriere avatar Jun 22 '20 19:06 adelabriere

Just wanted to tack a general question onto this thread. It seems to me that the memory demands of XCMS3 are larger than they were in the original configuration. Does this have to do with BPPARAM, or is there something in the new XCMS format that increases memory usage to improve performance? I have 24 GB on my office desktop, and where I used to be able to process my LC-MS files there, I cannot any longer. Even when set to use only 2 cores on a beefier processing computer (64 GB RAM), R sessions take up about 20+ GB of memory (the full raw data on disk is around 0.5 GB). I have seen an R package utilizing MSnbase take the approach of calling several parallel sessions in series, i.e. process 8 files with 4 cores, then stopCluster; continue with the next 8 files, and stopCluster once they are done. The claim was that this reduced memory demands (a sketch of that pattern is below).
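A minimal sketch of that batch-wise pattern, assuming mzML input; the directory name, batch size and CentWaveParam() settings here are illustrative, not taken from the package in question:

library(MSnbase)        # readMSData()
library(xcms)           # findChromPeaks(), CentWaveParam()
library(BiocParallel)   # SnowParam(), bpstop()

## Hypothetical input; adjust path and pattern to your data
files <- list.files("mzml_dir", pattern = "\\.mzML$", full.names = TRUE)

## Process 8 files at a time, with a fresh 4-worker cluster per batch
batches <- split(files, ceiling(seq_along(files) / 8))
peaks <- vector("list", length(batches))
for (i in seq_along(batches)) {
    bp <- SnowParam(workers = 4)
    raw <- readMSData(batches[[i]], mode = "onDisk")
    peaks[[i]] <- findChromPeaks(raw, param = CentWaveParam(), BPPARAM = bp)
    bpstop(bp)          # shut the workers down to release their memory
}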

I don't have any real issues here, but am trying to understand what I see; it didn't seem worth opening a new issue. Thanks for any feedback on this subject (maybe there is a document in place somewhere?).

cbroeckl avatar Dec 11 '20 20:12 cbroeckl

Hi @cbroeckl, yes, with the XCMSnExp we might indeed have a larger memory footprint than before, mostly because more and more information (spectra header columns) has been added that is imported from the mzML files. What you can do is reduce the feature data to only the columns that you are interested in or that are required. Have a look at fData(xdata), where xdata is the MSnExp object that you get after reading the data with readMSData, and check which columns contain data and which you want to keep. You can then filter this information using selectFeatureData, e.g. (the column names below are just examples; pick the ones you need):

head(fData(xdata))
## Example columns of interest; adjust to your data
my_cols <- c("msLevel", "retentionTime", "polarity")
## Add also the required ones
my_cols <- unique(c(MSnbase:::.MSnExpReqFvarLabels, my_cols))
xdata <- selectFeatureData(xdata, fcol = my_cols)

Another reason for the larger memory requirement is that we added some more columns to the chromPeaks matrix: now we also record the MS level and whether or not peaks are filled in. These are just two columns, but depending on the number of peaks you have, that can result in quite some memory demand. Note: we now put this information in a separate DataFrame parallel to the chromPeaks matrix to save memory: the chromPeaks matrix is a numeric matrix with double precision, but e.g. the MS level is only an integer and thus needs half the memory of a numeric. So we tried to save some memory with that.
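The size difference is easy to verify in plain R (the vector length is arbitrary):

## A double takes 8 bytes per element, an integer only 4
object.size(numeric(1e6))   # roughly 8 MB
object.size(integer(1e6))   # roughly 4 MB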

BiocParallel and the BPPARAM should not have that much of an influence in the end; they don't increase the memory demand per se. Note, however, that it all depends a little on the system on which you run xcms: unix systems with MulticoreParam() are able to use shared memory, so this will need less memory than e.g. Windows, on which only SnowParam() can be used. On Windows, if you run 4 parallel R processes, they will all copy the memory content of the original R session.
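A small sketch of picking the backend by platform (the worker count of 4 is just an example):

library(BiocParallel)
if (.Platform$OS.type == "unix") {
    register(MulticoreParam(4))  # forked workers, memory pages are shared
} else {
    register(SnowParam(4))       # separate R processes, memory is copied
}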

In the very long run I would like to replace all the MSnbase objects with the newer objects from the Spectra package. These would allow using different backends to store/represent the data, so we could use even more memory-efficient solutions, or switch on the fly between on-disk and in-memory data representations.

jorainer avatar Dec 14 '20 09:12 jorainer

@jorainer - thanks for the great response. Based on your description, I expect that my observations are largely derived from the fact that I am on Windows, using Snow. Clearly, having 1 xcmsObject/xcmsRaw object in memory demands far less than 5 (of each)! I will also explore the select-columns approach to see whether that improves performance.

cbroeckl avatar Dec 14 '20 14:12 cbroeckl