
benchmarking xcms by OS/computer specs

Open cbroeckl opened this issue 3 years ago • 7 comments

I am looking for processing guidance. This question is prompted primarily by larger datafile sizes that come with newer (faster, more sensitive) instruments.

We have a decently beefy computer for most of our in-house processing: an i5 with 6 cores / 12 hyperthreads and 64 GB RAM, running Windows 10. When we updated to a new Q-TOF, we started scanning faster because the sensitivity allowed it. It is also higher resolution, which results in larger file sizes, even for centroided data. In practice, file sizes have increased about 5-fold. Processing time in XCMS has also increased considerably; I do not think it is 5x longer, but I would estimate 2-3x. What this means is that what once took two days now takes a week. I am worried about becoming processing-time limited in our services. I do have some other resources to try to tap for processing time, but I want to ask a more general approach question for now.

We run Windows because we can install programs like MassLynx and other vendor data viewers on the computer, enabling easy GUI-based manual evaluation of the data. I know that Windows and R have a rocky relationship, particularly with regard to memory (i.e. issue #492). I also know that newer processors will be faster, we aren't generally memory limited on the existing computer, and Linux is loved by informaticians everywhere.

Trying to make educated decisions on investing in computer infrastructure is hard for me, because I do not know what a reasonable expectation is for improvements in processing speed with newer hardware, or what a platform change might offer. Some specific questions:

  1. How much more memory efficient is Linux than Windows for multi-core processing?
  2. How much faster might processing be on Linux vs. Windows, assuming the same hardware?
  3. How much faster might processing be with updated processors, i.e. moving from a 6-core i5 to an 8-core i9? Can we just look at the speed specs of the processor to estimate this?

I am curious to know whether any benchmark data exist to answer any of the above questions, and if not, if you (collectively) might offer some guidance on this. Thanks in advance to all the XCMS gurus that respond!

cbroeckl avatar Jan 12 '21 16:01 cbroeckl

Hi, a fun mini-project would be to benchmark several AWS instances / CPUs https://aws.amazon.com/ec2/instance-types/ between Windows & Linux, since the "hardware" should be comparable. And you would see how much time, e.g., doubled RAM saves you. One would have to manually install R, and I recommend a single script responsible for running the entire benchmark, including XCMS installation, data download and benchmarking of the individual steps. Yours, Steffen
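
A minimal sketch of what such a single benchmark script could look like (my sketch, not part of the original suggestion; the faahKO demo files stand in for real mzML data, and only the peak-detection step is timed):

    ## benchmark_xcms.R - run on each instance/OS and compare the printed timings
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install(c("xcms", "faahKO"), update = FALSE, ask = FALSE)

    library(xcms)      # also attaches MSnbase and BiocParallel
    library(faahKO)

    ## Demo CDF files shipped with faahKO; swap in your own (larger) mzML files.
    files <- dir(system.file("cdf", package = "faahKO"),
                 recursive = TRUE, full.names = TRUE)
    raw <- readMSData(files, mode = "onDisk")

    ## Time the individual steps so results are comparable across machines.
    timing_peaks <- system.time(
        xdata <- findChromPeaks(raw, param = CentWaveParam(peakwidth = c(20, 80)))
    )
    print(timing_peaks)

    ## Record the environment alongside the timings.
    print(parallel::detectCores())
    print(sessionInfo())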

sneumann avatar Jan 12 '21 16:01 sneumann

Hi @cbroeckl , unfortunately I don't have any real data or benchmarks for all your questions. But I can provide some general remarks and suggestions (which are based mainly on experience with our data and on the implementation of the xcms package):

  1. How much more memory efficient is Linux than Windows for multi-core processing?

This is also answered in #492. Windows and socket-based parallel processing is very memory inefficient. I'm wondering whether it would not make more sense to set up a Linux server for xcms processing and have Windows running in a virtual machine on that server?
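
For context, this is how the two kinds of parallel back-ends are registered in BiocParallel (standard usage, not specific to this issue; the worker counts are arbitrary):

    library(BiocParallel)

    ## Windows: only socket-based clusters are available. Each worker is a separate
    ## R process, and the data it works on is serialized and copied over to it.
    register(SnowParam(workers = 4, progressbar = TRUE))

    ## Linux/macOS: forked workers share memory pages with the parent process
    ## until they modify them, which is usually much lighter on RAM.
    register(MulticoreParam(workers = 4, progressbar = TRUE))

    ## xcms picks up whatever back-end is currently registered, e.g. in
    ## findChromPeaks(raw, param = CentWaveParam()).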

  2. How much faster might processing be on Linux vs. Windows, assuming the same hardware?

Hm, I would not expect to see tremendous differences there (except for the parallel processing issues).

Re CPUs: I expect an i9 to be superior to the i5; that comes not only from the potentially higher clock frequency but also from the fact that, e.g., memory access is faster.

From my own experience: we have a cluster with 400 CPUs and 3 TB of memory. Still, I do not run my analyses with more than 10 CPUs in parallel (each having 16 GB RAM). The reason: all our data is on a network-attached file system, so every I/O operation goes over the network, and the I/O of our filer is not too good. This can be a bottleneck in xcms because the data is retrieved on-the-fly from the original mzML files. Thus, if you have 10 processes in parallel, each of them will try to read from the disk, and this can slow things down. Having fast I/O is also key.

With that in mind, I would suggest the following:

  • A CPU (ideally a Xeon or another server CPU; they are generally very powerful) with large L1 and/or L2 caches
  • Plenty of fast memory (> 64 GB if possible): once you run out of memory the OS will start swapping parts of memory to disk (to use more memory than is physically available). This will kill performance for sure!
  • Fast disks (SSDs rather than spinning SATA drives if possible), possibly in an efficient RAID configuration.
  • Linux as an OS with Windows in a virtual machine.

jorainer avatar Jan 13 '21 14:01 jorainer

Thanks @jorainer and @sneumann. Good tips! I will chew on this info for a while and start talking to our local IT support.

cbroeckl avatar Jan 13 '21 22:01 cbroeckl

I have accessed a Linux server - 48 cores, 256 GB RAM. I am doing some benchmarking using just peak detection (centWave) for now. With reference to the above: Windows quickly maxes out RAM, such that I can often process data faster with 2 cores than with 4 (on a 24 GB RAM computer). Oddly, the Windows Task Manager frequently does not show that all the RAM is in use, so I had always assumed that it was not RAM that was limiting.
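
A rough way to script that kind of comparison (my sketch; `raw` is assumed to be an on-disk object created from several mzML files with readMSData(..., mode = "onDisk"), and the worker counts are arbitrary):

    library(xcms)
    library(BiocParallel)

    ## Time centWave peak detection with an increasing number of workers.
    ## MulticoreParam() forks on Linux; on Windows SnowParam() would be used instead.
    for (n in c(2, 4, 8)) {
        register(MulticoreParam(workers = n))
        t <- system.time(findChromPeaks(raw, param = CentWaveParam()))
        message(n, " workers: ", round(t[["elapsed"]]), " s elapsed")
    }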

The Linux system with much more RAM has no problem processing 8 or 12 files at a time, and there is no I/O penalty up to 16 files in parallel. However, I am now suspicious that some of my RAM observations have to do with BiocParallel rather than with how much RAM the XCMS data actually occupies. I ran 16 files in parallel on the server once and it ran OK. However, when I tried to replicate this it has consistently failed, and it seems to be due to memory limitations again. Using the Linux 'htop' command I can see how many cores are in use and the memory usage: when I set the number of cores to 16, memory usage quickly climbs to right at 256 GB RAM, even proceeding for a few minutes before failing. If I use 8 cores, it sits right at 128 GB RAM. While it is certainly possible that my files just happen to generate almost exactly 16 GB of RAM usage per core, I suspect it is a setting somewhere in the software.

In this document, https://bioconductor.org/packages/devel/bioc/vignettes/BiocParallel/inst/doc/Introduction_To_BiocParallel.pdf, on pages 14-15, there is a template suggesting that SLURM is used to assign memory on a per-core basis:

    #SBATCH --mem-per-cpu=<%= resources$memory %>
    <%= if (!is.null(resources$partition)) sprintf(paste0("#SBATCH --partition='", resources$partition, "'")) %>
    <%= if (array.jobs) sprintf("#SBATCH --array=1-%i", nrow(jobs)) else "" %>

If I use the lines of code in this tutorial to replicate this on my Linux system I see the same output, so SLURM support in batchtools seems to be present at least (though, oddly, I cannot load the batchtools package - I have tried). My files are approximately 1.5 GB in size, so I likely do not NEED 16 GB per core, unless there is a great deal of internal data copying occurring.

The htop monitoring suggests that the number of cores being used is accurate - I ask for 8, I see 8 cores at 100% usage. It is the RAM that is limiting. At this point, I suspect that the RAM is limiting not because the data actually occupies 16 GB, but because BiocParallel is assigning 16 GB RAM per core, and any other operation tips it over the edge. That said, I do not know how to determine whether this is the case. Any advice? Is there a way to set the amount of RAM per core in BiocParallel?

cbroeckl avatar Apr 13 '21 21:04 cbroeckl

Some notes on evaluating memory consumption in R: I usually use peakRAM(<command to execute>) from the peakRAM package, which reports the maximum memory usage of a call.
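
For example (a sketch; `raw` is assumed to already hold the data of a single file read into R):

    install.packages("peakRAM")   # CRAN package, if not yet installed
    library(peakRAM)

    ## Returns a small data.frame with the elapsed time and the peak RAM
    ## used while evaluating the call.
    peakRAM(findChromPeaks(raw, param = CentWaveParam()))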

Regarding your files: it could well be that a 1.5 GB mzML file requires much more memory when loaded into RAM. Remember that within the mzML file the m/z and intensity values are compressed; if you load them they will use real memory. Another problem with memory in R is that R copies values. For some operations you might need twice the memory, because R copies values over instead of changing them in place (that's the main difference from code in C or C++). What this means for parallel processing: each task performs peak detection and produces a chromPeaks matrix as a result; then, when everything is calculated, these individual peak matrices need to be pasted together, and for that R needs twice the memory - once for the individual peak matrices and once for the combined matrix. Such internals can cause a larger memory demand than expected.
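
A tiny, generic R illustration of that copy-on-modify behaviour (nothing xcms-specific; the matrix size is arbitrary):

    m <- matrix(runif(5e6), ncol = 10)   # ~40 MB of doubles
    tracemem(m)                          # prints a message whenever R duplicates m
    m2 <- m                              # no copy yet: both names point to the same data
    m2[1, 1] <- 0                        # first modification triggers the duplication,
                                         # so ~80 MB are briefly needed for this matrix
    untracemem(m)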

AFAIK BiocParallel does not specifically set or request memory. If you use SLURM it's different, because with SLURM you can define the amount of memory you want to assign to each CPU or in total.
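
If jobs are submitted through SLURM (e.g. via BiocParallel's BatchtoolsParam and a template like the one quoted above), per-CPU memory can be requested roughly like this - a sketch with placeholder values; the resources field names must match the placeholders your template actually uses:

    library(BiocParallel)

    param <- BatchtoolsParam(
        workers   = 8,
        cluster   = "slurm",
        resources = list(memory   = 4096,   # picked up by e.g. #SBATCH --mem-per-cpu
                         walltime = 3600)   # units depend on the template
    )
    register(param)
    ## xcms then runs each per-file task as a SLURM job with these limits.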

jorainer avatar Apr 14 '21 09:04 jorainer

@jorainer - thank you for the description. I could not find any settings anywhere in BiocParallel for assigning memory, so I think you are correct. SLURM may, but I am not clear right now whether that is the source or not. I had forgotten that mzML is compressed on disk; that is worth noting. And while I was aware that R does perform some internal copying of data, I assumed that this happens once, but maybe it is more? I will also try to figure out how much memory the data occupies when it is read into memory, which presumably should be done in R using object.size()?
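
For that last point, base R's object.size() can be printed in more readable units, e.g. (assuming raw_data is the loaded object):

    ## In-memory footprint of the loaded object, reported in megabytes.
    print(object.size(raw_data), units = "Mb")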

cbroeckl avatar Apr 14 '21 14:04 cbroeckl

FYI - a single mzML file (from pWiz) with a saved file size of about 1.5 GB was read in with:

    raw_data <- readMSData(files = "myFile.mzML", mode = "inMemory", msLevel = c(1, 2))

MSnbase reports the object size:

    > raw_data2
    MSn experiment data ("MSnExp")
    Object size in memory: 2165.87 Mb
    - - - Spectra data - - -
      MS level(s): 1
      Number of spectra: 5435
      MSn retention times: 0:1 - 19:60 minutes
    - - - Processing information - - -
      Data loaded: Wed Apr 14 12:32:49 2021
      MSnbase version: 2.16.1
    - - - Meta data - - -
      ...

So while the raw data in its full form is about 1.5 GB on disk and 2.2 GB in memory, during processing I was using 16 GB per core. I am looking into the scheduler on the server and/or batchtools as potential sources of this.

To be clear, I can just run 12 cores and be fine, which is still a huge improvement. I am just trying to understand what I can get away with and what I can't, as well as trying to understand how to process data on a Linux server, which is new to me.

cbroeckl avatar Apr 14 '21 18:04 cbroeckl