# Optimisation of resources for the workflow

### Description of feature
Currently, quantms has seven major resource categories (labels) for its processes:
```groovy
withLabel:process_single {
    cpus   = { 1 }
    memory = { 6.GB * task.attempt }
    time   = { 4.h * task.attempt }
}
withLabel:process_low {
    cpus   = { 4 * task.attempt }
    memory = { 12.GB * task.attempt }
    time   = { 6.h * task.attempt }
}
withLabel:process_very_low {
    cpus   = { 2 * task.attempt }
    memory = { 4.GB * task.attempt }
    time   = { 3.h * task.attempt }
}
withLabel:process_medium {
    cpus   = { 8 * task.attempt }
    memory = { 36.GB * task.attempt }
    time   = { 8.h * task.attempt }
}
withLabel:process_high {
    cpus   = { 12 * task.attempt }
    memory = { 72.GB * task.attempt }
    time   = { 16.h * task.attempt }
}
withLabel:process_long {
    time = { 20.h * task.attempt }
}
withLabel:process_high_memory {
    memory = { 200.GB * task.attempt }
}
```
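For context, the `task.attempt` multipliers above only escalate when a task is actually resubmitted; in nf-core pipelines this is typically paired with a retry error strategy roughly like the sketch below (the exit codes and retry counts shown are illustrative, not necessarily quantms' exact settings):

```groovy
process {
    // Resubmit on typical out-of-memory / walltime exit codes so that the
    // task.attempt multipliers in the labels above take effect on the next try.
    errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : 'finish' }
    maxRetries    = 1
    maxErrors     = '-1'
}
```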
However, some of my recent analyses show that resource usage, for example for DIA analysis, could be optimized much further at the process level. See some results from my analyses below.
### Dataset: PXD030304
*CPU usage, memory usage, and IO usage plots (attached).*
Most of the processes use less than 50% of their allocated memory and CPU, which looks like a waste of resources.
My plan was always to use the results of your thousands of runs to learn a simple regression model for each step, based on file size and/or number of spectra. But I am not sure if you ever saved the execution logs.
I did save them for most of the runs. However, you don't really need a huge amount of data to learn simple things. Some easy conclusions:
samplesheet_check and sdrf_parsing are way over their memory requirements; they could easily run with 1 GB of memory, while we currently give them 6 GB.
Well, yes, but I wasn't talking about those easy cases. Of course you can add smaller labels for those.
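For completeness, such a smaller label could look roughly like the sketch below; the `process_tiny` name and the values are only illustrative, and the `withName` pattern is hypothetical and would need to match the real module names:

```groovy
// Illustrative lighter resource class for cheap bookkeeping steps
// such as samplesheet_check and sdrf_parsing.
withLabel:process_tiny {
    cpus   = { 1 }
    memory = { 1.GB * task.attempt }
    time   = { 1.h * task.attempt }
}

// Alternatively, override the specific steps directly in the config
// (process names here are assumed, not verified against the modules).
withName:'SAMPLESHEET_CHECK|SDRF_PARSING' {
    memory = { 1.GB * task.attempt }
}
```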
I think the other ones depend heavily on the mzML size, the number of MS and MS/MS spectra I guess, maybe even the type of instrument, or the file size.
That's why I said learning from your results.
All this information is available when starting a run.
This would be a unique and potentially publishable feature of the pipeline. There is still the retry functionality if the resources are not enough. But I assume there are some very informative features that would allow a very accurate prediction of resource usage.
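The model itself could start trivially simple: a one-variable least-squares fit per process already captures a linear trend between input size and peak memory. Below is a back-of-the-envelope Groovy sketch; the data points are placeholders purely to make it runnable, the real pairs would come from the saved execution traces (e.g. the `peak_rss` column of a Nextflow trace) together with the mzML file sizes:

```groovy
// Placeholder (mzML size in GB, peak memory in GB) pairs, NOT real measurements;
// real values would come from the saved execution traces and file sizes.
def sizesGb = [0.5d, 1.2d, 2.0d, 3.5d, 5.0d]
def peaksGb = [2.1d, 3.0d, 4.2d, 6.4d, 8.5d]

def n     = sizesGb.size()
def meanX = sizesGb.sum() / n
def meanY = peaksGb.sum() / n

// Ordinary least squares: slope and intercept of peak_memory ~ file_size
def slope     = (0..<n).sum { (sizesGb[it] - meanX) * (peaksGb[it] - meanY) } /
                (0..<n).sum { (sizesGb[it] - meanX) * (sizesGb[it] - meanX) }
def intercept = meanY - slope * meanX

// Add a safety margin so the request covers most runs rather than the mean fit
def predictGb = { double sizeGb -> 1.2 * (intercept + slope * sizeGb) }

println "memory(GB) ~= ${intercept.round(2)} + ${slope.round(2)} * size(GB)"
println "prediction for a 4 GB mzML: ${predictGb(4.0).round(2)} GB"
```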
Yes, the idea is to optimize each process of the pipeline for 80% of the runs; if the other 20% fail, they can go to the next retry. Before doing the research, we have to think about whether information from inside the files (number of MS and MS/MS spectra) is needed, because if the model needs that information, we would have to block all processes until mzml_statistics finishes.
Yes that is true.
I would argue that in a first iteration we look only at simple variables: file size, instrument, experiment type (DIA vs. DDA), and search parameters (database, modifications, etc.).
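Such a first iteration could plug straight into Nextflow's dynamic directives, since a process can compute its resource requests from its own inputs. A minimal module-level sketch follows; the process name, the linear formula, and the coefficients are all assumptions standing in for whatever the fitted model produces:

```groovy
// Hypothetical module sketch: request memory as a linear function of the mzML size,
// keeping task.attempt and retries as the safety net for underestimated runs.
process SEARCH_ENGINE_EXAMPLE {
    // placeholder coefficients: ~2 GB base + ~1.5 GB per GB of input mzML
    memory { 1.GB * (Math.ceil(2 + 1.5 * mzml.size() / 1e9) as long) * task.attempt }
    cpus   4
    time   { 4.h * task.attempt }
    errorStrategy 'retry'
    maxRetries 2

    input:
    path mzml

    script:
    """
    echo "would run the search on ${mzml}"
    """
}
```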
This is the DDA-TMT dataset PXD010557:
*Memory usage, CPU usage, and IO usage plots (attached).*
I think we cannot predict CPU usage: we need to know from the implementation whether it benefits from multiple cores. Depending on the implementation, multiple cores can also mean a bit more RAM, because more data is loaded at the same time or copies are made for thread safety.
You can also subsample and average statistics from some files to get a much better idea.