
Optimisation of resources for the workflow

Open • ypriverol opened this issue 1 year ago • 13 comments

Description of feature

Currently, quantms has seven major resource categories for processes:

```groovy
withLabel:process_single {
    cpus   = { 1                   }
    memory = { 6.GB * task.attempt }
    time   = { 4.h  * task.attempt }
}
withLabel:process_low {
    cpus   = { 4     * task.attempt }
    memory = { 12.GB * task.attempt }
    time   = { 6.h   * task.attempt }
}
withLabel:process_very_low {
    cpus   = { 2     * task.attempt }
    memory = { 4.GB  * task.attempt }
    time   = { 3.h   * task.attempt }
}
withLabel:process_medium {
    cpus   = { 8     * task.attempt }
    memory = { 36.GB * task.attempt }
    time   = { 8.h   * task.attempt }
}
withLabel:process_high {
    cpus   = { 12    * task.attempt }
    memory = { 72.GB * task.attempt }
    time   = { 16.h  * task.attempt }
}
withLabel:process_long {
    time   = { 20.h  * task.attempt }
}
withLabel:process_high_memory {
    memory = { 200.GB * task.attempt }
}
```

However, some of my current analyses show that resource usage, for example for DIA analysis, could be optimized much further at the process level. See some results from my analyses below.

### Dataset: PXD030304

CPU Usage: *(screenshot)*

Memory Usage: *(screenshot)*

IO Usage: *(screenshot)*

Most of the processes are under 50% usage of memory and CPU, which looks like a waste of resources?

ypriverol, Dec 05 '24 09:12

My plan was always to use the results of your thousands of runs to learn a simple regression model for each step, based on file size and/or number of spectra. But I am not sure if you ever saved the execution logs.

jpfeuffer, Dec 05 '24 09:12
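As an aside, Nextflow can already record the per-task metrics such a model would be trained on; a minimal sketch of the trace settings (field list illustrative, not quantms' shipped config):

```groovy
// nextflow.config: write a per-task trace file with the metrics a
// resource model could be fitted on (peak RSS, runtime, I/O, ...).
trace {
    enabled = true
    fields  = 'process,tag,cpus,memory,peak_rss,peak_vmem,realtime,rchar,wchar'
}
```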

I did it for most of the runs. However, you don't really need huge data to be able to learn simple things. Some easy conclusions:

  • samplesheet_check and sdrf_parsing are way, way over their memory requirements: they could easily go down to 1 GB of memory, while currently we give them 6 GB (a sketch of such an override follows below).

ypriverol, Dec 05 '24 09:12
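A minimal sketch of what a smaller allocation could look like, with a hypothetical label name and an assumed process name, not the pipeline's current config:

```groovy
// Hypothetical lighter tier for trivial bookkeeping steps.
withLabel:process_tiny {
    cpus   = { 1                   }
    memory = { 1.GB * task.attempt }
    time   = { 1.h  * task.attempt }
}
// Or override a single process by name (name assumed for illustration):
withName:'SAMPLESHEET_CHECK' {
    memory = { 1.GB * task.attempt }
}
```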

Well, yes, but I wasn't talking about those easy things. Of course you can add smaller labels for those.

jpfeuffer, Dec 05 '24 09:12

I think the other ones depend heavily on the mzML size, the number of MS and MS/MS spectra I guess, even the type of instrument, or file size.

ypriverol, Dec 05 '24 09:12

That's why I said learning from your results.

jpfeuffer, Dec 05 '24 09:12

All this information is available when starting a run.

jpfeuffer, Dec 05 '24 09:12

It would be a unique and potentially publishable feature of the pipeline. There is still the retry functionality if the resources were not enough. But I assume there should be some very informative features that allow for a very accurate prediction of resource usage.

jpfeuffer, Dec 05 '24 09:12
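For context, the retry functionality mentioned here is Nextflow's standard error strategy; a minimal sketch along nf-core lines (exit-status range illustrative):

```groovy
// On a resource-related failure, retry; each retry re-evaluates the
// resource closures above with an incremented task.attempt.
process {
    errorStrategy = { task.exitStatus in (130..145) ? 'retry' : 'finish' }
    maxRetries    = 2
}
```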

Yes, the idea is to optimize the pipeline for each process for 80% of the runs; if the other 20% fail, they can go to the next retry. Before doing the research, we have to think about whether the model needs the information inside the files (MS and MS/MS counts), because if the model needs that information, we will have to block all processes until mzml_statistics finishes (see the sketch below).

ypriverol, Dec 05 '24 09:12
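To make the blocking point concrete, a DSL2 sketch with hypothetical process and channel names: if the search step consumes the statistics as an input, it cannot start until the statistics step has finished for that file.

```groovy
// Workflow sketch: pairing each mzML with its statistics file creates a
// hard dependency, serializing the search behind the statistics step.
workflow {
    mzml_ch  = Channel.fromPath(params.mzmls)
    stats_ch = MZML_STATISTICS(mzml_ch)
    SEARCH_ENGINE(
        mzml_ch.map { f -> [f.baseName, f] }
               .join(stats_ch.map { s -> [s.baseName, s] })
    )
}
```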

Yes, that is true.

jpfeuffer, Dec 05 '24 09:12

I would argue that in the first iteration we look at simple variables: file size, instrument, experiment type (DIA vs. DDA), and search parameters (database, mods, etc.).

ypriverol, Dec 05 '24 09:12
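For illustration, once such coefficients are learned, Nextflow's dynamic directives can consume these simple variables directly; a sketch with placeholder numbers and a hypothetical process name, not fitted values:

```groovy
// Hypothetical linear model: 4 GB base plus 2 GB of RAM per GB of input,
// grown on each retry; mzml.size() returns the file size in bytes.
process SEARCH_ENGINE {
    input:
    path mzml

    memory { (4.GB + 2.GB * (mzml.size() / 1e9)) * task.attempt }

    script:
    """
    echo "searching $mzml"
    """
}
```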

This is DDA-TMT dataset PXD010557:

Memory Usage: *(screenshot)*

CPU Usage: *(screenshot)*

IO Usage: *(screenshot)*

ypriverol, Dec 05 '24 09:12

I think we cannot predict CPU usage. We need to know from the implementation whether it benefits from multiple cores. Depending on the implementation, multiple cores can also mean a bit more RAM, because more data is loaded at the same time or copies are made for thread safety.

jpfeuffer, Dec 05 '24 09:12

You can also subsample and average statistics from some files to get a much better idea.

timosachsenberg, Dec 05 '24 12:12