
Pass customized parameters to the KRAKEN2 module

Open lam-c opened this issue 1 year ago • 3 comments

Description of feature

Thank you for bringing us this amazing tool, which really boosts the efficiency of running metagenome pipelines! I've been using it for quite a long time and can hardly do without it in my daily work. I would appreciate it if you could take some time to help with my questions below:

  • First, it would be more convenient, and make it easier to finish the pipeline successfully, if users were allowed to customize parameters in the kraken2 module (for example, kraken2's recently updated nt database runs out of memory without the --memory-mapping parameter)
  • Second, I would like some tips on specifying resource allocation, since tasks (Bowtie-related) often fail because they run out of memory or don't finish within the time limit. --max_time and --max_memory don't seem to work
  • Third, is there any possibility of starting from a middle step (e.g. binning refinement)? Suppose I already have processed bins from in-house scripts instead of mag; can I pass them to mag and get the downstream results?

lam-c avatar Jun 12 '23 02:06 lam-c

Hi!

  1. You can customise kraken2's parameters by modifying ext.args with a custom config, e.g.:

custom.config

process {
    withName: KRAKEN2 {
        ext.args = '--quiet --memory-mapping'
    }
}

You can then specify -c custom.config when calling the pipeline.
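Putting that together, a full invocation might look something like the sketch below. The samplesheet, output directory, and database path are placeholders here, and --kraken2_db is shown only as an illustrative pipeline parameter; substitute the options you actually run with:

```
nextflow run nf-core/mag \
    -profile docker \
    --input samplesheet.csv \
    --outdir results \
    --kraken2_db /path/to/kraken2_db \
    -c custom.config
```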

However, you might also be interested in the taxprofiler pipeline (https://nf-co.re/taxprofiler), which allows more flexibility in this regard if you're wanting to do read profiling.

  2. You can adjust the resources available to different processes using a custom config file as well. See https://nf-co.re/docs/usage/configuration#tuning-workflow-resources for more details. Note that using the check_max() function is a bit finicky, as you need to copy it into the custom config file in order to use it, so if you know the maximum resources you have available a priori, you might be better off ignoring it and just specifying a resource level that works. Exact numbers are hard to recommend, as they depend on the size of your data and the power of your computer.
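As a minimal sketch of what that looks like, assuming for illustration that it is a Bowtie2 alignment step running out of resources (the process name below is hypothetical; check your pipeline logs for the fully qualified name of the failing process, and pick limits that fit your machine):

```groovy
process {
    withName: 'BOWTIE2_ALIGN' {
        memory = 64.GB   // raise to fit your dataset
        time   = 48.h    // allow a longer wall time
        cpus   = 8
    }
}
```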

  2. We recently added an option to supply a CSV of assemblies to the pipeline, skipping the assembly stage and moving straight to binning (https://github.com/nf-core/mag/pull/439). This should be available in the next release, or you can use the dev version to run it now. I don't think anyone's suggested an entry point at the post-binning stage, but I don't think this would be too difficult to implement, if you wanted to give it a shot!
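For reference, the assembly samplesheet is a plain CSV along these lines; the exact column names are best checked against the PR and the dev documentation, so treat this as illustrative only:

```csv
id,group,assembler,fasta
sample1,0,SPAdes,/path/to/sample1.contigs.fa.gz
sample2,0,MEGAHIT,/path/to/sample2.contigs.fa.gz
```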

prototaxites avatar Jun 12 '23 13:06 prototaxites

Note for 2.: the max_* parameters don't increase anything; they just set the maximum threshold a step of the pipeline can request (to prevent jobs being killed, e.g. on a cluster). This is because nf-core pipelines have retry functionality: if a process is detected to have run out of memory, it will be resubmitted with double the memory/CPUs etc., so the max_* parameters stop that doubling from going on forever 😬

The link @prototaxites posted describes the actual way to increase/decrease the resources requested by a step 👍

jfy133 avatar Jun 12 '23 13:06 jfy133

Thank you all for your patience and elaboration; that really helps a lot. The entry point at the binning stage (#439) also fits my needs.

lam-c avatar Jun 13 '23 05:06 lam-c