`Sourmash` profiler [DO NOT MERGE YET]

Open vmikk opened this issue 1 year ago • 1 comments

This PR adds sourmash as an additional profiler. Related to https://github.com/nf-core/taxprofiler/issues/112

NOTES:

  1. Database. Sourmash supports several types of databases. For taxprofiler, I propose using a single ZIP file containing the signatures. The database also requires a CSV file with taxonomic information (a gzip-compressed CSV is also supported), so the tar file with the database should contain two files:
test-db-sourmash
├── sourmash-db.zip
└── lineages.csv.gz

For now, the file names are hardcoded.
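
A quick way to sanity-check a database archive against this layout, as a minimal sketch only. The helper name `missing_db_files` is hypothetical and not part of the PR; the two file names are the hardcoded ones listed above, and the top-level directory name is assumed to be free.

```python
import tarfile
from pathlib import PurePosixPath

# File names as currently hardcoded in this PR (see the layout above).
EXPECTED_FILES = {"sourmash-db.zip", "lineages.csv.gz"}


def missing_db_files(tar_path):
    """Return the expected file names that are absent from the archive.

    Hypothetical helper for illustration; not part of the pipeline.
    """
    with tarfile.open(tar_path) as tar:
        # Compare base names only, so the top-level directory can have any name.
        present = {
            PurePosixPath(member.name).name
            for member in tar.getmembers()
            if member.isfile()
        }
    return EXPECTED_FILES - present
```

An empty set means both files were found; anything else names what is missing from the archive.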

  2. As a first step, sourmash creates FracMinHash sketches (signatures) for each sample. This step is independent of the database, so sketching only needs to be done once per sample. Therefore, I removed the database from the input channel (ch_input_for_profiling.sourmash.map). Otherwise, each sample would be sketched separately for every database provided and we would end up with many duplicated sketches, correct?

  3. Sourmash can create 4 types of signatures: DNA, protein, protein translated from DNA, and signatures based on a CSV file listing the locations of genomes/proteomes. The sourmash/sketch module is written to support all of these input types, so extra args must be passed to the process. The easiest way is to specify them in the config, e.g.:

process {
    withName: SOURMASH_SKETCH {
        ext.args = "dna --param-string 'k=31,scaled=1000,noabund'"
    }
}

Here, the first word in ext.args should be dna, protein, translate, or fromfile.
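
To illustrate the constraint on ext.args, here is a small sketch of a check on the first word. It is not code from this PR: the function name `sketch_mode` is hypothetical, and the list of accepted modes is taken from the sentence above.

```python
import shlex

# Accepted first words of ext.args, as listed in the PR description.
SKETCH_MODES = ("dna", "protein", "translate", "fromfile")


def sketch_mode(ext_args):
    """Return the sketch mode named by the first word of ext_args.

    Hypothetical helper for illustration; raises ValueError if the first
    word is missing or is not one of SKETCH_MODES.
    """
    # shlex handles the quoted --param-string value like a shell would.
    tokens = shlex.split(ext_args)
    if not tokens or tokens[0] not in SKETCH_MODES:
        raise ValueError(
            "ext.args must start with one of: " + ", ".join(SKETCH_MODES)
        )
    return tokens[0]
```

With the config shown above, `sketch_mode("dna --param-string 'k=31,scaled=1000,noabund'")` returns `"dna"`, while args that start directly with `--param-string` are rejected.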

TO DO list

  • [ ] upload test dataset to nf-core/test-datasets#taxprofiler branch
  • [ ] ...

PR checklist

  • [x] This comment contains a description of changes (with reason).
  • [ ] If you've fixed a bug or added code that should be tested, add tests!
  • [x] If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • [ ] If necessary, also make a PR on the nf-core/taxprofiler branch on the nf-core/test-datasets repository.
  • [ ] Make sure your code lints (nf-core lint).
  • [ ] Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • [ ] Usage Documentation in docs/usage.md is updated.
  • [ ] Output Documentation in docs/output.md is updated.
  • [ ] CHANGELOG.md is updated.
  • [ ] README.md is updated (including new tool citations and authors/contributors).

vmikk, Oct 17 '23 16:10

nf-core lint overall result: Failed :x:

Posted for pipeline commit 58c4207

+| ✅ 157 tests passed       |+
!| ❗   3 tests had warnings |!
-| ❌   1 tests failed       |-

:x: Test failures:

  • schema_params - Param run_sourmash from nextflow config not found in nextflow_schema.json

:heavy_exclamation_mark: Test warnings:

  • nextflow_config - Config manifest.version should end in dev: 1.1.2
  • pipeline_todos - TODO string in main.nf: Remove this line if you don't need a FASTA file
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline

:white_check_mark: Tests passed:

Run details

  • nf-core/tools version 2.10
  • Run at 2023-10-27 12:24:37

github-actions[bot], Oct 20 '23 16:10