Reviews
Please post below to request a review.
TODO
- [ ] some example workflow
I wrote an example RNA-seq pipeline for generating gene expression counts here: https://github.com/jrderuiter/snakemake-rnaseq-star-featurecounts. Would be interested to hear your thoughts/comments concerning the structure/code. I've used wrappers where available, though wrappers for the other tools should be easy to implement.
This looks very good! Besides some very minor things, it is exactly what a perfect snakemake workflow should look like. Do you intend to move this over here?
- In at least one file, there is a superfluous path.join.
- We should indeed aim to have a wrapper for all common tools.
- Rules that have a wrapper don't need to be included in the conda environment. When using snakemake with --use-conda, wrappers automatically get their own env. For rules that are too special for a wrapper, you can use the conda directive. See here.
- I am not sure yet what the canonical way to configure the fastq files should be. If the paths to the fastq files were stored in samples.tsv, they could have non-standard names or live anywhere in the system. The upside of requiring a fixed relative path like you have it is that everything could automatically be made sustainable and portable via snakemake --archive and zenodo, see here. However, for publication, one might rather want to point to fastqs via a URL to files in SRA. It would be nice to support a simple transition between local and SRA, e.g. via the config file (a sketch of one possible approach follows below).
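As a rough illustration of that last point, here is a minimal sketch of such a transition. All file names, column names, and config keys here are hypothetical: an input function returns either a local path recorded in samples.tsv or a path derived from an SRA accession, depending on a config switch.

```python
# Hypothetical sketch: switch between local fastqs and SRA-derived fastqs
# via a config key. Column names and paths are illustrative only.
import pandas as pd

samples = pd.read_csv("samples.tsv", sep="\t").set_index("sample")

def get_fastq(wildcards):
    if config.get("fastq_source", "local") == "sra":
        # a separate rule would fetch this accession, e.g. with sra-tools
        return "sra/{}.fastq".format(samples.loc[wildcards.sample, "accession"])
    # otherwise use the path recorded in samples.tsv
    return samples.loc[wildcards.sample, "fastq"]

rule trim:
    input:
        get_fastq
    output:
        "trimmed/{sample}.fastq"
    shell:
        "cutadapt -o {output} {input}"
```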
Thanks! Sure, moving here would be fine. I'll have a look at your comments (especially point 4; I agree it would be ideal if this also accepted URLs and such).
One other thing I was thinking about is the flexibility of the output paths. For most of my projects, I use the cookiecutter data science template (https://github.com/drivendata/cookiecutter-data-science) as a starting point, which I think provides a nice default project structure. One nice thing about this template is that it splits interim and 'final' processed files into two separate directories (data/interim and data/processed). I like this a lot, because it distinguishes interim steps (whose data can be discarded) from the final processed outputs.
In my own pipelines, I therefore typically distinguish between the interim and processed directories (on which I base all rule paths, and which are configurable via the config file/command line). For example, my snakemake calls typically look something like this:
```bash
snakemake -s pipelines/exome.snake \
    --configfile ./configs/exome.yml \
    --config samples=data/raw/samples.exome.txt \
             interim_dir=data/interim/exome \
             processed_dir=data/processed/exome \
             qc_dir=qc/exome \
             log_dir=logs/exome
```
I think it would be great if we could include some flexibility in the output paths, but I am unsure of the best way to do so. To some extent this can be done at runtime by changing the workdir for snakemake, but that does not allow us to distinguish between interim and processed output directories. I also don't want to introduce too much extra complexity into the pipelines.
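To make the idea concrete, here is a minimal sketch of basing rule paths on config-supplied directories that can be overridden via --config. The directory keys and the rule itself are hypothetical.

```python
# Hypothetical sketch: derive rule paths from configurable directories,
# with defaults that can be overridden on the command line via --config.
interim_dir = config.get("interim_dir", "data/interim")
processed_dir = config.get("processed_dir", "data/processed")

rule count:
    input:
        interim_dir + "/{sample}.bam"
    output:
        processed_dir + "/{sample}.counts.tsv"
    shell:
        "featureCounts -a annotation.gtf -o {output} {input}"
```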
What are your thoughts on this?
Maybe a stupid question, but how are software versions managed with wrappers and conda? Are versions of tools hardcoded in the wrapper, or can I specify which version of the tool should be used?
By default, the wrapper uses a hardcoded conda environment (see here). This ensures stability. If this needs to be overridden, the user can specify a custom conda environment within the rule that uses the wrapper; that environment will always take precedence over the one that comes with the wrapper.
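For illustration, a minimal sketch of such an override. The rule, file names, environment path, and wrapper version are all hypothetical.

```python
# Hypothetical sketch: the conda directive given in the rule takes
# precedence over the environment shipped with the wrapper.
rule samtools_sort:
    input:
        "mapped/{sample}.bam"
    output:
        "sorted/{sample}.bam"
    conda:
        "envs/samtools.yaml"  # overrides the wrapper's hardcoded environment
    wrapper:
        "0.34.0/bio/samtools/sort"  # illustrative wrapper version/path
```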
Regarding the directories:
While I also like the clean distinction, I think the data science approach of separate interim and final directories is a bit too inflexible for Snakemake. With support for temp files, you don't really need to distinguish output files like that in Snakemake: interim files can simply be declared as temporary. Also, users of our workflows might have different ideas of what counts as interim. Sometimes a bam is needed for further processing or is even a result of its own; sometimes it can be discarded. Since the idea is that people check out the repo and possibly modify the workflow, we should probably use neutral directory names that describe the workflow step rather than what kind of usage the file will have.
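For example, a minimal sketch (names are hypothetical) of declaring an interim file as temporary:

```python
# Hypothetical sketch: a temp() output is deleted once all consuming
# rules have finished, so no separate interim directory is needed.
rule sort:
    input:
        "mapped/{sample}.bam"
    output:
        temp("sorted/{sample}.bam")
    shell:
        "samtools sort -o {output} {input}"
```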
@jrderuiter is your workflow ready for moving here? It does not need to be perfect, we can mark it as WIP.
I've been developing a workflow for metagenomic projects located here: https://github.com/NBISweden/nbis-meta. It supports paired- and single-end data and includes preprocessing, read-based classification, assembly and binning as well as functional and taxonomic annotations. Any input is appreciated. Thanks!
Hello, I would like to contribute my workflow to the snakemake-workflows. Could you add me to the team?
Hey @johnne and @SilasK, sorry for being so late about this. I completely missed the notifications. Very nice pipelines! Both of them could be included here relatively quickly, I think.
@johnne:
- I'd vote for writing down file paths without the `opj` helper. It limits readability compared to plain strings.
@SilasK:
- I'd suggest moving all conda packages from dependencies.yml into rule-specific environments. This improves provenance information in the reports and makes it easier to discover which software is used in which step.
- Please consider annotating final results with the `report()` flag, such that a self-contained report can be created from your pipeline (see the sketch below).
- Please move testing from Travis to GitHub Actions. This makes it more consistent with the other workflows in here, and also more future-proof.
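Here is a minimal sketch of the `report()` annotation (rule, file names, and category are hypothetical); the self-contained report is then created with `snakemake --report report.html`:

```python
# Hypothetical sketch: mark a final result for inclusion in the
# self-contained HTML report.
rule plot_expression:
    input:
        "counts/all.tsv"
    output:
        report(
            "plots/expression.svg",
            caption="report/expression.rst",  # description shown in the report
            category="Expression",
        )
    script:
        "scripts/plot-expression.py"
```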
Both:
- If you haven't yet, please run `snakemake --lint` on your pipelines.
Thanks @johanneskoester!
Yes the opj code is an old leftover that I've been putting off removing for a long time, but I completely agree. Will get to work on that.
I've run linting on the workflow, which did help a lot (thanks for that functionality!), but I'm afraid the workflow doesn't conform 100% to the linting. Mostly it's missing log directives for rules using the script directive. Will work on updating that.
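For reference, a minimal sketch (names are hypothetical) of adding a log directive to a script rule; inside the script, the path is then available as snakemake.log[0]:

```python
# Hypothetical sketch: a log directive on a rule that uses script:.
rule summarize:
    input:
        "results/{sample}.tsv"
    output:
        "summary/{sample}.txt"
    log:
        "logs/summarize/{sample}.log"
    script:
        "scripts/summarize.py"
```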
Hey! I've tried my hand at setting up a DeepVariant-GLnexus workflow. Would love to have it as part of this repo. It has unit and integration tests. Genome fetching, mapping, etc. are done similarly to the dna-seq-gatk-variant-calling workflow. I'm planning on moving this to the copier template system as well.
With the latest update to the GATK workflow - fixing paths etc - this workflow could in turn probably also be significantly refactored to draw more on the GATK one.
Hi @johanneskoester, I would like to add our pipeline to the snakemake catalog. I am not entirely certain that it meets your criteria, though. For example, the Snakefile is not directly under the workflow folder, because the repository provides more than a single pipeline; it is a collection of several pipelines. Do you have any feedback on how to get it added there?