enhancement: add parameter for fastp to make writing of trimmed fastqs optional
Closes #2700. with this PR, by defining the input channel save_trimmed_pass a user can define whether trimmed fastqs are written/saved, or not. This works for all three modes: interleaved, single_end, paired_end. This will help to save on storage when trimming is not necessarily desired but the qc provided by fastp is.
PR checklist
- [x] This comment contains a description of changes (with reason).
- [x] If you've fixed a bug or added code that should be tested, add tests!
- [x] If you've added a new tool - have you followed the module conventions in the contribution docs
- [x] If necessary, include test data in your PR.
- [x] Remove all TODO statements.
- [x] Emit the
versions.ymlfile. - [x] Follow the naming conventions.
- [x] Follow the parameters requirements.
- [x] Follow the input/output options guidelines.
- [x] Add a resource
label - [x] Use BioConda and BioContainers if possible to fulfil software requirements.
- Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
- [x]
PROFILE=docker pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware - [x]
PROFILE=singularity pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware - [x]
PROFILE=conda pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware
- [x]
@maxulysse just wondering if there's any update for moving forward with this PR. I have addressed or responded to each of your review comments, is there anything else i need to do for this PR to be ready?
sorry, maybe I am not fully understanding the issue. so in the case you are describing no output will be written, just the json and log files?
@FriederikeHanssen sorry i did not see your reply until now. yes, json/html and log files for QC purposes. this would be for those who want the QC of fastp but who want to avoid saving large intermediate files because they want to skip actually trimming their reads. this is a mode that fastp supports and mentions in their documentation: https://github.com/OpenGene/fastp#input-and-output
@Emiller88 so do we want to merge, and rewrite all tests in nf-test?
This seems very broken with a lot merge conflict markers... It would need cleaning up ...
Can I get some eyes on this now. I rewrote the tests, as I think they had been written at the start of our nf-test use. I would like to test that the optional channels are indeed empty, but couldn't seem to manage that correctly at the moment. I don't think we need to check the contents of the other fastq files, and they are currently not working for me (commented out lines in the tests)
I guess this is ok but I find that default behaviour is sub optimal.
The primary purpose of fastp is to do the read clean up, so by default I would expect to get those resulting reads.
I think just generating the QC and not using the FASTQ files is a rather odd edge case (imo) and would rather have that behaviour opt in....
I guess this is ok but I find that default behaviour is sub optimal.
The primary purpose of fastp is to do the read clean up, so by default I would expect to get those resulting reads.
I think just generating the QC and not using the FASTQ files is a rather odd edge case (imo) and would rather have that behaviour opt in....
I guess you mean that if you leave the channel empty then it will default to false? But the developer has to look at what the channels are to be able to plug the module in.
I'm running the demultiplex pipeline at the moment, and would like to get the QC report from fastp without the actual read trimming, because I'm going to pass them into a proprietary pipeline that is going to trim it anyway. Fastq
I guess this is ok but I find that default behaviour is sub optimal.
The primary purpose of fastp is to do the read clean up, so by default I would expect to get those resulting reads.
I think just generating the QC and not using the FASTQ files is a rather odd edge case (imo) and would rather have that behaviour opt in....
I guess you mean that if you leave the channel empty then it will default to false? But the developer has to look at what the channels are to be able to plug the module in.
I'm running the demultiplex pipeline at the moment, and would like to get the QC report from fastp without the actual read trimming, because I'm going to pass them into a proprietary pipeline that is going to trim it anyway. Fastq
Actually no, more switch it to more like dontsave_trimmed. Agree empty channel is not a nice system, better to be an 'active' choice
Yeah I understand the justification, but I maintain most people would not be using a proprietary or alternative trimmer. The core Fastp purpose is to do the read clean up, it's not just a QC reporter tool like fastqc
I would go for a negative argument (skip or dontsave) if it wasn't for the fact the other two channels are positive (save).
I don't see a large burden here on the developer, if they are passing reads in with four other empty channnels without looking at the module code/meta then they'll quickly discover it isn't making any reads.
Hmmm... I dunno. I'm still not convinced. Maybe you can ask for a vote on #team-maintainers
What about changing the parameter names to write_ instead of save_?
What about discard_fastq?
Following discussion we are going to change from save_trimmed_pass to discard_trimmed_pass. Then running with false, false, false will provide expected behaviour.