mag
Optional skipping of short-read input to Filtlong for large datasets
Description of the bug
Hi all - long-time listener, first-time caller:
I have a rather large set of Illumina data along with some Nanopore reads, on which I was trying to run the hybrid assembly option. After 10+ hours, Filtlong was still processing the Nanopore reads. I did some digging: the current command passes the short-read data to Filtlong as an external reference (via -1/-2). That is fine for small-ish datasets but seems impractical for larger ones.
Once I edited the filtlong.nf code to no longer use the short reads, the FILTLONG process took less than 5 minutes and the pipeline proceeded as expected. Maybe there could be a flag to turn that feature on/off?
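For illustration, the difference boils down to whether Filtlong gets the Illumina reads as an external reference (file names and thresholds below are placeholders; the real values come from the params.longreads_* options):

# Current behaviour: the Illumina reads are passed as an external reference, so
# long-read quality is scored by k-mer matches against them, and runtime grows
# with the size of the short-read set.
filtlong -1 illumina_R1.fastq.gz -2 illumina_R2.fastq.gz \
    --min_length 1000 --keep_percent 90 --trim --length_weight 10 \
    nanopore.fastq.gz | gzip > nanopore_lr_filtlong.fastq.gz

# Without the reference, reads are scored on length and their own quality
# scores only, which is much faster for large datasets.
filtlong --min_length 1000 --keep_percent 90 --length_weight 10 \
    nanopore.fastq.gz | gzip > nanopore_lr_filtlong.fastq.gz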
filtlong.nf:
process FILTLONG {
    tag "$meta.id"

    conda "bioconda::filtlong=0.2.0"
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/filtlong:0.2.0--he513fc3_3' :
        'biocontainers/filtlong:0.2.0--he513fc3_3' }"

    input:
    tuple val(meta), path(long_reads), path(short_reads_1), path(short_reads_2)

    output:
    tuple val(meta), path("${meta.id}_lr_filtlong.fastq.gz"), emit: reads
    path "versions.yml"                                      , emit: versions

    script:
    """
    filtlong \
        -1 ${short_reads_1} \
        -2 ${short_reads_2} \
        --min_length ${params.longreads_min_length} \
        --keep_percent ${params.longreads_keep_percent} \
        --trim \
        --length_weight ${params.longreads_length_weight} \
        ${long_reads} | gzip > ${meta.id}_lr_filtlong.fastq.gz

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        filtlong: \$(filtlong --version | sed -e "s/Filtlong v//g")
    END_VERSIONS
    """
}
Edited working solution:
process FILTLONG {
    tag "$meta.id"

    conda "bioconda::filtlong=0.2.0"
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/filtlong:0.2.0--he513fc3_3' :
        'biocontainers/filtlong:0.2.0--he513fc3_3' }"

    input:
    tuple val(meta), path(long_reads), path(short_reads_1), path(short_reads_2)

    output:
    tuple val(meta), path("${meta.id}_lr_filtlong.fastq.gz"), emit: reads
    path "versions.yml"                                      , emit: versions

    script:
    """
    filtlong \
        --min_length ${params.longreads_min_length} \
        --keep_percent ${params.longreads_keep_percent} \
        --length_weight ${params.longreads_length_weight} \
        ${long_reads} | gzip > ${meta.id}_lr_filtlong.fastq.gz

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        filtlong: \$(filtlong --version | sed -e "s/Filtlong v//g")
    END_VERSIONS
    """
}
Command used and terminal output
No response
Relevant files
No response
System information
No response
Hi @ddomman !
Thanks for this! It's great that you already have a solution :)
Within the module we could make it optional by only inserting short_reads_1/2 into the command if they are supplied, something along the lines of:
process FILTLONG {
    tag "$meta.id"

    conda "bioconda::filtlong=0.2.0"
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/filtlong:0.2.0--he513fc3_3' :
        'biocontainers/filtlong:0.2.0--he513fc3_3' }"

    input:
    tuple val(meta), path(long_reads), path(short_reads_1), path(short_reads_2)

    output:
    tuple val(meta), path("${meta.id}_lr_filtlong.fastq.gz"), emit: reads
    path "versions.yml"                                      , emit: versions

    script:
    // only add the external-reference options when short reads were supplied
    def sr_command = short_reads_1 ? "-1 ${short_reads_1} -2 ${short_reads_2}" : ""
    """
    filtlong \
        ${sr_command} \
        --min_length ${params.longreads_min_length} \
        --keep_percent ${params.longreads_keep_percent} \
        --length_weight ${params.longreads_length_weight} \
        ${long_reads} | gzip > ${meta.id}_lr_filtlong.fastq.gz

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        filtlong: \$(filtlong --version | sed -e "s/Filtlong v//g")
    END_VERSIONS
    """
}
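On the pipeline side, a user-facing flag could then simply control what gets fed into the module. A minimal sketch of that wiring, where the parameter name (params.filtlong_skip_short_reads) and the input channel are hypothetical and only illustrate the idea:

// Hypothetical flag, e.g. in nextflow.config: params.filtlong_skip_short_reads = false
// Channel elements are assumed to be [ meta, long_reads, short_reads_1, short_reads_2 ].
ch_filtlong_input = ch_reads.map { meta, long_reads, sr1, sr2 ->
    params.filtlong_skip_short_reads ? [ meta, long_reads, [], [] ] : [ meta, long_reads, sr1, sr2 ]
}

FILTLONG ( ch_filtlong_input )

Because an empty list is falsy in Groovy, passing [] for the short-read path inputs makes sr_command evaluate to an empty string, so Filtlong runs without the external reference.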
If you want to contribute to the module and pipeline, you can make these changes via a PR to nf-core/modules, and we can then update the pipeline (with credit to you!) - the contribution would be gratefully received :)
Note that @muabnezor is currently in the process of overhauling the long-read/nanopore preprocessing tools anyway. We just merged porechop_abi into the dev branch as a faster replacement for Porechop, and next we plan to add nanoq as an alternative to Filtlong. So if you prefer, you could wait for that instead.
That said, I think updating filtlong would still be very helpful to the community as a whole. Let me know what you think!
I think this might be handled in the PR mentioned above, but I need to check.
@muabnezor was this addressed in the merged PR, I can't remember now 🤔