mag Optional skipping of short-read input to Filtlong for large datasets

Description of the bug

Hi all - long time listener first time caller:

I have a rather large set of Illumina data along with some nanopore reads on which I was trying to run the hybrid assembly option. After 10+ hours, filtlong was still processing the nanopore reads. I did some digging, and the current command passes the short reads to filtlong as an external reference (the -1/-2 options). I think that is fine for small-ish datasets but seems impractical for larger ones.

Once I edited the filtlong.nf code to no longer use the short-reads, the filtlong process took less than 5 minutes and the pipeline has proceeded as expected. Maybe there could be a flag to turn on/off that feature?

filtlong.nf:

process FILTLONG {
    tag "$meta.id"

    conda "bioconda::filtlong=0.2.0"
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/filtlong:0.2.0--he513fc3_3' :
        'biocontainers/filtlong:0.2.0--he513fc3_3' }"

    input:
    tuple val(meta), path(long_reads), path(short_reads_1), path(short_reads_2)

    output:
    tuple val(meta), path("${meta.id}_lr_filtlong.fastq.gz"), emit: reads
    path "versions.yml"                                     , emit: versions

    script:
    """
    filtlong \
        -1 ${short_reads_1} \
        -2 ${short_reads_2} \
        --min_length ${params.longreads_min_length} \
        --keep_percent ${params.longreads_keep_percent} \
        --trim \
        --length_weight ${params.longreads_length_weight} \
        ${long_reads} | gzip > ${meta.id}_lr_filtlong.fastq.gz

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        filtlong: \$(filtlong --version | sed -e "s/Filtlong v//g")
    END_VERSIONS
    """
}

Edited working solution:

process FILTLONG {
    tag "$meta.id"

    conda "bioconda::filtlong=0.2.0"
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/filtlong:0.2.0--he513fc3_3' :
        'biocontainers/filtlong:0.2.0--he513fc3_3' }"

    input:
    tuple val(meta), path(long_reads), path(short_reads_1), path(short_reads_2)

    output:
    tuple val(meta), path("${meta.id}_lr_filtlong.fastq.gz"), emit: reads
    path "versions.yml"                                     , emit: versions

    script:
    """
    filtlong \
        --min_length ${params.longreads_min_length} \
        --keep_percent ${params.longreads_keep_percent} \
        --length_weight ${params.longreads_length_weight} \
        ${long_reads} | gzip > ${meta.id}_lr_filtlong.fastq.gz

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        filtlong: \$(filtlong --version | sed -e "s/Filtlong v//g")
    END_VERSIONS
    """
}

Command used and terminal output

No response

Relevant files

No response

System information

No response

Oct 14 '24 00:10 ddomman

Hi @ddomman !

Thanks for this! It's great that you already have a solution :)

Within the module we could make it optional by inserting short_reads_1/short_reads_2 into the command only if they are supplied, something along the lines of:

process FILTLONG {
    tag "$meta.id"

    conda "bioconda::filtlong=0.2.0"
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/filtlong:0.2.0--he513fc3_3' :
        'biocontainers/filtlong:0.2.0--he513fc3_3' }"

    input:
    tuple val(meta), path(long_reads), path(short_reads_1), path(short_reads_2)

    output:
    tuple val(meta), path("${meta.id}_lr_filtlong.fastq.gz"), emit: reads
    path "versions.yml"                                     , emit: versions

    script:
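    // Pass the short reads as a reference only when they are supplied.
    // The line continuation stays in the script so the command is not cut short when sr_command is empty.
    def sr_command = short_reads_1 ? "-1 ${short_reads_1} -2 ${short_reads_2}" : ""
    """
    filtlong \
        ${sr_command} \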
        --min_length ${params.longreads_min_length} \
        --keep_percent ${params.longreads_keep_percent} \
        --length_weight ${params.longreads_length_weight} \
        ${long_reads} | gzip > ${meta.id}_lr_filtlong.fastq.gz

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        filtlong: \$(filtlong --version | sed -e "s/Filtlong v//g")
    END_VERSIONS
    """
}
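
For what it's worth, the same on/off logic can be sketched at the shell level, which makes it easy to check the resulting command line without running the pipeline. This is only an illustration under a few assumptions: build_filtlong_cmd is a hypothetical helper, the file names and numeric values are placeholders, and it prints the command rather than running filtlong. It also ties --trim to the short reads, on the assumption that trimming relies on the reference k-mers.

```shell
#!/usr/bin/env bash
# Sketch only: prints the filtlong command instead of running it.
# build_filtlong_cmd is a hypothetical helper; file names and values are placeholders.
build_filtlong_cmd() {
    local long_reads="$1" sr1="${2:-}" sr2="${3:-}"
    local args=()

    # Add the short-read reference (and --trim, assumed to need it)
    # only when both short-read files are given.
    if [[ -n "$sr1" && -n "$sr2" ]]; then
        args+=(-1 "$sr1" -2 "$sr2" --trim)
    fi

    # Options that apply with or without a short-read reference.
    args+=(--min_length 1000 --keep_percent 90 --length_weight 10 "$long_reads")

    echo "filtlong ${args[*]}"
}

# With short reads: the reference options are included.
build_filtlong_cmd lr.fastq.gz sr_R1.fastq.gz sr_R2.fastq.gz
# Without short reads: the reference options are left out entirely.
build_filtlong_cmd lr.fastq.gz
```

Building the arguments as a list means there is no dangling line continuation when the short reads are absent, which is the case that is easy to get wrong when splicing an optional string into a multi-line command.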

If you want to contribute to the module and pipeline, you can make these changes via a PR to nf-core/modules, and we can then update the pipeline (with credit to you!). Contributions will be gratefully received :)

Note that @muabnezor is currently overhauling the long-read/nanopore preprocessing tools anyway: we just merged porechop_abi into the dev branch as a faster replacement for porechop, and next we plan to add nanoq as an alternative to Filtlong. So if you prefer, you could wait for that instead.

That said, I think updating filtlong would still be very helpful to the community as a whole. Let me know what you think!

Oct 14 '24 07:10 jfy133

I think this might be handled in the PR mentioned above, but I need to check.

Nov 28 '24 13:11 jfy133

@muabnezor was this addressed in the merged PR? I can't remember now 🤔

Jun 06 '25 12:06 jfy133
