demultiplex icon indicating copy to clipboard operation
demultiplex copied to clipboard

skip over empty fastqs after demultiplex and continue gracefully

Open glichtenstein opened this issue 1 year ago • 11 comments

PR checklist

  • [x] This comment contains a description of changes (with reason).
  • [ ] If you've fixed a bug or added code that should be tested, add tests!
  • [ ] If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • [ ] If necessary, also make a PR on the nf-core/demultiplex branch on the nf-core/test-datasets repository.
  • [x] Make sure your code lints (nf-core lint).
  • [x] Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • [ ] Usage Documentation in docs/usage.md is updated.
  • [ ] Output Documentation in docs/output.md is updated.
  • [x] CHANGELOG.md is updated.
  • [ ] README.md is updated (including new tool citations and authors/contributors).

Title of PR:

"Enhanced Resilience to Empty or Invalid FASTQ Files with Logging Functionality"

Description:

Changes Made:

  1. Introduction of a Log File:

    • Implemented a global log file (${params.outdir}/invalid_fastqs.log) for tracking empty or invalid FASTQ files, enhancing post-run analysis.
  2. Modified generate_fastq_meta Function:

    • Updated generate_fastq_meta to gracefully handle empty or invalid FASTQ files, preventing pipeline interruption.
    • Added logFile as a new parameter for logging these occurrences.
  3. New Function - appendToLogFile:

    • Introduced appendToLogFile, a utility function for appending messages to the log file. This function improves maintainability and includes a check to handle Groovy GStrings.

Reason for Changes:

  • Improving Pipeline Robustness: Aims to bolster the pipeline's resilience against edge cases like empty or invalid FASTQ files.
  • Enhanced Debugging and Transparency: Offers a clear record of skipped files, aiding in debugging and ensuring data quality.
  • Maintainability and Code Organization: The addition of appendToLogFile and the parameterization of logFile align with best coding practices, enhancing code reusability and organization.

Impact:

  • These enhancements bolster the pipeline's robustness without altering its core functionality, ensuring a smooth experience when encountering empty or invalid FASTQ files.

Use Cases Addressed:

  1. Incorrect Barcode in Sample Sheet:

    • In cases where an incorrect set of barcodes is provided for a sample, resulting in an empty FASTQ file, the pipeline previously halted. This was problematic when processing multiple samples (e.g., 99 valid samples and 1 with an incorrect barcode), as valid FASTQ files would remain in the workdir and not be moved to the outdir. With the proposed changes, the pipeline will continue to the end, allowing for the release of data for valid samples and logging the problematic sample for further investigation.
  2. Library Preparation Errors:

    • Instances where library preparation for a specific sample is faulty can also lead to empty FASTQ files. The updated pipeline will now allow for continuous processing to the end, enabling users to utilize MultiQC outputs to identify which samples were problematic. The invalid_fastqs.log file will assist in pinpointing these samples for reprocessing or further laboratory investigation.

glichtenstein avatar Jan 12 '24 20:01 glichtenstein

nf-core lint overall result: Failed :x:

Posted for pipeline commit 6c63769

+| ✅ 157 tests passed       |+
#| ❔   2 tests were ignored |#
!| ❗   4 tests had warnings |!
-| ❌  10 tests failed       |-

:x: Test failures:

  • files_exist - File must be removed: lib/nfcore_external_java_deps.jar
  • files_unchanged - .github/workflows/branch.yml does not match the template
  • files_unchanged - .github/workflows/linting_comment.yml does not match the template
  • files_unchanged - .github/workflows/linting.yml does not match the template
  • files_unchanged - assets/email_template.html does not match the template
  • files_unchanged - assets/email_template.txt does not match the template
  • files_unchanged - assets/nf-core-demultiplex_logo_light.png does not match the template
  • files_unchanged - docs/images/nf-core-demultiplex_logo_light.png does not match the template
  • files_unchanged - docs/images/nf-core-demultiplex_logo_dark.png does not match the template
  • files_unchanged - pyproject.toml does not match the template

:heavy_exclamation_mark: Test warnings:

  • pipeline_todos - TODO string in README.md: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in WorkflowDemultiplex.groovy: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in awsfulltest.yml: You can customise AWS full pipeline tests as required

:grey_question: Tests ignored:

:white_check_mark: Tests passed:

Run details

  • nf-core/tools version 2.12.1
  • Run at 2024-02-02 18:54:59

github-actions[bot] avatar Jan 12 '24 20:01 github-actions[bot]


Question for Reviewers about setting up a test for the PR: In our in-house testing using BCLConvert, we have a SampleSheet.csv where one of the samples has wrong barcodes assigned (both Index and Index2). This leads to the generation of empty FASTQ files for that specific sample, which is one of the scenarios this PR aim to address.

However, for the nf-core GitHub tests, we are currently using this sample sheet for testing demultiplex: flowcell_input.csv. This sample sheet doesn't include any incorrect indices that would result in empty FASTQ files.

My question is: Would it be acceptable to modify this standard sample sheet so that it includes a sample (sample2) with incorrect indices? This modification would enable us to test the pipeline's ability to handle empty FASTQ files in a controlled and predictable manner, which is crucial for validating the new features.

Perhaps, we could also modify one of these with a s25 and add some random barcodes that will yield empty s25 fastqs:

  1. fqtk_samplesheet.csv
  2. sgdemux-samplesheet.csv

Your guidance on this matter would be greatly appreciated, as it will help ensure that our tests are comprehensive and reflective of real-world scenarios where such issues might arise.


glichtenstein avatar Jan 24 '24 21:01 glichtenstein

Thanks for the PR!

Gonna wait for @matthdsm on this one.

In the meantime could you squash some of the commits? There's a lot of them that are linting or debugging, or if you'd like we can just squash the whole PR, there's not many changes made. 89 commits for 96 lines of code just seems like a lot.

edmundmiller avatar Jan 31 '24 17:01 edmundmiller

@glichtenstein about the test data, yes I think it would make a lot of sense to have a samplesheet with faulty barcodes. Maybe it would be best to just create a new samplesheet so we have two test cases, one for a normal run and one with faulty barcodes.

I've been working on setting up a test dataset from a NovaSeq6000, would be nice if we could have second samplesheet there too (see https://github.com/nf-core/test-datasets/blob/demultiplex/testdata/NovaSeq6000)

Aratz avatar Feb 02 '24 09:02 Aratz

Thanks for the PR!

Gonna wait for @matthdsm on this one.

In the meantime could you squash some of the commits? There's a lot of them that are linting or debugging, or if you'd like we can just squash the whole PR, there's not many changes made. 89 commits for 96 lines of code just seems like a lot.

Thanks @edmundmiller ! Very good idea, I was not aware that one could squash commits, I will proceed with an interactive rebase, thus squashing all the commits into one that is more descriptive. Thanks for the heads-up.

glichtenstein avatar Feb 02 '24 14:02 glichtenstein

LGTM!

Thanks for the review @matthdsm !

glichtenstein avatar Feb 02 '24 14:02 glichtenstein

@glichtenstein about the test data, yes I think it would make a lot of sense to have a samplesheet with faulty barcodes. Maybe it would be best to just create a new samplesheet so we have two test cases, one for a normal run and one with faulty barcodes.

I've been working on setting up a test dataset from a NovaSeq6000, would be nice if we could have second samplesheet there too (see https://github.com/nf-core/test-datasets/blob/demultiplex/testdata/NovaSeq6000)

@Aratz Thanks for the reviews! Yes, that is a very good idea, I will get this dataset, and do some local testing with faulty barcodes, then I could open a branch for the NovaSeq6000 SampleSheet.csv. Do you rather have two SampleSheets? like a SampleSheet_faulty.csv? or would you prefer that I add a new sample row for the current SampleSheet.csv?

glichtenstein avatar Feb 02 '24 14:02 glichtenstein

I think it's great to be able to complete the pipeline even with faulty barcodes! However, sometimes we might want the pipeline to crash to avoid risking this issue going under the radar. Maybe we could have an extra parameter e.g. skip_empty_fastq ? (could be true by default).

This is a very good suggestion, we thought of it but it was going to require more business logic, but definitely it will be a good addition. I will explore options and come back to you. In this regards, shall I merge this PR into dev and then open a new branch for it, or would you suggest I keep working on the '--skip_empty_fastq' in the same PR?

Also, if I'm not mistaken subworkflows/nf-core/bcl_demultiplex/main.nf is automatically imported from https://github.com/nf-core/modules, so your PR should be made against that repository first.

Oh, I was not aware of that relation, I will take a look, and make the PR againt https://github.com/nf-core/modules if possible.
Thanks a lot.

glichtenstein avatar Feb 02 '24 14:02 glichtenstein

@Aratz I have opened a PR in modules 🤞

glichtenstein avatar Feb 02 '24 21:02 glichtenstein

Thanks for the PR!

Gonna wait for @matthdsm on this one.

In the meantime could you squash some of the commits? There's a lot of them that are linting or debugging, or if you'd like we can just squash the whole PR, there's not many changes made. 89 commits for 96 lines of code just seems like a lot.

Hi @edmundmiller, after the rebase to squash all the commits, I do not pass the nf-core inting, do you think something has changed with the tests?

glichtenstein avatar Feb 02 '24 21:02 glichtenstein

@Aratz Thanks for the reviews! Yes, that is a very good idea, I will get this dataset, and do some local testing with faulty barcodes, then I could open a branch for the NovaSeq6000 SampleSheet.csv. Do you rather have two SampleSheets? like a SampleSheet_faulty.csv? or would you prefer that I add a new sample row for the current SampleSheet.csv?

Yes I think it would be best with two samplesheets (there is one in the .tar.gz file so you might need to upload a new one too. I thinking both datasets can sit in the Novaseq6000 folder, and you can document why they differ in the README file.

This is a very good suggestion, we thought of it but it was going to require more business logic, but definitely it will be a good addition. I will explore options and come back to you. In this regards, shall I merge this PR into dev and then open a new branch for it, or would you suggest I keep working on the '--skip_empty_fastq' in the same PR?

I'd say you can close this one and open a new one with the automatic module update and work on the new parameter there :+1:

Oh, I was not aware of that relation, I will take a look, and make the PR againt https://github.com/nf-core/modules if possible. Thanks a lot.

Great! I'll check it out!

Aratz avatar Feb 05 '24 07:02 Aratz

Closing this PR, it was opened on nf-core/modules instead: https://github.com/nf-core/modules/pull/4842

glichtenstein avatar May 08 '24 16:05 glichtenstein