gatk-sv

Removed pairwise batch effect checks from WDL workflow

Open shadizaheri opened this issue 7 months ago • 0 comments

Description:

This PR simplifies the WDL workflow that analyzes batch effects in genomic data. The main change is the removal of the pairwise batch effect checks, with the goal of simplifying and optimizing the pipeline.

Changes Made:

  1. Removed Pairwise Batch Effect Checks: Previously, the workflow considered two types of batch effects: one that compared a single batch against all other batches ("1 vs. all") and another that performed pairwise comparisons of each batch against every other batch. We decided to focus solely on the "1 vs. all" checks and remove the pairwise comparisons to streamline the analysis.
  2. Cleaned Up Redundant Tasks: As part of this update, we also removed tasks and scatter operations related specifically to the pairwise checks, further simplifying the code.
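To illustrate the difference between the two kinds of checks, here is a minimal Python sketch of which comparisons each strategy enumerates. It is not code from the workflow: the batch names and both function names are hypothetical, and it assumes pairwise comparisons are unordered (A vs. B is the same check as B vs. A).

```python
from itertools import combinations

def one_vs_all(batches):
    """Pair each batch with the list of all other batches ("1 vs. all")."""
    return [(b, [other for other in batches if other != b]) for b in batches]

def pairwise(batches):
    """List every unordered pair of distinct batches (assumed unordered)."""
    return list(combinations(batches, 2))

# Hypothetical batch names, for illustration only.
batches = ["batch_A", "batch_B", "batch_C"]

for batch, others in one_vs_all(batches):
    print(f"1 vs. all: {batch} vs. {others}")

for left, right in pairwise(batches):
    print(f"pairwise:  {left} vs. {right}")
```

After this PR, only the first kind of comparison remains in the workflow.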

The following is a high-level overview of the changes, relative to the eph_turn_off_unstable_af_filter branch:

  • I removed the MakeBatchPairsList task: This task generated the list of batch pairs for pairwise comparison. Since pairwise comparisons are no longer run, the task is no longer needed.

  • I removed the scatter block for pairwise comparisons: The scatter block that called helper.check_batch_effects for each pair in batch_pairs was removed.

  • I removed MergeVariantFailureLists Call for Pairwise Checks: The call to MergeVariantFailureLists as merge_pairwise_checks that collects results from pairwise batch effect detection was removed.

  • I adjusted the MakeReclassificationTable task: This task previously took inputs from both the pairwise and one-vs-all checks. Since the pairwise checks are gone, I also removed the pairwise_fails input and the logic inside the task that handled pairwise comparison results.

  • I removed any references to Pairwise outputs: Any output or logic that depends on the results of the pairwise comparisons was removed.

  • I modified the make_batch_effect_reclassification_table.PCRMinus_only.R script to reflect these changes.

Rationale:

These changes were motivated by the following:

  • Efficiency: Pairwise checks can be computationally intensive, especially when dealing with a large number of batches. By focusing on the "1 vs. all" approach, we can get a broader view of batch effects without the overhead of numerous pairwise comparisons.
  • Simplicity: Reducing the complexity of the workflow makes it easier to understand, maintain, and troubleshoot.
  • Focus on Broad Effects: The "1 vs. all" checks provide a holistic view of how a particular batch compares to the general trend across all batches. This approach can highlight more substantial, systemic batch effects rather than the nuanced differences between individual batches.
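As a rough illustration of the efficiency point, the sketch below counts how many comparisons each strategy needs as the number of batches grows. It assumes unordered pairs, so pairwise checks number n(n-1)/2 while "1 vs. all" checks number n. The function names are hypothetical, and the counts say nothing about the actual runtime of each check in the workflow.

```python
def n_pairwise(n_batches):
    """Number of unordered batch pairs: n * (n - 1) / 2."""
    return n_batches * (n_batches - 1) // 2

def n_one_vs_all(n_batches):
    """One comparison per batch against all the others."""
    return n_batches

for n in (5, 20, 100):
    print(f"{n} batches: {n_one_vs_all(n)} one-vs-all checks, "
          f"{n_pairwise(n)} pairwise checks")
# 5 batches: 5 one-vs-all checks, 10 pairwise checks
# 20 batches: 20 one-vs-all checks, 190 pairwise checks
# 100 batches: 100 one-vs-all checks, 4950 pairwise checks
```

The pairwise count grows quadratically with the number of batches, whereas the "1 vs. all" count grows linearly.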

By implementing these changes, we aim to provide a more streamlined, efficient, and intuitive workflow for analyzing batch effects.

Tests:

I tested the updated workflow using the batch list and datasets located in the Phase 1 workspace. For those who have Phase 1 AoU permissions, the test results can be viewed at the following Job Manager Results.

shadizaheri, Nov 09 '23 20:11