varimpact icon indicating copy to clipboard operation
varimpact copied to clipboard

Fix GitHub issue #32: Remove inconsistent missing data handling

Open ck37 opened this issue 4 months ago • 0 comments

Summary

This PR fixes GitHub issue #32 by removing the problematic conditional block in vim-factors.R that caused inconsistent missing data handling.

Problem

The bug was in lines 181-186 of R/vim-factors.R:

if (sum(deltat == 0) < 10) {
  Yt = Yt[deltat == 1]
  At = At[deltat == 1]
  Wtsht = Wtsht[deltat == 1, , drop = FALSE]
  deltat = deltat[deltat == 1]
}

This created inconsistent behavior:

  • < 10 missing values: Cleanup code runs, TMLE receives clean data ✅
  • ≥ 10 missing values: Cleanup code doesn't run, TMLE receives dirty data and fails ❌

Root Cause

The arbitrary threshold of 10 missing values caused:

  1. Inconsistent data preprocessing
  2. Unpredictable failures based on missing value count
  3. Violation of the principle that similar inputs should produce similar behavior

Solution

  • Removed the entire conditional block (lines 181-186)
  • This allows proper delta missingness estimation as intended by the existing TODO comment
  • Ensures consistent behavior regardless of the number of missing values

Testing

Added comprehensive tests:

  • tests/testthat/test-missing-y-bug.R - Formal test suite with 3 test cases
  • test_bug_force.R - Reproduces the exact bug condition
  • test_fix_verification.R - Verifies the fix works
  • TEST_BUG_32_README.md - Detailed documentation and instructions

Test Results

Before fix:

  • ✅ < 10 missing: Works (cleanup runs)
  • ❌ ≥ 10 missing: Fails (no cleanup, TMLE gets dirty data)

After fix:

  • ✅ All cases: Consistent behavior through delta missingness estimation

Files Changed

  • R/vim-factors.R - Removed problematic conditional block
  • tests/testthat/test-missing-y-bug.R - Added comprehensive test suite
  • TEST_BUG_32_README.md - Added documentation
  • Test files for reproduction and verification

Verification

The fix has been tested with the exact simulation from the GitHub issue:

  • 11 missing Y values (the original failing case)
  • Multiple boundary conditions (exactly 10, more than 10)
  • Verification that the fix maintains consistent behavior

This resolves the issue where varimpact() would fail unpredictably based on the number of missing Y values.

Closes #32

ck37 avatar Sep 10 '25 04:09 ck37