varimpact
varimpact copied to clipboard
Fix GitHub issue #32: Remove inconsistent missing data handling
Summary
This PR fixes GitHub issue #32 by removing the problematic conditional block in vim-factors.R that caused inconsistent missing data handling.
Problem
The bug was in lines 181-186 of R/vim-factors.R:
if (sum(deltat == 0) < 10) {
Yt = Yt[deltat == 1]
At = At[deltat == 1]
Wtsht = Wtsht[deltat == 1, , drop = FALSE]
deltat = deltat[deltat == 1]
}
This created inconsistent behavior:
- < 10 missing values: Cleanup code runs, TMLE receives clean data ✅
- ≥ 10 missing values: Cleanup code doesn't run, TMLE receives dirty data and fails ❌
Root Cause
The arbitrary threshold of 10 missing values caused:
- Inconsistent data preprocessing
- Unpredictable failures based on missing value count
- Violation of the principle that similar inputs should produce similar behavior
Solution
- Removed the entire conditional block (lines 181-186)
- This allows proper delta missingness estimation as intended by the existing TODO comment
- Ensures consistent behavior regardless of the number of missing values
Testing
Added comprehensive tests:
tests/testthat/test-missing-y-bug.R- Formal test suite with 3 test casestest_bug_force.R- Reproduces the exact bug conditiontest_fix_verification.R- Verifies the fix worksTEST_BUG_32_README.md- Detailed documentation and instructions
Test Results
Before fix:
- ✅ < 10 missing: Works (cleanup runs)
- ❌ ≥ 10 missing: Fails (no cleanup, TMLE gets dirty data)
After fix:
- ✅ All cases: Consistent behavior through delta missingness estimation
Files Changed
R/vim-factors.R- Removed problematic conditional blocktests/testthat/test-missing-y-bug.R- Added comprehensive test suiteTEST_BUG_32_README.md- Added documentation- Test files for reproduction and verification
Verification
The fix has been tested with the exact simulation from the GitHub issue:
- 11 missing Y values (the original failing case)
- Multiple boundary conditions (exactly 10, more than 10)
- Verification that the fix maintains consistent behavior
This resolves the issue where varimpact() would fail unpredictably based on the number of missing Y values.
Closes #32