[WIP] Handle missing values
When done, this will close #145 and potentially address #17 .
- [x] Handle missing in continuous variables.
- [ ] Handle missing in categorical variables.
Codecov Report
Merging #153 into master will decrease coverage by
0.1%. The diff coverage is66.66%.
@@ Coverage Diff @@
## master #153 +/- ##
=========================================
- Coverage 83.6% 83.5% -0.11%
=========================================
Files 9 9
Lines 494 497 +3
=========================================
+ Hits 413 415 +2
- Misses 81 82 +1
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/schema.jl | 81.08% <100%> (+0.17%) |
:arrow_up: |
| src/terms.jl | 83.76% <50%> (-0.59%) |
:arrow_down: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact),ø = not affected,? = missing dataPowered by Codecov. Last update d03b5c2...a799919. Read the comment docs.
This generally looks good but needs tests. I think the problem I ran into was that copy(::Missing) isn't defined, but modelcols(::ContinuousTerm, ...) uses copy on teh elements of the underlying vector, instead of just writing the elements into a new vector. The idea was that maybe someone has a custom type that is mutable but they define it as continuous (for instance, Survival.jl uses a custom EventTime struct, which isn't mutable but still it's not a stretch), but maybe that's being too defensive.
Yeah, that's about as far as I had gotten as well when I ran out of steam late local time. deepcopy(::Missing) is defined, and given that missing is a singleton, having copy(m::Missing) = deepcopy(m) seems like a reasonable definition, but probably one that doesn't belong here but rather in Base.
Ignore the horrible hacking of getindex -- that will not stay.
It would be nice if the following
julia> using DataFrames, StatsModels
julia> df = DataFrame(t=1:5, x=(1:5).*2.0, y=1:5.0, z=[12; missing; 16.:2:20])
julia> f = @formula(y ~ x + lag(x) + z + lag(z))
gave this outcome
julia> df_hand0 = transform(df, :z => lag => :z_lagged, :x => lag => :x_lagged)
5×6 DataFrame
Row │ t x y z z_lagged x_lagged
│ Int64 Float64 Float64 Float64? Float64? Float64?
─────┼──────────────────────────────────────────────────────────
1 │ 1 2.0 1.0 12.0 missing missing
2 │ 2 4.0 2.0 missing 12.0 2.0
3 │ 3 6.0 3.0 16.0 missing 4.0
4 │ 4 8.0 4.0 18.0 16.0 6.0
5 │ 5 10.0 5.0 20.0 18.0 8.0
julia> df_hand = df_hand0[completecases(df_hand0),:] |> disallowmissing!
julia> f_hand = @formula(y ~ x + x_lagged + z + z_lagged)
julia> ff_hand = apply_schema(f_hand, schema(f_hand, df_hand))
julia> modelmatrix(ff_hand, df_hand)
2×4 Matrix{Float64}:
8.0 6.0 18.0 16.0
10.0 8.0 20.0 18.0
(If you happen to make this work in this PR, feel free to turn this code into a test.)