StatsModels.jl icon indicating copy to clipboard operation
StatsModels.jl copied to clipboard

[WIP] Handle missing values

Open palday opened this issue 6 years ago • 5 comments

When done, this will close #145 and potentially address #17 .

  • [x] Handle missing in continuous variables.
  • [ ] Handle missing in categorical variables.

palday avatar Sep 24 '19 22:09 palday

Codecov Report

Merging #153 into master will decrease coverage by 0.1%. The diff coverage is 66.66%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #153      +/-   ##
=========================================
- Coverage    83.6%   83.5%   -0.11%     
=========================================
  Files           9       9              
  Lines         494     497       +3     
=========================================
+ Hits          413     415       +2     
- Misses         81      82       +1
Impacted Files Coverage Δ
src/schema.jl 81.08% <100%> (+0.17%) :arrow_up:
src/terms.jl 83.76% <50%> (-0.59%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update d03b5c2...a799919. Read the comment docs.

codecov-io avatar Sep 24 '19 22:09 codecov-io

This generally looks good but needs tests. I think the problem I ran into was that copy(::Missing) isn't defined, but modelcols(::ContinuousTerm, ...) uses copy on teh elements of the underlying vector, instead of just writing the elements into a new vector. The idea was that maybe someone has a custom type that is mutable but they define it as continuous (for instance, Survival.jl uses a custom EventTime struct, which isn't mutable but still it's not a stretch), but maybe that's being too defensive.

kleinschmidt avatar Sep 24 '19 23:09 kleinschmidt

Yeah, that's about as far as I had gotten as well when I ran out of steam late local time. deepcopy(::Missing) is defined, and given that missing is a singleton, having copy(m::Missing) = deepcopy(m) seems like a reasonable definition, but probably one that doesn't belong here but rather in Base.

palday avatar Sep 25 '19 05:09 palday

Ignore the horrible hacking of getindex -- that will not stay.

palday avatar Sep 25 '19 20:09 palday

It would be nice if the following

julia> using DataFrames, StatsModels

julia> df = DataFrame(t=1:5, x=(1:5).*2.0, y=1:5.0, z=[12; missing; 16.:2:20])

julia> f = @formula(y ~ x + lag(x) + z + lag(z))

gave this outcome

julia> df_hand0 = transform(df, :z => lag => :z_lagged, :x => lag => :x_lagged)
5×6 DataFrame
 Row │ t      x        y        z          z_lagged   x_lagged  
     │ Int64  Float64  Float64  Float64?   Float64?   Float64?  
─────┼──────────────────────────────────────────────────────────
   1 │     1      2.0      1.0       12.0  missing    missing   
   2 │     2      4.0      2.0  missing         12.0        2.0
   3 │     3      6.0      3.0       16.0  missing          4.0
   4 │     4      8.0      4.0       18.0       16.0        6.0
   5 │     5     10.0      5.0       20.0       18.0        8.0

julia> df_hand = df_hand0[completecases(df_hand0),:] |> disallowmissing!

julia> f_hand = @formula(y ~ x + x_lagged + z + z_lagged)

julia> ff_hand = apply_schema(f_hand, schema(f_hand, df_hand))

julia> modelmatrix(ff_hand, df_hand)
2×4 Matrix{Float64}:
  8.0  6.0  18.0  16.0
 10.0  8.0  20.0  18.0

(If you happen to make this work in this PR, feel free to turn this code into a test.)

greimel avatar Jun 30 '22 13:06 greimel