StatsModels.jl icon indicating copy to clipboard operation
StatsModels.jl copied to clipboard

Incorrect Default Coding of Union{Missing, Int or Float64}

Open clintonTE opened this issue 5 years ago • 8 comments

Pretty straight forward. I would expect the results of these to be the same:

function mwe(N=10)
  df = DataFrame(x = rand(N),
    zm = Vector{Union{Missing, Int}}(rand(1:4,N)),
    z = Vector{Int}(rand(1:4,N)))

  println("W/out missings")
  f = @eval(@formula(x ~ z))
  f = apply_schema(f, schema(f,df), StatisticalModel)
  m = modelcols(f.rhs, df)
  display(m)

  println("With missings")
  f = @eval(@formula(x ~ zm))
  f = apply_schema(f, schema(f,df), StatisticalModel)
  m = modelcols(f.rhs, df)
  display(m)
end

mwe()

OUTPUT:

W/out missings
10×2 Array{Float64,2}:
 1.0  3.0
 1.0  3.0
 1.0  2.0
 1.0  2.0
 1.0  1.0
 1.0  2.0
 1.0  4.0
 1.0  1.0
 1.0  4.0

With missings
10×4 Array{Float64,2}:
 1.0  0.0  1.0  0.0
 1.0  0.0  1.0  0.0
 1.0  0.0  0.0  0.0
 1.0  0.0  0.0  0.0
 1.0  0.0  1.0  0.0
 1.0  0.0  0.0  0.0
 1.0  0.0  1.0  0.0
 1.0  1.0  0.0  0.0
 1.0  0.0  0.0  1.0
 1.0  0.0  1.0  0.0

clintonTE avatar Aug 28 '19 21:08 clintonTE

Note that if you pass ContinuousTerm as a hint to the schema you should get the desired result:

julia> modelcols(apply_schema(f, schema(f, df, Dict(:zm => ContinuousTerm)), StatisticalModel), df)[2]
10×2 Array{Float64,2}:
 1.0  1.0
 1.0  2.0
 1.0  2.0
 1.0  3.0
 1.0  3.0
 1.0  3.0
 1.0  3.0
 1.0  2.0
 1.0  3.0
 1.0  1.0

That being said, it might make sense to widen the criterion for whether something is continuous or not to be Union{Missing, <:Real} instead of just <:Real. And leave it up to the user to decide how to handle missings in the model matrix that results...

kleinschmidt avatar Sep 09 '19 18:09 kleinschmidt

Actually, if there are any missing values it will error:

julia> df.zm[2] = missing
missing

julia> modelcols(apply_schema(f, schema(f, df, Dict(:zm => ContinuousTerm)), StatisticalModel), df)[2]
ERROR: MethodError: no method matching copy(::Missing)

kleinschmidt avatar Sep 09 '19 18:09 kleinschmidt

Still, the issue is what happens when the variable is Vector{Union{<:Real, Missing} but does not contain missing values. This is a pretty common situation, for instance:

using DataFrames
df = DataFrame(y = [missing, 1, 2])
idx = completecases(df)
subdf = df[idx, :]

matthieugomez avatar Sep 09 '19 18:09 matthieugomez

Still, the issue is what happens when the variable is Vector{Union{<:Real, Missing} but does not contain missing values. This is a pretty common situation, for instance:

using DataFrames
df = DataFrame(y = [missing, 1, 2])
idx = completecases(df)
subdf = df[idx, :]

This. A lot of my regression code follows a similar pattern before getting the model matrix. Also worth pointing out that the old pattern of ModelMatrix(ModelFrame(.)) treated Vector{Float64} and Vector{Union{Float64,Missing}} the same (as continuous), at least in the original use case.

Moreover, I really like how StatsModels has had common-sense rules for continuous variables in most reasonable cases. Some intuition for ideal behavior: A typical "raw" dataframe model and formula will involve Dates, Ints, Symbols, Strings, and Float64s, and the various permutations combined with Missing. Except for T<:Union{Number,Missing}, I would expect Dummy Coding to be default and for T<:Union{Number, Missing} I would think it would be Continuous by default. I guess I would prefer that as a bright-line rule than relying on behind the scenes inference based on the assortment and uniqueness of values. (Of course, this implies that if the column in the DataFrame is already designated as a CategoricalValue or similar, that would also be obeyed.)

clintonTE avatar Sep 09 '19 18:09 clintonTE

I think it makes sense to treat Union{Missing, <:Number} as continuous. I'll do a PR later this week or happily review one if someone else wants to make one before then!

kleinschmidt avatar Sep 24 '19 21:09 kleinschmidt

Note that for now

using StatsModels
N = 10_000_000
x = rand(N) + rand([0, missing], N)
df = (x = x,)
schema(Term(:x), df)
#julia(20271,0x116991dc0) malloc: can't allocate region
#:*** mach_vm_map(size=199992320073728, flags: 60000100) failed (error code=3)
#julia(20271,0x116991dc0) malloc: *** set a breakpoint in malloc_error_break to debug
#141 ERROR: OutOfMemoryError(

matthieugomez avatar Apr 23 '20 15:04 matthieugomez

@matthieugomez That's not too surprising given the underlying issue (and how it expands out), but thanks for the example.

palday avatar Apr 23 '20 15:04 palday

yes, just connecting it to https://github.com/FixedEffects/FixedEffectModels.jl/issues/99

matthieugomez avatar Apr 23 '20 15:04 matthieugomez