StatsModels.jl
StatsModels.jl copied to clipboard
Incorrect Default Coding of Union{Missing, Int or Float64}
Pretty straight forward. I would expect the results of these to be the same:
function mwe(N=10)
df = DataFrame(x = rand(N),
zm = Vector{Union{Missing, Int}}(rand(1:4,N)),
z = Vector{Int}(rand(1:4,N)))
println("W/out missings")
f = @eval(@formula(x ~ z))
f = apply_schema(f, schema(f,df), StatisticalModel)
m = modelcols(f.rhs, df)
display(m)
println("With missings")
f = @eval(@formula(x ~ zm))
f = apply_schema(f, schema(f,df), StatisticalModel)
m = modelcols(f.rhs, df)
display(m)
end
mwe()
OUTPUT:
W/out missings
10×2 Array{Float64,2}:
1.0 3.0
1.0 3.0
1.0 2.0
1.0 2.0
1.0 1.0
1.0 2.0
1.0 4.0
1.0 1.0
1.0 4.0
With missings
10×4 Array{Float64,2}:
1.0 0.0 1.0 0.0
1.0 0.0 1.0 0.0
1.0 0.0 0.0 0.0
1.0 0.0 0.0 0.0
1.0 0.0 1.0 0.0
1.0 0.0 0.0 0.0
1.0 0.0 1.0 0.0
1.0 1.0 0.0 0.0
1.0 0.0 0.0 1.0
1.0 0.0 1.0 0.0
Note that if you pass ContinuousTerm
as a hint to the schema you should get the desired result:
julia> modelcols(apply_schema(f, schema(f, df, Dict(:zm => ContinuousTerm)), StatisticalModel), df)[2]
10×2 Array{Float64,2}:
1.0 1.0
1.0 2.0
1.0 2.0
1.0 3.0
1.0 3.0
1.0 3.0
1.0 3.0
1.0 2.0
1.0 3.0
1.0 1.0
That being said, it might make sense to widen the criterion for whether something is continuous or not to be Union{Missing, <:Real}
instead of just <:Real
. And leave it up to the user to decide how to handle missings in the model matrix that results...
Actually, if there are any missing values it will error:
julia> df.zm[2] = missing
missing
julia> modelcols(apply_schema(f, schema(f, df, Dict(:zm => ContinuousTerm)), StatisticalModel), df)[2]
ERROR: MethodError: no method matching copy(::Missing)
Still, the issue is what happens when the variable is Vector{Union{<:Real, Missing}
but does not contain missing values. This is a pretty common situation, for instance:
using DataFrames
df = DataFrame(y = [missing, 1, 2])
idx = completecases(df)
subdf = df[idx, :]
Still, the issue is what happens when the variable is
Vector{Union{<:Real, Missing}
but does not contain missing values. This is a pretty common situation, for instance:using DataFrames df = DataFrame(y = [missing, 1, 2]) idx = completecases(df) subdf = df[idx, :]
This. A lot of my regression code follows a similar pattern before getting the model matrix. Also worth pointing out that the old pattern of ModelMatrix(ModelFrame(.))
treated Vector{Float64}
and Vector{Union{Float64,Missing}}
the same (as continuous), at least in the original use case.
Moreover, I really like how StatsModels
has had common-sense rules for continuous variables in most reasonable cases. Some intuition for ideal behavior: A typical "raw" dataframe model and formula will involve Dates, Ints, Symbols, Strings, and Float64s, and the various permutations combined with Missing. Except for T<:Union{Number,Missing}
, I would expect Dummy Coding to be default and for T<:Union{Number, Missing}
I would think it would be Continuous by default. I guess I would prefer that as a bright-line rule than relying on behind the scenes inference based on the assortment and uniqueness of values. (Of course, this implies that if the column in the DataFrame is already designated as a CategoricalValue or similar, that would also be obeyed.)
I think it makes sense to treat Union{Missing, <:Number}
as continuous. I'll do a PR later this week or happily review one if someone else wants to make one before then!
Note that for now
using StatsModels
N = 10_000_000
x = rand(N) + rand([0, missing], N)
df = (x = x,)
schema(Term(:x), df)
#julia(20271,0x116991dc0) malloc: can't allocate region
#:*** mach_vm_map(size=199992320073728, flags: 60000100) failed (error code=3)
#julia(20271,0x116991dc0) malloc: *** set a breakpoint in malloc_error_break to debug
#141 ERROR: OutOfMemoryError(
@matthieugomez That's not too surprising given the underlying issue (and how it expands out), but thanks for the example.
yes, just connecting it to https://github.com/FixedEffects/FixedEffectModels.jl/issues/99