
Bug in handling of `missing` for functions in `@formula` call

Open · pdeffebach opened this issue on Mar 31, 2020 · 12 comments

```julia
julia> using DataFrames, GLM

julia> df = DataFrame(y = rand(100), x = rand(100));

julia> lm(@formula(y ~ x + lag(x)), df)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Float64
```

Constructing the lagged variable manually, e.g. `df.lag_x = lag(df.x)`, and then running the regression works fine.

It seems like GLM (or StatsModels) has not caught up with the ability to specify transformations inside the `@formula` call.

pdeffebach avatar Mar 31 '20 16:03 pdeffebach
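The manual workaround mentioned in the issue body can be sketched as follows. This is an illustrative sketch, assuming `lag` from ShiftedArrays.jl and the column names from the example above:

```julia
# Illustrative workaround: build the lagged column before calling `lm`,
# so the `missing` already exists in the input table and the usual
# row-dropping in StatsModels handles it.
using DataFrames, GLM
using ShiftedArrays: lag

df = DataFrame(y = rand(100), x = rand(100))
df.lag_x = lag(df.x)                     # first element is `missing`
model = lm(@formula(y ~ x + lag_x), df)  # the incomplete row is dropped
```

With the lagged column materialized in the table, the row containing `missing` is dropped before the model matrix is built, so the fit uses 99 observations.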

Yes, that's because the `missing` is generated by `modelcols`, which runs after missings are dropped from the underlying table.

kleinschmidt avatar Mar 31 '20 19:03 kleinschmidt

This would imply we need another check to drop missings, right?

pdeffebach avatar Mar 31 '20 20:03 pdeffebach

Yes, and this is one of the things that has held up a revamp of the model-fitting API in StatsModels.jl (e.g. why we still have `ModelFrame`/`ModelMatrix`).

One possibility that's been discussed is to just not worry about dropping missings on the input side and leave it up to the consumer to handle them (e.g., allow StatsModels terms to generate `modelcols` with missings). Then you'd only need one missing-removal pass.

kleinschmidt avatar Mar 31 '20 21:03 kleinschmidt

The underlying problem is that some terms can introduce missings, so you need at least one pass after generating the model columns. It seems kind of wasteful to generate a full matrix that's potentially `Union{Missing,T}` and then require consumers to convert it to a `T`. There's also the problem of keeping track of which rows of your input table the predictions/fitted values correspond to; I think that's the deeper issue. Right now that's handled by `ModelFrame` (there's a mask vector that tells you which rows of the input table made it into the model matrix).

kleinschmidt avatar Mar 31 '20 21:03 kleinschmidt
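The mask bookkeeping described above can be sketched in a few lines of plain Julia. This is a simplified illustration only (`drop_missing_rows` is a hypothetical name, not StatsModels API):

```julia
# Simplified sketch: one missing-removal pass over generated model columns.
# Returns a concrete (non-Union) matrix plus a mask recording which input
# rows survived, so fitted values can later be aligned with the table.
function drop_missing_rows(X::AbstractMatrix)
    mask = [!any(ismissing, view(X, i, :)) for i in axes(X, 1)]
    Xc = convert(Matrix{nonmissingtype(eltype(X))}, X[mask, :])
    return Xc, mask
end

X = Union{Missing,Float64}[1.0 2.0; missing 3.0; 4.0 5.0]
Xc, mask = drop_missing_rows(X)
```

Here `mask` is `[true, false, true]` and `Xc` is a concrete `Matrix{Float64}` with the incomplete row removed.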

> One possibility that's been discussed is to just not worry about dropping missings on the input side and leave it up to the consumer to handle them (e.g., allow StatsModels terms to generate `modelcols` with missings).

I'm generally sympathetic to having the user (or a package like GLM) handle missings on their own. But StatsModels dropping missings for everyone who uses it seems like the first-best outcome. There is just too much missing data out there, and having each package handle it differently would be really hard. Centralization is good.

pdeffebach avatar Apr 02 '20 14:04 pdeffebach

Is there any update on this? Has this been fixed upstream?

pdeffebach avatar Jul 09 '20 01:07 pdeffebach

I'm afraid not. It's related to JuliaStats/StatsModels.jl#153 and neither I nor @palday have had much extra bandwidth for that recently.

kleinschmidt avatar Jul 09 '20 16:07 kleinschmidt

But if you wanted to make a PR to StatsModels with a proposal for how and where to handle the dropping of missing values created by `modelcols`, that would be a huge help and a good start!

kleinschmidt avatar Jul 09 '20 16:07 kleinschmidt

@kleinschmidt I ran into this again today.

Are some of the internal fixes to StatsModels going to be able to resolve this? I know there is work to allow more customized handling of this stuff.

pdeffebach avatar Nov 17 '21 14:11 pdeffebach

A temporary kludge would be dropping missings from the resultant `X` in the `lm(::FormulaTerm, ...)` methods before `LinearModel(X, ...)` is called. That should be a straightforward PR against GLM.jl.

palday avatar Nov 18 '21 17:11 palday
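A sketch of what such a kludge would have to do (hypothetical helper, not actual GLM.jl code): filter complete cases from both the response and the already-generated model matrix before fitting:

```julia
# Hypothetical complete-case filter applied after the model matrix is built:
# keep only rows where neither y nor any column of X is missing.
function complete_cases(y::AbstractVector, X::AbstractMatrix)
    mask = [!ismissing(y[i]) && !any(ismissing, view(X, i, :)) for i in axes(X, 1)]
    yc = convert(Vector{nonmissingtype(eltype(y))}, y[mask])
    Xc = convert(Matrix{nonmissingtype(eltype(X))}, X[mask, :])
    return yc, Xc
end

y = Union{Missing,Float64}[1.0, 2.0, missing]
X = Union{Missing,Float64}[1.0 2.0; missing 3.0; 4.0 5.0]
yc, Xc = complete_cases(y, X)
```

Note that this materializes the `Union{Missing,T}` arrays first and then copies them, which is the extra allocation raised in the next comment.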

Yeah, that could work. But then we have to allocate that matrix twice. Would be nice to handle this upstream.

pdeffebach avatar Nov 19 '21 19:11 pdeffebach

Is there just no plan to fix this? The inability to use lag terms in formulas is extremely damaging to the usefulness of this package.

i-thot-u-wanted2dance avatar Sep 24 '23 14:09 i-thot-u-wanted2dance