Bug in handling of `missing` for functions in `@formula` call
```julia
julia> using DataFrames, GLM

julia> df = DataFrame(y = rand(100), x = rand(100));

julia> lm(@formula(y ~ x + lag(x)), df)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Float64
```
Constructing the lagged variable by hand via `df.lag_x = lag(df.x)` and then running the regression works fine (sketched below).
It seems like GLM (or StatsModels) has not caught up with the new ability to specify transformations inside the `@formula` call.
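For reference, here is roughly what that manual workaround looks like, assuming `lag` comes from ShiftedArrays.jl (the repro above doesn't show where `lag` is imported from):

```julia
using DataFrames, GLM, ShiftedArrays

df = DataFrame(y = rand(100), x = rand(100))
df.lag_x = ShiftedArrays.lag(df.x)       # first element is `missing`
df_complete = dropmissing(df)            # drop the incomplete row(s) by hand
lm(@formula(y ~ x + lag_x), df_complete)
```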
Yes, that's because the `missing` is generated by `modelcols`, which happens after missings are dropped from the underlying table.
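To spell that out with `lag` from ShiftedArrays.jl (presumably what the repro uses): the shift itself introduces a `missing` even when the input column has none, so dropping missings from the source table before the term functions run can't catch it.

```julia
using ShiftedArrays

x = [1.0, 2.0, 3.0]             # the input column has no missings
lagged = ShiftedArrays.lag(x)   # behaves like [missing, 1.0, 2.0]
eltype(lagged)                  # Union{Missing, Float64}
```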
This would imply we need another check to drop missings, right?
Yes, and this is one of the things that has held up a revamp of the model-fitting API in StatsModels.jl (e.g. why we still have `ModelFrame`/`ModelMatrix`).
One possibility that's been discussed is to just not worry about dropping missings on the input side, and leave it up to the consumer to handle them (e.g., allow StatsModels terms to generate `modelcols` output with missings). Then you'd only need one missing-removal pass.
The underlying problem is that some terms can introduce missings, so you need at least one pass after generating the model columns. It seems kind of wasteful to generate a full matrix that's potentially `Union{Missing,T}` and then require consumers to convert that to a `T`. There's also the problem of keeping track of which rows of your input table the predictions/fitted values correspond to; I think that's the deeper issue. Right now that's handled by `ModelFrame` (there's a mask vector that tells you which rows of the input table made it into the model matrix).
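For what it's worth, a consumer-side "one missing-removal pass" along those lines could look like the sketch below. Nothing here exists in StatsModels; the function name is hypothetical, and it just illustrates the mask idea.

```julia
using Missings: disallowmissing

# Hypothetical helper: given model columns that may contain `missing`
# (as Term-generated `modelcols` output could under this proposal),
# keep only the complete rows and return a mask that maps model rows
# back to rows of the input table.
function drop_missing_rows(X::AbstractMatrix, y::AbstractVector)
    mask = .!ismissing.(y) .& vec(all(.!ismissing.(X), dims=2))
    Xc = disallowmissing(X[mask, :])   # narrow the eltype now that missings are gone
    yc = disallowmissing(y[mask])
    return Xc, yc, mask
end
```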
> One possibility that's been discussed is to just not worry about dropping missings on the input side, and leave it up to the consumer to handle them (e.g., allow StatsModels terms to generate `modelcols` output with missings).
I'm generally sympathetic to having the user (or a package like GLM) handle missings on their own. But StatsModels dropping missings for everyone that uses it seems like the first-best outcome. There is just too much missing data out there, and having each package handle it differently would be really hard. Centralization is good.
Is there any update on this? Has this been fixed upstream?
I'm afraid not. It's related to JuliaStats/StatsModels.jl#153 and neither I nor @palday have had much extra bandwidth for that recently.
But if you wanted to make a PR to StatsModels with a proposal for how and where to handle the dropping of missing values created by `modelcols`, that would be a huge help and a good start!
@kleinschmidt I ran into this again today.
Are some of the internal fixes to StatsModels going to be able to resolve this? I know there is work to allow more customized handling of this stuff.
A temporary kludge would be dropping missings from the resulting `X` in the `lm(::FormulaTerm, ...)` methods before `LinearModel(X, ...)` is called. That's something that should be a straightforward PR against GLM.jl.
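Roughly, and written as a free-standing function rather than a change to GLM's own `lm` methods (the name is hypothetical, and this assumes `modelcols` returns columns that may contain `missing`, as described above), that kludge could look like:

```julia
using DataFrames, GLM, StatsModels
using Missings: disallowmissing

# Hypothetical sketch of the kludge: materialize the model columns,
# drop rows where the generated columns contain `missing`, then fit.
function lm_skipmissing(f::FormulaTerm, df::AbstractDataFrame)
    sch = apply_schema(f, schema(f, df))
    y, X = modelcols(sch, df)
    mask = .!ismissing.(y) .& vec(all(.!ismissing.(X), dims=2))
    fit(LinearModel, disallowmissing(X[mask, :]), disallowmissing(y[mask]))
end
```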
Yeah, that could work. But then we have to allocate that matrix twice. Would be nice to handle this upstream.
Is there just no plan to fix this? Not being able to use lag terms in formulas severely limits the usefulness of this package.