StatsModels.jl
Implementing First-Difference
I am trying to make a transformation à la PolyTerm. However, I fail to get the modelcols right when the term is a CategoricalTerm. It seems to parse correctly, but the expansion only behaves à la ContinuousTerm.
using DataFrames, StatsBase, StatsModels
data = DataFrame(y = rand(10), x = rand(10), z = categorical(repeat(1:2, 5)))
formula = @formula(Δ(y) ~ Δ(x) + Δ(z))
Δ(obj::AbstractVector) = diff(obj)
Δ(obj::AbstractCategoricalVector) = obj[2:end]
struct FDTerm{T} <: AbstractTerm
    term::T
end
function StatsModels.apply_schema(t::FunctionTerm{typeof(Δ)}, sch)
    term = apply_schema(t.args_parsed[1], sch)
    FDTerm(term)
end
function StatsModels.modelcols(t::FDTerm, d::NamedTuple)
    modelcols(t.term, d)
end
sc = apply_schema(formula, schema(data))
modelcols(sc, data)
apply_schema(sc.rhs.terms[2], schema(data))
The way you've defined it here, modelcols is just returning the modelcols for the wrapped term. You need something like
StatsModels.modelcols(t::FDTerm, d::NamedTuple) = Δ(t.term, d)
Plus something like
using Tables: ColumnTable
Δ(obj::ContinuousTerm, d::ColumnTable) = Δ(getproperty(d, obj.sym))  # the column name is stored in the `sym` field
I'm not sure what the behavior you want is for a categorical term though, or even what would be reasonable...
I will try it out and get back. For categorical, in order to match the length, it would just be Δ(obj::AbstractCategoricalVector) = obj[2:end]. I might do the Union with the subarray catvals in case of subdataframes too.
But that doesn't generate the actual numerical columns. I guess you could do
Δ(obj::CategoricalTerm, d::ColumnTable) = obj.contrasts[getproperty(d, obj.sym)[2:end], :]  # contrasts are indexed by (rows, columns)
which is basically what modelcols does for a categorical term, just using 2:end. But this seems like a bad strategy (or at least a fragile one), since it assumes that every term is wrapped in a Δ... and if you're going to make that assumption, you might as well require that people call it like Δ(y ~ 1 + a + b)
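For reference, the pieces discussed so far can be put together roughly as follows. This is only a sketch under some assumptions: it follows the three-argument apply_schema pattern used for PolyTerm in the StatsModels documentation, it guesses that the function term's arguments live in either args_parsed (older releases) or args (newer releases), and it uses a plain column table with a string column as a stand-in for the categorical variable. The "drop the first row" rule for categorical terms is the one proposed in this thread, not an established convention.

```julia
using StatsModels

# Marker function so that Δ(...) is recognised inside @formula.
Δ(x::AbstractVector) = diff(x)

# Wrapper term holding the concrete (schema-applied) inner term.
struct FDTerm{T} <: AbstractTerm
    term::T
end

function StatsModels.apply_schema(t::FunctionTerm{typeof(Δ)},
                                  sch::StatsModels.Schema,
                                  Mod::Type)
    # Assumption: the field holding the parsed arguments depends on the
    # StatsModels version, so check which one is present.
    arg = hasproperty(t, :args_parsed) ? t.args_parsed[1] : t.args[1]
    FDTerm(apply_schema(arg, sch, Mod))
end

# Continuous case: difference the generated column.
StatsModels.modelcols(t::FDTerm, d::NamedTuple) = Δ(modelcols(t.term, d))

# Categorical case: expand to contrast columns first, then drop the first
# row so the length matches the differenced columns.
StatsModels.modelcols(t::FDTerm{<:CategoricalTerm}, d::NamedTuple) =
    modelcols(t.term, d)[2:end, :]

# Illustrative data: a column table with a string column as the categorical.
data = (y = rand(10), x = rand(10), z = repeat(["a", "b"], 5))
f = apply_schema(@formula(Δ(y) ~ Δ(x) + Δ(z)), schema(data))
y, X = modelcols(f, data)
```

Dispatching on FDTerm{<:CategoricalTerm} keeps access to the contrasts, because the wrapped term is already the schema-applied one. Methods such as width and coefnames would still be needed before using this in a fitted model.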
I have been successful at implementing the continuous AbstractVector terms, but still can't get it to work for AbstractCategoricalVector. I can't use StatsModels.modelcols(t::MyTerm, d::NamedTuple) = magic(t.term, d) because broadcasting comes back to haunt me. Therefore, I need to use modelcols(ft::FunctionTerm{typeof(magic),Fa,Names}, d::NamedTuple) where {Fo,Fa,Names}, but then I don't have access to the t, which means I can't access the contrasts. The use case is for an AbstractCategoricalVector to be expanded first, with the operation then applied to the resulting AbstractVecOrMat. For the Δ I added a method that captures a function to be applied elementwise beforehand, which adds some flexibility.
Without concrete code to look at I'm not sure what to say; can you share somehow? Maybe DM on slack if it's private code still?
Ideally these operations could be registered in an extension package that enhances StatsModels; until then I am defining some myself. Try this:
using StatsBase, DataFrames, ShiftedArrays, StatsModels
data = DataFrame(y = 1:5, x = categorical(['A', 'A', 'A', 'B', 'B']))
struct LagTerm{T} <: AbstractTerm
    t::T
end
f = @formula(y ~ lag(x))
f = apply_schema(f, schema(data))
Goal is for
modelcols(f.rhs.terms[1], data) == [missing, 0, 0, 0, 1]
which is what I would get from
f = @formula(y ~ x)
f = apply_schema(f, schema(data))
lag(modelcols(f.rhs.terms[1], data))
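To make the target concrete, the expected result can be reproduced by hand without any packages. This is a minimal sketch that assumes a two-level variable with treatment (dummy) coding and the first sorted level as the base; the variable names are made up for illustration.

```julia
x = ['A', 'A', 'A', 'B', 'B']

# Dummy coding for a two-level variable: 0 for the base level, 1 otherwise.
base = first(sort(unique(x)))
dummy = [Int(v != base) for v in x]

# Lag by one observation: shift down and pad the first entry with missing.
lagged = vcat(missing, dummy[1:end-1])
```

Note that comparing the result with == propagates missing, so isequal is the safer check against [missing, 0, 0, 0, 1].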