StatsModels.jl

Implementing First-Difference

Open Nosferican opened this issue 5 years ago • 6 comments

I am trying to implement a transformation à la PolyTerm. However, I cannot get modelcols to come out correctly when the wrapped term is a CategoricalTerm. The formula seems to parse correctly, but the expansion behaves as it would for a ContinuousTerm.

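using CategoricalArrays, DataFrames, StatsBase, StatsModels
data = DataFrame(y = rand(10), x = rand(10), z = categorical(repeat(1:2, 5)))
# Define Δ before it is referenced in the formula
Δ(obj::AbstractVector) = diff(obj)
# A categorical vector cannot be differenced, so drop the first element to match the length
Δ(obj::AbstractCategoricalVector) = obj[2:end]
formula = @formula(Δ(y) ~ Δ(x) + Δ(z))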

struct FDTerm{T} <: AbstractTerm
    term::T
end

function StatsModels.apply_schema(t::FunctionTerm{typeof(Δ)}, sch)
    term = apply_schema(t.args_parsed[1], sch)
    FDTerm(term)
end

function StatsModels.modelcols(t::FDTerm, d::NamedTuple)
    modelcols(t.term, d)
end

sc = apply_schema(formula, schema(data))

modelcols(sc, data)

apply_schema(sc.rhs.terms[2], schema(data))

Nosferican • Mar 05 '19 20:03

The way you've defined it here, modelcols just returns the modelcols of the wrapped term, so the difference is never applied. You need something like

StatsModels.modelcols(t::FDTerm, d::NamedTuple) = Δ(t.term, d)

Plus something like

using Tables: ColumnTable
Δ(obj::ContinuousTerm, d::ColumnTable) = Δ(getproperty(d, obj.name))

I'm not sure what the behavior you want is for a categorical term though, or even what would be reasonable...

kleinschmidt • Mar 08 '19 20:03

I will try it out and get back to you. For a categorical, to match the length, it would just be Δ(obj::AbstractCategoricalVector) = obj[2:end]. I might also use a Union with the SubArray of categorical values to cover SubDataFrames.

Nosferican • Mar 08 '19 20:03

But that doesn't generate the actual numerical columns. I guess you could do

Δ(obj::CategoricalTerm, d::ColumnTable) = obj.contrasts[getproperty(d, obj.name)[2:end], :]

which is basically what modelcols does for a categorical term, just using 2:end. But this seems like a bad strategy (or at least fragile) since it assumes that every term is wrapped in a Δ...and if you're going to make that assumption you might as well require that people call it like Δ(y ~ 1 + a + b)
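To make the 2:end idea concrete, here is a minimal sketch in plain Julia that does not call the StatsModels API. The names `lvls`, `codes`, and `coded_rows` are hypothetical stand-ins for what a contrasts matrix provides, and dummy coding with "a" as the base level is an assumption made for illustration.

```julia
# Hypothetical stand-in for a contrasts lookup: one row of codes per level.
# Dummy (treatment) coding with "a" as the base level is assumed.
lvls = ["a", "b", "c"]
codes = [0 0; 1 0; 0 1]        # rows: a, b, c; columns: indicators for b and c

# Look up the coded row for every observation (hypothetical helper).
coded_rows(z) = codes[[findfirst(==(v), lvls) for v in z], :]

x = [1.0, 2.5, 4.0, 4.5, 7.0]
z = ["a", "b", "b", "c", "a"]

# Dropping the first observation leaves the codes untouched and gives the
# same number of rows as diff() applied to a continuous column.
Z = coded_rows(z[2:end])
dx = diff(x)
```

Here `Z` has four rows, one per remaining observation, so it lines up with `dx`. Note that the categorical columns are only truncated, not differenced.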

kleinschmidt • Mar 08 '19 20:03

I have successfully implemented the continuous AbstractVector case, but I still can't get it to work for an AbstractCategoricalVector. I can't use StatsModels.modelcols(t::MyTerm, d::NamedTuple) = magic(t.term, d) because broadcasting comes back to haunt me. Therefore, I need to use the modelcols(ft::FunctionTerm{typeof(magic),Fa,Names}, d::NamedTuple) where {Fo,Fa,Names} method, but then I don't have access to t, which means I can't access the contrasts. The use case is for an AbstractCategoricalVector to be expanded first, with the operation then applied to the resulting AbstractVecOrMat. For Δ, I added a method that accepts a function to be applied elementwise beforehand, which adds some flexibility.
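The "expand first, then apply the operation" order can be sketched in plain Julia as follows. This is only an illustration under stated assumptions: `firstdiff` is a hypothetical helper, not part of StatsModels, and the matrix `X` is typed in by hand to stand for already-expanded model columns (one continuous column and one dummy column).

```julia
# Hypothetical helper: first-difference a vector, or every column of a matrix.
firstdiff(v::AbstractVector) = diff(v)
firstdiff(X::AbstractMatrix) = diff(X; dims = 1)   # difference along rows

# Hand-written stand-in for expanded model columns:
# column 1 is continuous, column 2 is a 0/1 dummy for a two-level categorical.
X = [1.0 0.0;
     2.5 0.0;
     4.0 1.0;
     4.5 1.0;
     7.0 0.0]

dX = firstdiff(X)
```

Unlike dropping the first row, this differences the dummy column too, so it takes the values -1, 0, and 1 depending on whether the level changed between consecutive observations.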

Nosferican • Mar 12 '19 18:03

Without concrete code to look at I'm not sure what to say; can you share it somehow? Maybe DM on Slack if it's still private code?

kleinschmidt • Mar 12 '19 20:03

Ideally these operations could be registered in an extension package that enhances StatsModels; until then, I am defining some myself. Try this:

using CategoricalArrays, StatsBase, DataFrames, ShiftedArrays, StatsModels
data = DataFrame(y = 1:5, x = categorical(['A', 'A', 'A', 'B', 'B']))
struct LagTerm{T} <: AbstractTerm
    t::T
end
f = @formula(y ~ lag(x))
f = apply_schema(f, schema(data))

The goal is for modelcols(f.rhs.terms[1], data) to equal [missing, 0, 0, 0, 1] (compared with isequal, since == propagates missing), which is what I would get from

f = @formula(y ~ x)
f = apply_schema(f, schema(data))
lag(modelcols(f.rhs.terms[1], data))
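The target output can be reproduced without any packages, which may help when checking a LagTerm implementation. In this sketch, `lag1` is a hypothetical eager one-step lag written for illustration (it is not the ShiftedArrays function), and the dummy column is built by hand assuming 'A' is the base level.

```julia
# Hypothetical eager one-step lag: shift down by one and pad with missing.
lag1(v::AbstractVector) = vcat(missing, v[1:end-1])

x = ['A', 'A', 'A', 'B', 'B']

# Hand-built dummy column with 'A' as the assumed base level.
dummy = Int.(x .== 'B')

lagged = lag1(dummy)

# `==` would return missing here, so compare with isequal instead.
matches_goal = isequal(lagged, [missing, 0, 0, 0, 1])
```

The comparison uses `isequal` because any `==` involving `missing` evaluates to `missing` rather than `true` or `false`.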

Nosferican • Mar 12 '19 20:03