
Get rid of DataFrameStatisticalModel

Open nalimilan opened this issue 8 years ago • 20 comments

The current approach of wrapping fitted StatisticalModel objects within DataFrameStatisticalModel/DataTableStatisticalModel objects has a major limitation: we can only @delegate functions that we know about, i.e. only those defined in StatsBase. This is problematic in the long term, as custom model types defined in packages may want to support new operations.

A solution to this would be to move from the current composition approach (wrapper) to inheritance. StatisticalModel objects would be required to have some predefined fields, like coefnames, formula, modelmatrix and modelframe. The fit method from StatsModels would set these fields after fitting the model, so that they can be accessed later both by generic methods and by custom methods defined for a given model type.
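For concreteness, here is a hypothetical sketch of that convention (SketchModel and the field layout are illustrative, not an actual StatsModels API):

```julia
# Sketch: a model type with the predefined fields that StatsModels' `fit`
# would populate after fitting. A `String` stands in for a real Formula.
mutable struct SketchModel
    coef::Vector{Float64}
    coefnames::Union{Vector{String},Nothing}  # nothing when fit from a raw matrix
    formula::Union{String,Nothing}
end

# Generic code (e.g. in StatsBase) could then rely on the field directly:
coefnames(m::SketchModel) = m.coefnames

m = SketchModel([1.0, -0.5], nothing, nothing)
m.coefnames = ["(Intercept)", "x"]  # what `fit` would do after building the model matrix
coefnames(m)                        # ["(Intercept)", "x"]
```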

This solution would for example allow implementing a generic Wald test function in StatsBase (https://github.com/JuliaStats/StatsBase.jl/issues/300), which would use the coefficient names stored in the object. This is not possible with the current DataFrameStatisticalModel approach, where coefficient names can only be accessed by methods designed for DataFrameStatisticalModel, which therefore cannot live in StatsBase.
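As an illustration of why the accessor-level API suffices, a generic Wald test could be written only against coef, vcov and coefnames; this is a hypothetical sketch (FakeModel and waldtest are stand-ins, not StatsBase code):

```julia
using LinearAlgebra

# Stand-in for any fitted model exposing the abstract accessor API.
struct FakeModel
    coef::Vector{Float64}
    vcov::Matrix{Float64}
    coefnames::Vector{String}
end
coef(m::FakeModel) = m.coef
vcov(m::FakeModel) = m.vcov
coefnames(m::FakeModel) = m.coefnames

# Generic Wald test of H₀: the selected coefficients are zero.
function waldtest(m, inds)
    β = coef(m)[inds]
    V = vcov(m)[inds, inds]
    (names = coefnames(m)[inds], stat = dot(β, V \ β), dof = length(inds))
end

m = FakeModel([0.3, 1.2], [0.04 0.0; 0.0 0.09], ["(Intercept)", "x"])
waldtest(m, [2])  # stat ≈ 16.0, χ² with 1 dof
```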

nalimilan avatar Sep 11 '17 15:09 nalimilan

The key consideration here is how we generally provide high- and low-level APIs for regression models, such that you can enjoy high-level convenience while still having access to all the functionality, without overhead when you need to fit millions of models. Hence, we'd like to make sure that fields with names are not baked into that whole setup. This was the original reason that GLM's dependency on DataFrames was removed and DataFrameStatisticalModel was introduced here.

andreasnoack avatar Sep 13 '17 10:09 andreasnoack

Yes, optional fields would remain null/nothing by default, e.g. if you don't want names for your coefficients.

nalimilan avatar Sep 13 '17 10:09 nalimilan

I'm skeptical that this is the right model. Parameter names are not needed for a low-level API, and they mainly make sense in connection with some kind of data frame. In the future, we might also be able to fit models based on e.g. streaming data sources, so having prespecified fields for all possible regression models seems overly restrictive. Hence, I'm in favor of something like the existing wrapper approach, where high-level convenience is added alongside the actual fitting procedures in a wrapper object.

andreasnoack avatar Sep 13 '17 11:09 andreasnoack

Why would coefficient names be specific to data frames? Couldn't you want coefficient names even when fitting directly from a matrix or from a streaming source?

I'm not proposing to add fields for "all possible regression models": these fields are quite commonly needed for any model family. On the contrary, the current wrapper approach isn't extensible, as any new function a given model family might want to support would need to be added to StatsBase so that we can use it with @delegate. What do you suggest to fix this?

nalimilan avatar Sep 13 '17 12:09 nalimilan

What's the advantage of fields over accessor methods? It seems like that's orthogonal to the inheritance issue.

-- [email protected] http://davekleinschmidt.com 413-884-2741

On Sep 13, 2017, at 08:16, Milan Bouchet-Valat [email protected] wrote:

Why would coefficient names be specific to data frames? Couldn't you want coefficient names even when fitting directly from a matrix or from a streaming source?

I'm not proposing to add field for "all possible regression models": these fields are quite commonly needed for any model family. OTC, the current wrapper approach isn't extensible as any new function a given model family might want to support would need to be added to StatsBase so that we can use it with @delegate. What do you suggest to fix this?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

kleinschmidt avatar Sep 13 '17 12:09 kleinschmidt

The advantage is that StatsModels can set these fields after fitting the model, which is needed since the model itself has no idea of the names and formulas. We could also define setter methods which StatsModels would use, or standardize keyword arguments to pass to fit. That would allow model implementations to choose the best way to store this information.

Finally, we could have a special struct with the fields listed above, and models would just need to provide a function to get/set that object. That would have the advantage of being easily extensible (e.g. if we need to add a new field) without requiring any adjustment from model types. And when you don't need it, that object would be replaced with null/nothing (no need to nullify multiple fields).
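A minimal sketch of this "special struct" variant (ModelInfo, modelinfo and modelinfo! are hypothetical names, and a String stands in for a real Formula):

```julia
# One extensible container: StatsModels could add fields here later without
# requiring any adjustment from model types.
struct ModelInfo
    coefnames::Vector{String}
    formula::String
end

mutable struct MyModel
    coef::Vector{Float64}
    info::Union{ModelInfo,Nothing}  # nothing when the model was fit from raw matrices
end

# The only thing a model needs to provide is a getter/setter pair:
modelinfo(m::MyModel) = m.info
modelinfo!(m::MyModel, info::ModelInfo) = (m.info = info; m)

m = MyModel([2.5], nothing)
modelinfo!(m, ModelInfo(["x"], "y ~ x"))
modelinfo(m).coefnames  # ["x"]
```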

nalimilan avatar Sep 13 '17 13:09 nalimilan

I get both sides, as I am a strong proponent of accessor methods à la StatsBase.jl, but internally I use special structs heavily. In practice, I like the flexibility and quality control I get from defining each of the applicable methods my structs will use from StatsBase.RegressionModel <: StatsBase.StatisticalModel. Internally, I use special structs for handling my formulas, the pre-modeling phase, the actual model, the regression output, etc. In practice, if you want things to play well with one another, you want accessor methods. Hardcoded structs like DataFrameStatisticalModel are usually more pain than ease.

The example you cite about Wald tests for models that have coefficients and a variance-covariance matrix is very relevant. However, the way to address it is through a robust API. Methods like coefnames, vcov, coef, dof, and dof_residual should be defined at the abstract level of StatsBase.RegressionModel or StatsBase.StatisticalModel. Then it is very easy to play well with others. For a nightmare scenario, CovarianceMatrices.jl is hardcoded to work pretty much only with GLM.jl. Hardcoded structs are also more prone to errors: people might assume that the degrees of freedom are the length of the model and not take into account fixed effects, for example. Those scenarios are best handled by accessor methods.

As an aside, fit! is usually a StatsBase.jl-associated term. Implementing an advanced fit! in StatsModels.jl would lower the level, which I would oppose.

tl;dr: +1 for inheritance.

Nosferican avatar Nov 07 '17 21:11 Nosferican

It seems really hard to me to avoid some kind of wrapper and still be able to support the high and low level APIs together (or, rather, to get high-level table+formula support for free if you have low-level matrix/vector support). You need some way to keep the formulae together with the fitted model in order to do things like predict from un-observed data.

kleinschmidt avatar Nov 30 '17 04:11 kleinschmidt

Which isn't to say that that wrapper can't inherit from an abstract regression/statistical model type, but that's what the current setup does, right?

kleinschmidt avatar Nov 30 '17 04:11 kleinschmidt

For example, for instrumental variables the predictions are not modelmatrix(newmodel) * coef(model), where newmodel is derived from newdata, coefnames, the Formula, and contrasts (check coefnames if the linearly dependent variables are dropped, or verify NaN in coef if those are kept). The current wrapper allows for only one ModelFrame, when for example my models use more than four. For example,

using LinearAlgebra

function dof(obj::MyModel)
    mm = modelmatrix(obj)
    # trace of the projection ("hat") matrix, minus one if an intercept is present
    return tr(mm * inv(cholesky!(mm' * mm)) * mm') - ("(Intercept)" ∈ coefnames(obj))
end

might be a sensible default, but if there are fixed effects it would be wrong. For example, r2 is defined automatically, but is not defined for instrumental models, regardless of whether they are linear.

I believe the high-level API should be left to the packages. However, StatsModels can provide tools to make it easier, in a less restrictive manner, for example by having another module or expanding StatsBase to allow for a more fine-grained abstraction. For example,

instrumental(obj::RegressionModel) = false

and use it for r2.
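Fleshing out that idea as a hypothetical sketch (the abstract type, IVModel, and the r2 guard are all illustrative):

```julia
# A fine-grained trait that generic code can consult; packages opt in
# by overriding it for their model types.
abstract type RegressionModel end

instrumental(::RegressionModel) = false  # default: not an instrumental-variables model

struct IVModel <: RegressionModel end
instrumental(::IVModel) = true           # an IV package opts in

function r2(obj::RegressionModel)
    instrumental(obj) && throw(ArgumentError("r2 is not defined for instrumental models"))
    # ... the usual computation would go here; placeholder for the sketch:
    return nothing
end

instrumental(IVModel())  # true
```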

Nosferican avatar Nov 30 '17 04:11 Nosferican

@Nosferican The problem is not with overriding methods, that's perfectly OK and we would never define a default dof fallback as you show above as it's indeed invalid for many models. The issue is about functions which are not part of the basic API defined in StatsBase. For example, it doesn't make a lot of sense to define instrumental in StatsBase, since it does not apply to most models: we don't want to define a new function for each feature a model family would want to support.

Which isn't to say that that wrapper can't inherit from an abstract regression/statistical model type, but that's what the current setup does, right?

@kleinschmidt Yes, but then if a package wants to define e.g. a hazard function for survival models, it needs to define a method for DataFrameRegressionModel which delegates the call to the wrapped model. That's not ideal (and that clearly means all modeling packages need to depend on StatsModels).

It seems really hard to me to avoid some kind of wrapper and still be able to support the high and low level APIs together (or, rather, to get high-level table+formula support for free if you have low-level matrix/vector support). You need some way to keep the formulae together with the fitted model in order to do things like predict from un-observed data.

Why would it be a problem? That's really just composition vs. inheritance. StatisticalModel objects can be required to have a field (or several) where StatsModels functions will store the data it needs (as I proposed above).

nalimilan avatar Nov 30 '17 10:11 nalimilan

@nalimilan That's exactly why inheritance would be best. The big takeaway, though, is in specifying the fields for the data. For example, currently it is restricted to a single ModelFrame, when models might require multiple ModelFrames with the current setup. Through inheritance that is not restricted, because package developers can set up the data freely and everything plays nice with accessor methods. I believe StatsModels' ideal role is not to set up the way to store the data, but to give tools that make it easy for developers to store the data in the way that makes most sense for them. For example, ModelFrame / ModelMatrix allows them to generate the coefnames, keep track of terms / the intercept, generate the eventual modelmatrix and model_response, etc.

Nosferican avatar Nov 30 '17 18:11 Nosferican

It seems like either way, even if they don't explicitly handle formulae/tabular data themselves (like GLM), packages at least need to be aware of the tabular data API that StatsModels provides: either to delegate functions (in the current setup), or to provide a way for StatsModels functions to store the formulae/terms needed to convert tabular data into modelable form (if, like GLM.jl, they don't handle that transformation themselves).

kleinschmidt avatar Dec 04 '17 18:12 kleinschmidt

Right. The difference is that with composition all model types would have to implement the same basic StatsModels interface, while with inheritance (EDIT: with wrapping) each model type would have to delegate its custom functions (if any).

Another significant difference is that with inheritance all model types will be wrapped under a StatisticalModel object, so their actual type will be hidden. If a third-party package wants to provide functions e.g. for RegressionModel to compute different kinds of covariances, marginal effects or plot results, it will have to depend on StatsModels and have a function which will unwrap the model before dispatching on its type. Without that, it won't even be able to detect that a model is e.g. a RegressionModel. I don't find it very nice to hide the type information behind one layer, especially since types are really essential in Julia.

nalimilan avatar Dec 04 '17 21:12 nalimilan

Do you mean "under composition"? I thought the point of inheritance is that models would not all be hidden under a wrapper.

Also, wouldn't it be possible to make the wrapper parametric on the model type? That was what I had in mind, at least for the model builder stage.

kleinschmidt avatar Dec 05 '17 17:12 kleinschmidt

Do you mean "under composition"? I thought the point of inheritance is that models would not all be hidden under a wrapper.

Yeah, that wasn't very clear. By inheritance I meant the current system which wraps the model object and inherits from StatisticalModel. But that's not the best terminology here: in both cases that's composition.

Also, wouldn't it be possible to make the wrapper parametric on the model type? That was what I had in mind, at least for the model builder stage.

I guess we could do that to at least allow package to provide functions only for the relevant parametric type. But it's not possible to make a type inherit from a different abstract type depending on its type parameters, so that won't avoid the need for packages to write these delegating methods.

nalimilan avatar Dec 05 '17 18:12 nalimilan

Delegating methods is not cumbersome at all. I would think that the more complex methods, vcov and parts of fit, would most likely eventually be handled by other packages. For example, a variance-covariance package could handle those (and similarly for fitting methods, for example GLM family and link transformations). The only difference would be that the DataFrame, Formula, and whatnot would be fields in the structs.

Nosferican avatar Dec 05 '17 19:12 Nosferican

It just occurred to me how crazy it is that fit(GeneralizedLinearModel, ::Formula, ::DataFrame) does not return a GeneralizedLinearModel.

One way around this might be to add support for a generic kind of "data transformer" struct or schema to be stored alongside the model itself. (Inspired in part by the JuliaDB.ML schema, which is a kind of stripped-down formula/modelframe.) That is to say, I'm coming around to @nalimilan's suggestion that part of the StatisticalModel interface is that you have methods like datatransformer, datatransformer!, etc., default to storing the ModelFrame (or something lighter-weight that doesn't keep all the data...), and use that to generate predictors/response, get coefnames, etc.

kleinschmidt avatar May 21 '18 15:05 kleinschmidt

I had a shower thought this morning about how you could use traits to allow for some flexibility in how models implement this API. So you could have one trait value that says "I know nothing about formulae or tables, so I need a wrapper" and another that says "I can handle arbitrary data pre-processing steps if they're stored in a field called .transformation" (or something). Then we could gracefully sunset the old API by adding deprecations to the methods for the first trait.
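A holy-traits sketch of that shower thought (every name here is hypothetical, including the .transformation field convention):

```julia
# Trait values describing how a model type handles tabular data.
struct NeedsWrapper end           # "I know nothing about formulae or tables"
struct StoresTransformation end   # "I keep preprocessing in a .transformation field"

datahandling(::Type) = NeedsWrapper()  # conservative default for existing packages

# Generic code dispatches on the trait; the NeedsWrapper method is where a
# deprecation warning for the old wrapper API could live.
transformation(m::T) where {T} = transformation(datahandling(T), m)
transformation(::StoresTransformation, m) = m.transformation
transformation(::NeedsWrapper, m) = error("model must be wrapped (deprecated path)")

# A new-style model opts in to the second trait:
struct NewStyleModel
    transformation::String  # stand-in for a real transformation/schema object
end
datahandling(::Type{NewStyleModel}) = StoresTransformation()

transformation(NewStyleModel("y ~ x"))  # "y ~ x"
```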

kleinschmidt avatar Apr 04 '19 17:04 kleinschmidt

I thought we could just have packages which don't need DataFrameRegressionModel implement fit(::Type{<:MyModel}, ::FormulaTerm, ...) so that they take precedence over the fallback method defined in statsmodel.jl?

nalimilan avatar Apr 05 '19 13:04 nalimilan