StatsModels.jl icon indicating copy to clipboard operation
StatsModels.jl copied to clipboard

Multipart Formula

Open matthieugomez opened this issue 8 years ago • 87 comments
trafficstars

Several arguments of a function fit typically refer to dataframe variables which are not regressors. Some examples: variables used to compute standard errors, weight variables, variables used to specify rows on which to estimate the model, high dimensional fixed effects, mixed models, etc.

It would be nice to think about the best syntax to refer to these variable. I have thought about three potential syntaxes:

  1. Define a macro for each argument that needs to capture variable names
    fit(df, @formula(y ~ x1), @weight(x2), @vcov(cluster(x3+x4)), @where(x5 >= 0), maxiter = 100)
    
  2. Define a model macro that accepts multiple arguments, i.e. @model(expr, args...).
    fit(df, @model(y ~ x1, weight = x2, vcov = cluster(x3+x4), where = (x5 >= 0)), maxiter = 100)
    
  3. Define a fit macro that accepts multiple arguments (syntax closest to reg in Stata and to @with in DataFramesMeta.jl), i.e. @fit(expr1, expr2, args...)
    @fit df y ~ x1 weight = x2 vcov = cluster(x3+x4) where = (x5 >= 0) maxiter = 100
    # or (either would work)
    @fit(df, y ~ x1, weight = x2, vcov = cluster(x3+x4), where = (x5 >= 0), maxiter = 100)
    

An additional benefit is that agreeing on a syntax would help to standardize the names of commonly used arguments like "weights" "vcov" "where" across different packages that do statistical estimations. Enforcing these keyword arguments across different statistical estimations, like in Stata, could do a lot to improve the user experience.

matthieugomez avatar May 15 '17 18:05 matthieugomez

I don't understand the use case for capturing variable names anywhere but in the actual construction of the model terms. Also, macros don't support keyword arguments.

ararslan avatar May 15 '17 19:05 ararslan

I have three difference use cases in FixedEffectModels, you can simply read the Readme. Or look at all the R packages that rely on multipart formula Macros can work with keyword arguments as they are just particular expressions. As in@test syntax

using Base.test
@test 1 ≈ 0.99 atol = 1e-2

matthieugomez avatar May 15 '17 19:05 matthieugomez

I always thought passing symbols for variable names would be enough. That doesn't work for where, but I'm not sure I like the idea of supporting where in model fitting functions, as it sounds simpler to perform this step separately.

nalimilan avatar May 15 '17 19:05 nalimilan

Maybe I will be clearer if I delve into FixedEffectModels. To estimate these kinds of models, an estimation command requires the user to specify (i) a formula (ii) a set of high dimensional fixed effects (iii) a set of variables to cluster on. A syntax that use symbol for everything except the formula, as proposed by @nalimilan, would look like that:

fit(@formula(y~x1+x2), fe = (:x3, :x4), vcov = cluster(:x5, :x6))

The syntax looks clumsy. Moreover, this syntax does not allow to use expressions such as x3&x4 in fe or vcov. That is, the syntax allows to regress on x3&x4, but does not allow to cluster on x3&x4.

To avoid these problems, package developers in R have turned to the Extended Model Formula R package. For instance, to estimate models with high dimentional fixed effects, the R package lfe uses the syntax:

felm(y~x1+x2 | x3&x4 + x4 | x6 + x7)

Although this syntax improves on the first one, I don't like it either because it is not clear what each part of the formula refers to. This is why I am pushing for a keyword syntax, something similar to

@formula(y~x1+x2, fe = x3&x4 + x4, vcov = cluster(x6 + x7))

While these examples focus on models with high dimentional fixed effects — the ones I am familiar with — these syntax issues are more common. A lot of R packages rely on the Extended Model Formulas R Package

matthieugomez avatar May 15 '17 19:05 matthieugomez

I guess what we could do is that a general macro like @model or @fit would transform every expression starting with ~ into a formula, so that you don't need to repeat @formula. But automatically transforming fe = x3 + x4 into a formula sounds difficult/impossible to me: it doesn't generalize, and yet that macro would have to be defined only once in StatsModels.

So the solutions I can imagine are like:

@fit(y ~ x1+x2, fe = (:x3, :x4), vcov = cluster(:x5, :x6))
@fit(y ~ x1+x2, fe = ~ x3 + x4, vcov = cluster(~ x5 + x6))

nalimilan avatar May 15 '17 19:05 nalimilan

We should just get better about allowing function calls in formulas. (Last time I checked, we either didn't support that at all or only kind of did.) Then you can do what R does for example with offset in Poisson regression: @formula(y ~ a + b + offset(c)) In your case it'd be @formula(y ~ x1 + x2 + fe(x3, x4) + cluster(x6, x7)).

ararslan avatar May 15 '17 19:05 ararslan

@ararslan I don't think this overloading would work. This overloading would completely abuse mathematical notations : clustered variables refer to methods to compute standard errors, not variables to regress on. Moreover, what if there are some functions defined fe or cluster in the namespace? Also, note that it is already possible to have function calls in formula in R, yet it did not prevent people to require multipart formulas.

matthieugomez avatar May 15 '17 20:05 matthieugomez

So, would one of my examples suit your needs?

nalimilan avatar May 15 '17 20:05 nalimilan

fit wouldn't even need to be a macro for the former example.

fit(@formula(y ~ x1 + x2), fe=(:x3, :x4), vcov=cluster(:x5, :x6), ...)

You just need a Symbol varargs method for cluster.

ararslan avatar May 15 '17 20:05 ararslan

Yes, that would work. But why would you need to only capture the arguments prefixed by ~? The macro could also capture everything, i.e. something like

macro fit(x, args...)
    Expr(call, :fit, :(@formula($(esc(x)))), (Base.Meta.quot(args[i]) for i in 1:length(args))...)
end 

Each package would then define the fit function.

matthieugomez avatar May 15 '17 20:05 matthieugomez

But not all arguments must be formulas. And in some cases you explicitly need to refer to local objects rather than to data frame columns.

nalimilan avatar May 15 '17 20:05 nalimilan

What about just using several @formula arguments?

andreasnoack avatar May 15 '17 20:05 andreasnoack

@nalimilan I agree, but this problem is common to every dplyr-like data manipulations (like DataFramesMeta).

@andreasnoack Something like this would work indeed.

fit(df, @formula(y ~ x1), fe = @formula(x3&x4 + x4), weight = @formula(x5), vcov = cluster(@formula(x3+x4))

It is quite verbose. I think I prefer Syntax 1

fit(df, @formula(y ~ x1), @fe(x3&x4 + x4)), @weight(x5), @vcov(cluster(x3+x4)))

but it requires to pollute the user space with macro definitions for each keyword argument.

matthieugomez avatar May 15 '17 20:05 matthieugomez

Maybe we can come back to this issue once there is a commonly accepted way to handle expressions that may or may not use variable names (i.e. something akin to a dplyr-like package). I just wanted to have this issue somewhere — it would be great to nail the syntax of estimation commands.

matthieugomez avatar May 15 '17 20:05 matthieugomez

I don't think the two issues are completely related (and that's good news since it makes it easier to fix the present one). Query macros need to deal with identifiers which most frequently refer to data frame columns, with a way to escape this occasionally (e.g. using $x, which was evoked at some point). fit/@fit is quite different, since most arguments do not refer to columns: data frame object, distribution family, number of iterations, etc. Arguments which have to refer to columns are more the exception than the rule. What we need is a way to make them not too painful to use, but not the default either. I'd say my proposals are reasonable compromises, though there are other solutions.

nalimilan avatar May 15 '17 21:05 nalimilan

Right, I agree with this.

matthieugomez avatar May 15 '17 21:05 matthieugomez

I am working on a econometrics package and would be nice for implementing estimators for panel data (cross sectional unit, time unit, cluster, instruments, etc.).

Nosferican avatar May 16 '17 01:05 Nosferican

@Nosferican Can you give some illustrations of your needs like @matthieugomez did above?

nalimilan avatar May 16 '17 07:05 nalimilan

It would be quite similar.

  • Model specification as formula
  • DataTable / DataFrame
  • Indicators for cross sectional unit and time indicator (x, t)
  • Instruments can be closely related to the formula argument
  • Clustering for standard errors (one or more features)
  • Weights for observations
fit(@formula(y ~ x1 + x2 + x3), df, @xt(x, t), family, link, estimator,
    @instruments(x3 ~ z1 + z2), @cluster(x4), @weights(x5))

Another examples of packages using expanded syntax in the formulas other than FixedEffectModels.jl is the MixedModels.jl

m = fit!(lmm(Yield ~ 1 + (1 | Batch), ds))
m2 = fit!(lmm(y ~ 1 + dept*service + (1|s) + (1|d), inst))
fm3 = fit!(lmm(Reaction ~ 1 + Days + (1+Days|Subject), slp))

Nosferican avatar May 16 '17 13:05 Nosferican

OK, looks similar indeed.

I think the design should be as generic as possible, i.e. it shouldn't require particular models to define macros like @cluster or @weights, since these are going to clash between packages. So arguments referring to variables inside the data frame should either be formulas, or (a tuple of) symbols when a formula wouldn't be natural (e.g. @xt(x, t) can be xt=(:x, :t), or even x=:x, t=:t).

To avoid the tedious requirement repeat @formula, the simplest and most systematic solution would be to implement a @fit macro which would replace all arguments starting with ~ with a formula. People would still be able to use @formula explicitly with the fit function if they prefer. Any objections to that approach?

(I would say the case of MixedModels is slightly different, since I don't think @dmbates has any complaints about the (1|x) syntax: these terms are really part of the same equation, contrary to clusters or instruments.)

nalimilan avatar May 16 '17 14:05 nalimilan

@nalimilan Your solution would work but I don't really like this syntax. Using ~ as a prefix to capture an expression is an R quirk and I would be glad to have it go away. That was the whole point of switching to @formula.

What about Syntax 2? In this syntax, @formula accepts multiple arguments, that are all captured. The captured arguments are then passed to a function _formula, which can be extended by package developers.

# Code in StatsModels
# @formula(ex, kw1 = arg1, kw2 = arg2) is transformed into  
# _formula(:(ex), kw1 = :(arg1), kw2 = :(arg2))
using StatsModels
macro formula(args...)
    Expr(:call, :_formula, (transform_expr(a) for a in args)...)
end
function transform_expr(ex)
   if isa(ex, Expr) && ex.head == :(=)
      return Expr(:kw, ex.args[1], Base.Meta.quot(ex.args[2]))
   else
      return Base.Meta.quot(ex)
  end
end
# StatsModel define _formula with 1 argument
function _formula(ex)
   StatsModels.Formula(ex.args[1], ex.args[2])
end

# Code in say FixedEffecModels
# Each package can redefine _formula with specific keyword arguments
function _formula(ex::Union{Expr, Symbol}; fe::Union{Expr, Symbol} = :nothing, weight::Union{Expr, Symbol} = :nothing, vcov::Union{Expr, Symbol} = :nothing)
  println("it works")
end
@formula(y ~ x1 + x2 + x3, fe = x1 + x2, weight = x3, vcov = cluster(x4 + x5))

The full syntax with fit is then something like:

fit(@formula(y ~ x1 + x2 + x3, fe = x1 + x2, weight = x3, vcov = cluster(x4 + x5)), maxiter = 100)

matthieugomez avatar May 16 '17 14:05 matthieugomez

The problem with that syntax is that it puts inside a common @formula call arguments which are unrelated: weights has nothing to do with vcov (or at least not more than with maxiter). I agree that ~ is a bit annoying when there is not LHS, but that's a small issue in comparison. Also, that symbol is not only used to escape an expression (which can be done using :() in Julia), it also indicates that the argument uses the formula syntax, i.e. terms and interactions can be specified using +, & and *. To me, that's a good reason to use the same symbol everywhere. If an argument doesn't need this syntax, it should just be a (list of) symbol(s).

nalimilan avatar May 16 '17 15:05 nalimilan

I believe the difference would most likely be whether we want @formula to be specific to a model specification with form response ~ variables with variables allowed to perform contrasts specified through +, *, and & or generalize it. Stata is a more pure formula system which operates with variables contrasts versus R which has enhanced formulas for on-the-fly feature generation and multipart formulas (i.e., it has a single lhs and the rhs is an array divided by the | operator). Examples include Stata i., c., #, ##, L, and D operators and R's log, poly, I, *, and multipart features.

Stata: reg y X1 i.X2 c.X3 X4#X5 X6##X7 L.X8 D.X9, cl(X10) R: log(y) ~ poly(X1, 2) + I(X2 >= 2) + X3 + X4 * X5 | X3 ~ Z1 + Z2

While I personally prefer to compute the features outside the formula in certain cases such as when computing the margins of a logistic regression (logit, multinomial, or ordinal) it is essential to keep track which variables are related and how which is done through passing the transformation in the formula. Easier cases such as with instrumental variables can be traced, but for harder cases it is almost impossible unless the feature engineering occurs in the formula. Other times it is just for convenience (e.g., weights as a function of the residuals).

While R uses the | operator to create multi-part rhs, I am a fan of keywords (pray every night for Julia to get keyword argument mapping like in R). I believe a Stata approach of a core expanded formula should be followed by , keyword arguments.

Guess my vote is in favor of keeping the formula with the bare minimum (indispensable and related) and pass additional arguments as keywords. For example, StatsModels.ModelFrame uses the formula, but passes the contrast dictionary as an additional argument.

Nosferican avatar May 16 '17 16:05 Nosferican

@nalimilan Syntax 2 is just a better version of R multipart formula. In these multipart formulas, all arguments specifying variable names are put together and separated by |. I just propose to add keyword arguments to identify each part more clearly. All arguments inside @formula (i.e. all the arguments that refer to variable names in Syntax 2) correspond to the model specification (i.e. weight, how to compute standard errors, etc). In some sense the macro @formula could be called @model.

One issue with your proposal is that it defines two different syntaxes to refer to variable names: symbols for arguments that always refer to only one variable (like weight) vs formula for arguments that may refer to multiple variable names (like fe). I think the resulting syntax looks clumsy

@fit(y~x1, fe = ~x2, weight = :w)

Or it would require to impose people to use ~varname rather than :varname.

Now, if you really don't like the idea of splitting the arguments between @formula and fit (i.e. Syntax 2), we can still use something like Syntax 3, i.e. a macro @fit that captures all arguments. Each developer can then deal with evaluating each argument separately depending on the keyword argument

matthieugomez avatar May 16 '17 17:05 matthieugomez

Or it would require to impose people to use ~varname rather than :varname.

I guess we could also allow both if it turns out to be confusing.

Your arguments agains Syntax 2 are well taken, although I tend to disagree when you say that weights has nothing to do with vcov (compared to maxiter). All arguments inside @formula (i.e. all the arguments that refer to variable names in Syntax 2), tend to correspond to the model specification (i.e. weight, how to compute standard errors, etc). In some sense the macro @formula could be called @model.

Maybe they "tend to", but that's not always the case. For example, the model family for GLMs is clearly related to the model specification, but it's not a formula. And we haven't considered more exotic models yet.

Now, if you really don't like the idea of splitting the arguments between @formula and fit (i.e. Syntax 2), we can still use something like Syntax 3, i.e. a macro @fit that captures all arguments. Each developer can then deal with evaluating each argument separately depending on the keyword argument

I don't think that works either. The macro needs to decide whether to transform arguments before it knows the type of the model, so it cannot use dispatch to let each package choose a behavior.

nalimilan avatar May 16 '17 18:05 nalimilan

The macro would just capture all the expressions and pass them to the function. The function would then specify which argument should be evaluated. Is there a problem with that?

matthieugomez avatar May 16 '17 19:05 matthieugomez

Yes, the function would receive expressions which refer to a different scope. For example, maxiter=n would be transformed into maxiter=:n, but n wouldn't exist in the function's scope (actually, the scope of the module), or it could be shadowed by a local variable, by another keyword argument... More generally, calling eval inside functions isn't recommended in Julia.

nalimilan avatar May 16 '17 19:05 nalimilan

Correct me If I'm wrong, but I think we can avoid this issue. @fit(fe = z, maxiter = n) would call _fit(fe = :z, maxiter = :n) which would return the expression :(fit(fe = :z, maxiter = n)). This expression would then be evaluated in the user environment.

Just to give a rough sketch:

# Defined in StatsModels
macro fit(args...)
    _fit((transform_expr(a) for a in args)...)
end

function transform_expr(ex)
   if isa(ex, Expr) && ex.head == :(=)
      return Expr(:kw, ex.args[1], Base.Meta.quot(ex.args[2]))
   else
      return Base.Meta.quot(ex)
  end
end

# Defined in my package
function _fit(fe = :nothing, maxiter = :nothing)
   Expr(:call, :fit, Expr(:kw, :fe, Base.Meta.quot(fe)), Expr(:kw, :maxiter, maxiter))
end

# @fit(fe = x, maxiter = n) is then transformed into fit(fe = :x, maxiter = n)

matthieugomez avatar May 16 '17 19:05 matthieugomez

Again, that's the same problem: your _fit function will conflict with the one defined in another package since its signature does not have any specific to your package.

nalimilan avatar May 16 '17 20:05 nalimilan

The keyword arguments are specific to my package, no? And how is this problem the same as the one regarding the function scope?

matthieugomez avatar May 16 '17 20:05 matthieugomez