formulaic icon indicating copy to clipboard operation
formulaic copied to clipboard

Add `include_intercept` to the `Formula` constructor?

Open teucer opened this issue 2 years ago • 6 comments

I think there is a discrepancy in the output of rhs:

  1. formula "y~x" -> rhs = "x+1"
  2. formula "y~x-1" -> rhs = "x"
  3. formula "y~x+0" -> rhs = "x"

I would have expected to get rhs = "x-1" or rhs = "x+0" for the second and third case.

teucer avatar Apr 06 '22 11:04 teucer

Hi @teucer ! Actually this is behaving as expected. In Wilkinson formulae, -1 and +0 both have the effect of removing the intercept. Is there something I'm missing?

matthewwardrop avatar Apr 06 '22 20:04 matthewwardrop

We want to be able to say model_matrix(rhs). But this would add back the intercept for 2nd and 3rd cases.

Maybe model_matrix should handle it:

  1. y~x
  2. ~x
  3. x

There could be a difference between 2 and 3. Right now they are handled the same way.

teucer avatar Apr 07 '22 06:04 teucer

Hmmmm... I'm a bit confused as to what you are trying to achieve here.

There was a bug in formulaic.model_matrix related to ingesting pre-processed formula that is now fixed, and will go out in the next point release, but basically you should be able to do this:

>>> import pandas
>>> from formulaic import Formula, model_matrix
>>> f = Formula("y ~ x - 1")
>>> f.rhs
x
>>> model_matrix(f.rhs, pandas.DataFrame({"x": [1,2,3]}))
	x
0	1
1	2
2	3

Why are you re-processing string representations of the right hand side?

matthewwardrop avatar Apr 07 '22 22:04 matthewwardrop

The issue is that in the current version

>>> model_matrix(f.rhs, pandas.DataFrame({"x": [1,2,3]}))
# -> FormulaInvalidError

>>> model_matrix("x", pandas.DataFrame({"x": [1,2,3]}))
        Intercept  x
0	      1.0  1
1	      1.0  2
2	      1.0  3

Given the bug, we were trying to do with the string.

The more fundamental questions are, should there be a difference:

  1. between Formula("~x") and Formula("x")
  2. if f = Formula("y ~ x - 1"), between model_matrix("x", df) and model_matrix(f.rhs, df)

My opinion:

  1. Yes -> the first one is with intercept, the second one without
  2. No -> see point 1.

teucer avatar Apr 08 '22 09:04 teucer

Hi again @teucer ,

The bug is now fixed, and I can roll out a new version for you if this is blocking you; otherwise I'll wait a few more days so it can also include some categorical encoding enhancements.

Your proposal is not unreasonable, but for consistency with expectations set by patsy and R, I'm going to make the executive choice not to treat ~x and x differently.

However, formulaic does (under the hood) support suppression of the addition of the intercept. For example, you could do:

>>> from formulaic import Formula
>>> from formulaic.parser import FormulaParser
>>> Formula(FormulaParser().get_terms("x", include_intercept=False))
x

It might be sensible to just add the include_intercept option in the Formula constructor, if that would be helpful to you.

matthewwardrop avatar Apr 08 '22 22:04 matthewwardrop

Having include_intercept in the constructor would be really helpful.

teucer avatar Apr 09 '22 01:04 teucer

Hi again!

As we move toward 1.0.0 I'm going back through all the issues and deciding things one way or another. Here, I'm deciding not to include this in the Formula constructor; however, since this issue was created it was moved to the parser constructor. So you can now do:

from formulaic import Formula
from formulaic.parser import DefaultFormulaParser
Formula("x", _parser=DefaultFormulaParser(include_intercept=False))

Due to modularity, and conflation with the Structured nature of Formula, this is no longer something I think would be a good idea.

Thanks again for sharing your thoughts and helping make Formulaic great!

I'll close this one out for now.

matthewwardrop avatar Apr 22 '23 21:04 matthewwardrop