
DOC: Reusing generated model specifications

Open petrhrobar opened this issue 2 years ago • 11 comments

Is it possible to reuse a "fitted" transformer to evaluate a new (but similar) dataset?

Consider the following example:

import pandas
from formulaic import Formula

df = pandas.DataFrame({
    'y': [0,1,2],
    'x': ['A', 'B', 'C'],
    'z': [0.3, 0.1, 0.2],
})

trans = Formula('y ~ x + z')

trans.get_model_matrix(df)

df2 = pandas.DataFrame({
    'y': [3, 3, 3],
    'x': ['A', 'B', 'B'],
    'z': [0.3, 0.1, 0.222222222],
})

trans.get_model_matrix(df2)

Suppose that df is my training data and df2 is my testing data. If I create the X matrix for model training, it outputs:

trans.get_model_matrix(df)
.rhs
       Intercept  x[T.B]  x[T.C]    z
    0        1.0       0       0  0.3
    1        1.0       1       0  0.1
    2        1.0       0       1  0.2

Category A is the reference level.

Now I want to do the same for testing data:

trans.get_model_matrix(df2)
.rhs
       Intercept  x[T.B]         z
    0        1.0       0  0.300000
    1        1.0       1  0.100000
    2        1.0       1  0.222222

As you can see, this does not persist the original design info, so the matrices for df and df2 are not compatible. The model would fail because the number of features is not the same.

Is this already implemented somehow?
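To make the mismatch concrete, here is a minimal conceptual sketch in plain Python of what "persisting design info" means for a categorical column. It is not how formulaic is implemented; the helper names fit_levels and treatment_code are hypothetical and exist only for this illustration.

```python
def fit_levels(values):
    """Record the sorted category levels seen in the training data."""
    return sorted(set(values))


def treatment_code(values, levels):
    """Dummy-code `values` against fixed `levels`; the first level is the reference."""
    return [[int(v == level) for level in levels[1:]] for v in values]


train_x = ["A", "B", "C"]
test_x = ["A", "B", "B"]

# Levels learned once, from the training data only.
levels = fit_levels(train_x)

# Re-learning the levels from the test data drops the unseen level "C",
# so the encoded test data has one column instead of two.
refit = treatment_code(test_x, fit_levels(test_x))

# Reusing the training levels keeps both columns; the "C" column is all zeros.
reused = treatment_code(test_x, levels)
```

The second encoding has the same number of columns as the training matrix, which is the property the question is asking for.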

petrhrobar avatar May 01 '22 12:05 petrhrobar

Hi @petrhrobar ,

Yes... this is easily done with the current state of formulaic.

You can just do:

import pandas
from formulaic import Formula

df = pandas.DataFrame({
    'y': [0,1,2],
    'x': ['A', 'B', 'C'],
    'z': [0.3, 0.1, 0.2],
})

trans = Formula('y ~ x + z')

mm1 = trans.get_model_matrix(df)

df2 = pandas.DataFrame({
    'y': [3, 3, 3],
    'x': ['A', 'B', 'B'],
    'z': [0.3, 0.1, 0.222222222],
})

mm2 = mm1.model_spec.get_model_matrix(df2)

Or, using the sugar method model_matrix in 0.3.4+:

from formulaic import model_matrix

mm2 = model_matrix(mm1, df2)

Hope that helps. I'll leave this open for documentation purposes until the docsite is updated.

matthewwardrop avatar May 02 '22 03:05 matthewwardrop

Thanks, Matthew!

I was actually able to figure it out myself yesterday from a previous issue about this topic. So I guess it kind of boils down to the documentation :/.

If I may make a suggestion, this is how I want to use it, as an sklearn component:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from formulaic import Formula, model_matrix

class FormulaicTransformer(TransformerMixin, BaseEstimator):

    def __init__(self, formula):
        self.formula = formula

    def fit(self, X, y=None):
        """Learns the model spec (design info) from the training data"""
        self._trans = model_matrix(self.formula, X).model_spec.rhs
        return self

    def transform(self, X, y=None):
        """Applies the learned model spec to new data"""
        X_ = self._trans.get_model_matrix(X)
        return X_


pipe = Pipeline([
    ("formula", FormulaicTransformer("bs(yday, df=12) + wday + num_date")),
    ("scale", StandardScaler()),
    ("model", LinearRegression())
])

Since this persists the design info and can be pickled, it may be used as a proper sklearn component! This is a badass feature!

petrhrobar avatar May 02 '22 11:05 petrhrobar

Nice! And yes... documentation will come... eventually!

This is a really cool use of Formulaic :). Maybe something like this makes sense to bring into formulaic itself at some point; or perhaps even better, upstream into sklearn.

When used in libraries, though, I do recommend using Formula(...).get_model_matrix(...) since that way the compute context is explicitly established. When using model_matrix the default behaviour is to make the entire locals() and globals() context available to use in formulae. For local use, that's fine... for libraries, not so much.

matthewwardrop avatar May 02 '22 22:05 matthewwardrop

I have been attempting to use Formulaic and ran into the same issue. The above example does not seem to work anymore. Is there a different way of doing this now? Thanks in advance!

Hi @frederik-plum-hauschultz !

Since a while back (not sure if it was the case when I wrote this or not), the output of get_model_matrix() is a Structured instance that mirrors the structure of the formula.

For example:

>>> import pandas
>>> from formulaic import Formula

>>> df = pandas.DataFrame({
    'y': [0,1,2],
    'x': ['A', 'B', 'C'],
    'z': [0.3, 0.1, 0.2],
})

>>> trans = Formula('y ~ x + z')

>>> mm1 = trans.get_model_matrix(df)
>>> mm1
.lhs:
       y
    0  0
    1  1
    2  2
.rhs:
       Intercept  x[T.B]  x[T.C]    z
    0        1.0       0       0  0.3
    1        1.0       1       0  0.1
    2        1.0       0       1  0.2

>>> mm1.model_spec
.lhs:
    <formulaic.model_spec.ModelSpec object at 0x7f0f43f6df10>
.rhs:
    <formulaic.model_spec.ModelSpec object at 0x7f0f43f6df40>

It isn't possible in 0.3.x to call methods of the nested structure directly. I might add that in 0.4.x, but am not yet 100% convinced it is a good idea (maybe about 95% at the moment, leaning toward doing it, in which case it will appear in 0.4.0 shortly; feel free to nudge me if you like the idea).

In the meantime you can do:

df2 = pandas.DataFrame({
    'y': [3, 3, 3],
    'x': ['A', 'B', 'B'],
    'z': [0.3, 0.1, 0.222222222],
})

mm2 = mm1.model_spec.rhs.get_model_matrix(df2)

or, if you want both the lhs and rhs bits done in one step:

>>> mm1.model_spec._map(lambda spec: spec.get_model_matrix(df2))
.lhs:
       y
    0  3
    1  3
    2  3
.rhs:
       Intercept  x[T.B]  x[T.C]         z
    0        1.0       0       0  0.300000
    1        1.0       1       0  0.100000
    2        1.0       1       0  0.222222

Hope that helps.

matthewwardrop avatar Jul 18 '22 17:07 matthewwardrop

Thank you for this! By the way, I was drawn to this package not so much by its performance (it seems to be 10x faster than patsy on my setup) as by the fact that it can be pickled.

How would this work if I wanted a sparse output?

If I set mm1 = trans.get_model_matrix(df, output='sparse')

I get back the expected sparse matrices, and the model specs are

.lhs:
    ModelSpec(formula=y, materializer='pandas', ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output='sparse', structure=[EncodedTermStructure(term=y, scoped_terms=[y], columns=['y'])], transform_state={}, encoder_state={'y': (<Kind.NUMERICAL: 'numerical'>, {})})
.rhs:
    ModelSpec(formula=1 + x + z, materializer='pandas', ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output='sparse', structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=x, scoped_terms=[x-], columns=['x[T.B]', 'x[T.C]']), EncodedTermStructure(term=z, scoped_terms=[z], columns=['z'])], transform_state={}, encoder_state={'x': (<Kind.CATEGORICAL: 'categorical'>, {'categories': ['A', 'B', 'C']}), 'z': (<Kind.NUMERICAL: 'numerical'>, {})})

However, when reusing the model spec:

mm1.model_spec.rhs.get_model_matrix(df2)

A dataframe is returned, and I can't pass output='sparse'.

TELSER1 avatar Aug 27 '22 23:08 TELSER1

Hi @TELSER1 !

Thanks for reaching out! There was a regression introduced in 0.4.0 that I will be fixing shortly (hopefully tonight, added #102 to track it) whereby ModelSpec.output is not respected. You can work around this by rolling back to 0.3.x, or using:

from formulaic import model_matrix

model_matrix(model_spec, <new_data>, output='sparse')

or

from formulaic.materializers.pandas import PandasMaterializer

PandasMaterializer(<new_data>).get_model_matrix(model_spec, output='sparse')

Hope that helps!

matthewwardrop avatar Aug 28 '22 01:08 matthewwardrop

Thanks for the help! In the spirit of Frederik's comment, I'm particularly interested in the serializability and sparse output functionality; I am trying to estimate some large, sparse regression models.

TELSER1 avatar Aug 28 '22 15:08 TELSER1

@TELSER1 Thanks for the context! That's largely why I wrote formulaic too :).

matthewwardrop avatar Aug 28 '22 19:08 matthewwardrop

@TELSER1 This has been fixed and pushed out in v0.5.0 .

matthewwardrop avatar Aug 29 '22 05:08 matthewwardrop