DOC: Reusing generated model specifications
Is it possible to use a "fitted" transformer to evaluate a new (but similar) dataset?
Let's have the following example:
import pandas
from formulaic import Formula
df = pandas.DataFrame({
'y': [0,1,2],
'x': ['A', 'B', 'C'],
'z': [0.3, 0.1, 0.2],
})
trans = Formula('y ~ x + z')
trans.get_model_matrix(df)
df2 = pandas.DataFrame({
'y': [3, 3, 3],
'x': ['A', 'B', 'B'],
'z': [0.3, 0.1, 0.222222222],
})
trans.get_model_matrix(df2)
Suppose that df is my training data and df2 is my testing data.
If I create X matrix for the model training it outputs:
trans.get_model_matrix(df).rhs
Intercept x[T.B] x[T.C] z
0 1.0 0 0 0.3
1 1.0 1 0 0.1
2 1.0 0 1 0.2
Category A is the reference level.
Now I want to do the same for testing data:
trans.get_model_matrix(df2).rhs
Intercept x[T.B] z
0 1.0 0 0.300000
1 1.0 1 0.100000
2 1.0 1 0.222222
As you can see, this does not persist the original design info, and the matrices for df and df2 are not compatible. The model would fail, as the number of features is not the same.
Is this already implemented somehow?
Hi @petrhrobar ,
Yes... this is easily done with the current state of formulaic.
You can just do:
import pandas
from formulaic import Formula
df = pandas.DataFrame({
'y': [0,1,2],
'x': ['A', 'B', 'C'],
'z': [0.3, 0.1, 0.2],
})
trans = Formula('y ~ x + z')
mm1 = trans.get_model_matrix(df)
df2 = pandas.DataFrame({
'y': [3, 3, 3],
'x': ['A', 'B', 'B'],
'z': [0.3, 0.1, 0.222222222],
})
mm2 = mm1.model_spec.get_model_matrix(df2)
Or, using the sugar method model_matrix
in 0.3.4+:
mm2 = model_matrix(mm1, df2)
Hope that helps. I'll leave this open for documentation purposes until the docsite is updated.
Thanks, Matthew!
I was actually able to figure it out myself yesterday from a previous issue about this topic. So I guess it kind of boils down to the documentation :/.
If I may, let me recommend and show how I want to use this: as a sklearn component:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from formulaic import model_matrix

class FormulaicTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, formula):
        self.formula = formula

    def fit(self, X, y=None):
        """Fits the estimator by materializing the formula once and
        remembering the resulting model spec."""
        self._trans = model_matrix(self.formula, X).model_spec
        return self

    def transform(self, X, y=None):
        """Transforms new data using the persisted model spec."""
        X_ = self._trans.get_model_matrix(X)
        return X_

pipe = Pipeline([
    ("formula", FormulaicTransformer("bs(yday, df=12) + wday + num_date")),
    ("scale", StandardScaler()),
    ("model", LinearRegression())
])
Since this persists the design info and can be pickled, it may be used as a proper sklearn component! This is a badass feature!
Nice! And yes... documentation will come... eventually!
This is a really cool use of Formulaic :). Maybe something like this makes sense to bring into formulaic itself at some point; or perhaps even better, upstream into sklearn.
When used in libraries, though, I do recommend using Formula(...).get_model_matrix(...)
since that way the compute context is explicitly established. When using model_matrix
the default behaviour is to make the entire locals() and globals() context available to use in formulae. For local use, that's fine... for libraries, not so much.
I have been attempting to use Formulaic and ran into the same issue. The above example does not seem to work anymore. Is there a different way of doing this now? Thanks in advance!
Hi @frederik-plum-hauschultz !
Since a while back (not sure if it was the case when I wrote this or not), the output of get_model_matrix() is a Structured instance that reflects the structure of the formula.
For example:
>>> from formulaic import Formula
>>> df = pandas.DataFrame({
'y': [0,1,2],
'x': ['A', 'B', 'C'],
'z': [0.3, 0.1, 0.2],
})
>>> trans = Formula('y ~ x + z')
>>> mm1 = trans.get_model_matrix(df)
>>> mm1
.lhs:
y
0 0
1 1
2 2
.rhs:
Intercept x[T.B] x[T.C] z
0 1.0 0 0 0.3
1 1.0 1 0 0.1
2 1.0 0 1 0.2
>>> mm1.model_spec
.lhs:
<formulaic.model_spec.ModelSpec object at 0x7f0f43f6df10>
.rhs:
<formulaic.model_spec.ModelSpec object at 0x7f0f43f6df40>
It isn't possible in 0.3.x to call methods of the nested structure directly. I might add that in 0.4.x, but am not yet 100% convinced it is a good idea (maybe about 95% at the moment, leaning toward doing it, in which case it will appear in 0.4.0 shortly; feel free to nudge me if you like the idea).
In the meantime you can do:
df2 = pandas.DataFrame({
'y': [3, 3, 3],
'x': ['A', 'B', 'B'],
'z': [0.3, 0.1, 0.222222222],
})
mm2 = mm1.model_spec.rhs.get_model_matrix(df2)
or, if you want both the lhs and rhs bits done in one step:
>>> mm1.model_spec._map(lambda spec: spec.get_model_matrix(df2))
.lhs:
y
0 3
1 3
2 3
.rhs:
Intercept x[T.B] x[T.C] z
0 1.0 0 0 0.300000
1 1.0 1 0 0.100000
2 1.0 1 0 0.222222
Hope that helps.
Thank you for this! I was btw drawn to this package not so much due to performance (which seems to be 10x faster than patsy on my setup) but the fact that it can be pickled.
How would this work if I wanted a sparse output?
If I set
mm1 = trans.get_model_matrix(df, output='sparse')
I get back the expected sparse matrices, and the model specs are
.lhs:
ModelSpec(formula=y, materializer='pandas', ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output='sparse', structure=[EncodedTermStructure(term=y, scoped_terms=[y], columns=['y'])], transform_state={}, encoder_state={'y': (<Kind.NUMERICAL: 'numerical'>, {})})
.rhs:
ModelSpec(formula=1 + x + z, materializer='pandas', ensure_full_rank=True, na_action=<NAAction.DROP: 'drop'>, output='sparse', structure=[EncodedTermStructure(term=1, scoped_terms=[1], columns=['Intercept']), EncodedTermStructure(term=x, scoped_terms=[x-], columns=['x[T.B]', 'x[T.C]']), EncodedTermStructure(term=z, scoped_terms=[z], columns=['z'])], transform_state={}, encoder_state={'x': (<Kind.CATEGORICAL: 'categorical'>, {'categories': ['A', 'B', 'C']}), 'z': (<Kind.NUMERICAL: 'numerical'>, {})})
However, when reusing the model spec:
mm1.model_spec.rhs.get_model_matrix(df2)
A dataframe is returned, and I can't pass output='sparse'.
Hi @TELSER1 !
Thanks for reaching out! There was a regression introduced in 0.4.0 that I will be fixing shortly (hopefully tonight, added #102 to track it) whereby ModelSpec.output is not respected. You can work around this by rolling back to 0.3.x, or using:
from formulaic import model_matrix
model_matrix(model_spec, <new_data>, output='sparse')
or
from formulaic.materializers.pandas import PandasMaterializer
PandasMaterializer(<new data>).get_model_matrix(model_spec, output='sparse')
Hope that helps!
Thanks for the help! In the spirit of Frederik's comment, I'm particularly interested in the serializability and sparse output functionality; I am trying to estimate some large, sparse regression models.
@TELSER1 Thanks for the context! That's largely why I wrote formulaic
too :).
@TELSER1 This has been fixed and pushed out in v0.5.0.