Extending `formulaic` to work with other input types
Context:
My team uses patsy heavily. One aspect of patsy that makes it great for our use cases is that patsy isn't strict about its input types. For example, patsy works equally well whether it is given `pd.DataFrame({"a": np.array([1, 2, 3])})` or `{"a": np.array([1, 2, 3])}`, and we have use cases for storing data as a dictionary of numpy arrays.
Problem:
We are currently in the process of replacing patsy with formulaic in our workflow, and we encountered a problem: unlike patsy, formulaic throws an error if we try passing in a dictionary of numpy arrays the way we did when using patsy. The error we get is `FormulaMaterializerNotFoundError: No materializer has been registered for input type 'builtins.dict'`. However, we noticed that formulaic's docs mention:
- extensible data input/output plugins, with implementations for:
  - input: `pandas.DataFrame`, `pyarrow.Table`
  - output: `pandas.DataFrame`, `numpy.ndarray`, `scipy.sparse.CSCMatrix`
So we suspect that the lack of support for `builtins.dict` isn't a fundamental limitation of formulaic, but rather something we need to provide ourselves if we want to use formulaic the way we have been using patsy.
Attempted solution:
We did some digging into materializers and were able to hack together something that works (i.e., passes all our existing unit tests) without really understanding what materializers are for or how they work.
In essence, we created a custom `DictMaterializer` and passed it into `model_matrix` like this (where `data` in this example is a dictionary of numpy arrays):

```python
formulaic.model_matrix(spec, data, na_action="ignore", materializer=DictMaterializer)
```

And we defined `DictMaterializer` as follows:
```python
from typing import Any, Dict, Sequence, Tuple

from interface_meta import override

from formulaic.materializers import PandasMaterializer
from formulaic.model_spec import ModelSpec


class DictMaterializer(PandasMaterializer):
    REGISTER_NAME = "builtins"
    REGISTER_INPUTS = ("builtins.dict",)
    REGISTER_OUTPUTS = ("builtins",)

    @override
    def _combine_columns(
        self, cols: Sequence[Tuple[str, Any]], spec: ModelSpec, drop_rows: Sequence[int]
    ) -> Dict:
        return {col[0]: col[1] for col in cols}
```
We recognize that our solution is probably not ideal, for 3 reasons:
- We subclassed `PandasMaterializer` instead of `FormulaMaterializer`. We did this to get something working quickly.
- We only overrode the `_combine_columns` method while totally ignoring all the other ones implemented by `PandasMaterializer`. Again, `_combine_columns` was the only method we had to override to get our unit tests to pass.
- We currently only have use cases for `na_action="ignore"`, and so in our implementation of `_combine_columns`, the `drop_rows` input is completely ignored.
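On that last point, a minimal sketch of how `drop_rows` could be honoured when combining columns into a dict (using `np.delete`; the standalone `combine_columns` below is illustrative only, not formulaic's actual API):

```python
from typing import Any, Dict, Sequence, Tuple

import numpy as np


def combine_columns(
    cols: Sequence[Tuple[str, Any]], drop_rows: Sequence[int]
) -> Dict[str, np.ndarray]:
    """Combine (name, values) pairs into a dict, dropping NA-flagged rows."""
    drop = np.asarray(sorted(drop_rows), dtype=int)
    return {name: np.delete(np.asarray(values), drop) for name, values in cols}
```

Inside a real materializer, the same `np.delete` call would simply replace the bare dict comprehension in our `_combine_columns` override.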
Help needed:
We are reaching out on this forum to see if we are on the right track, and if so, what we can do to get our solution to a more ideal state. It would also be very helpful if we could get a summary/explanation of what materializers are for and how they are meant to work.
@matthewwardrop
Hi @bnjhng. Thanks for reaching out.
In terms of the general role of materialization, did you read the generic docs here: https://matthewwardrop.github.io/formulaic/guides/formulae/#materialization ?
In terms of the right solution to this, your approach isn't crazy! It likely misses some edge-cases, but it probably works in the majority of cases. I've been tossing up adding support for dicts like this by just casting them to a pandas DataFrame in the default pandas materializer. Is there a reason that you want to avoid this?
> In terms of the general role of materialization, did you read the generic docs here: https://matthewwardrop.github.io/formulaic/guides/formulae/#materialization ?
Thanks for linking it! Somehow I missed that section as I was going through the docs.
> I've been tossing up adding support for dicts like this by just casting them to a pandas DataFrame in the default pandas materializer. Is there a reason that you want to avoid this?
We have found that very often, when performing operations that do not involve row indexing (i.e., the vast majority of data transformations), working with a dictionary of numpy arrays has a speed advantage over a pandas DataFrame.
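To illustrate the kind of transformation we mean (the data and transform here are made up for illustration; actual timings of course depend on the workload), the dict path is plain ufunc calls on the underlying arrays, while the DataFrame path routes the same math through pandas' block and index machinery:

```python
import numpy as np
import pandas as pd

data_dict = {"a": np.arange(1_000_000, dtype=float)}
data_df = pd.DataFrame(data_dict)

# Dict path: a ufunc applied directly to each array, no index alignment.
out_dict = {name: np.log1p(values) for name, values in data_dict.items()}

# DataFrame path: identical math, but through pandas' internals.
out_df = np.log1p(data_df)
```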
> It likely misses some edge-cases, but it probably works in the majority of cases.
This is indeed our conclusion! The main limitation we have encountered so far is with categorical encoding. And we have isolated the main issue to the following:
```python
>>> from formulaic.materializers.types import FactorValues
>>> import pandas as pd
>>> import numpy as np
>>>
>>> # while this works as expected:
>>> print(pd.Series(FactorValues(pd.Series(["a", "b", "c"]))))
0    a
1    b
2    c
dtype: object
>>> # this doesn't give the expected results:
>>> print(pd.Series(FactorValues(np.array(["a", "b", "c"]))))
0    abc
1     bc
2      c
dtype: object
>>> # and this straight up errors out:
>>> print(pd.Series(FactorValues(np.array([1, 2, 3]))))
TypeError: Argument 'values' has incorrect type (expected numpy.ndarray, got FactorValues)
```
The docstring of `FactorValues` says that it is:
> A convenience wrapper that surfaces a `FactorValuesMetadata` instance at `<object>.__formulaic_metadata__`. This wrapper can otherwise wrap any object and behaves just like that object.
But clearly, in the case of `numpy.ndarray`, the wrapper doesn't behave just like a `numpy.ndarray`. To get the code above to work properly, one way is to explicitly access the `__wrapped__` attribute:
```python
>>> # both of the following work as expected:
>>> print(pd.Series(FactorValues(np.array(["a", "b", "c"])).__wrapped__))
0    a
1    b
2    c
dtype: object
>>> print(pd.Series(FactorValues(np.array([1, 2, 3])).__wrapped__))
0    1
1    2
2    3
dtype: int64
```
However, there are places in the formulaic code that simply do `pandas.Series(data)` instead of `pandas.Series(data.__wrapped__)`, for example here and here.
Would it be possible to fix this limitation with formulaic?
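For concreteness, a defensive fallback like the following would make those call sites robust to both wrapped and plain inputs (a sketch only; `unwrap` is a hypothetical helper of ours, not part of formulaic, though `__wrapped__` is the attribute `FactorValues` exposes):

```python
import numpy as np
import pandas as pd


def unwrap(values):
    """Return the proxied object if `values` is a wrapper, else `values` itself."""
    return getattr(values, "__wrapped__", values)


# `pd.Series(unwrap(data))` then behaves identically for a plain ndarray and
# for a wrapper object exposing `__wrapped__`.
```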