formulaic
formulaic copied to clipboard
API question: Bind model spec info to data container or not
Preamble: I have a hard time understanding some of the core code parts.
1. What is the reason to add metadata to the data container?
import pandas as pd
from formulaic import model_matrix
df = pd.DataFrame({
'a': ['A', 'B', 'C'],
})
X = model_matrix("a", df)
X
is then of type formulaic.model_matrix.ModelMatrix
which is wraps a pandas.DataFrame
.
isinstance(X, pd.DataFrame)
Naively, I would expect model_matrix
to return a 2-tuple consisting of a dataframe and a ModelSpec
.
2. Inspectability
Furthermore, X
has a model_spec
property. This is, however, not inspectable, i.e. it is not listed in dir(X)
(and no autocompletion).
The ModelMatrix
type allows you to proxy any type, and yet also offer a consistent API to downstream users. Having the model spec explicitly attached to the model matrix allows you to always be able to produce a new model matrix like the one you have. This seems much more elegant to me.
Re: not appearing in dir
, good point. I'll fix that.
Why not
y, X, model_spec = model_matrix("y ~ a + b + a:b", df)
This is also clean and avoids patching data containers.
If one attaches metadata like a model spec to a pandas dataframe, then this is most likely lost when performing operations on this dataframe (like copy
).
I know that patsy does it similarly. It might be a matter of personal taste, but I would refrain from doing so. In my opinion, it would make things simpler and cleaner.
Thanks for sharing your thoughts here!
I actually really like the existing pattern, and since it is just a stylistic/artistic choice, I'm going to use my prerogative as maintainer to keep with the status quo. However, the fact that it is confusing is something I'll try to do something about :).