tabmat
Support initializing matrices with Patsy?
I think we've discussed this, but I don't remember the conclusion and can't find an issue now.
We recommend from_pandas as the way "most users" should construct tabmat objects. from_pandas then guesses which columns should be treated as categorical. I think it would be really nice to have Patsy-like formulas as an alternative, since
- R users (including many economists) like using formulas, and
- It's easy to infer from a Patsy formula which columns are categorical, which are sparse (generally interactions with categoricals), and which are dense (everything else), so this could remove some of the guesswork from tabmat and improve performance.
I'm not sure how feasible this would be, since Patsy is a sizable library that allows for fairly sophisticated formulas and it would be quite an endeavor to replicate all of the functionality. A few ways of doing this would be
- Don't change any code, but document how Patsy can already be used to construct a dataframe that can then be passed to tabmat / glum. Warn that this involves creating a large dense matrix as an intermediate. See Twitter discussion: https://twitter.com/esantorella22/status/1447980727820296198
- Have tabmat call patsy.dmatrix with "return_type = 'dataframe'", then call tabmat.from_pandas on the resulting pd.DataFrame. That would not be any more efficient than the first option, but would just save the user a little typing and the need to install patsy. On the down side, it adds a dependency and may force creation of a very large dense matrix.
- Support very simple patsy-like formulas without having patsy as a dependency or reproducing its full functionality. That would allow the user to designate which columns should be treated as categorical in a more natural way. See Twitter discussion: https://twitter.com/esantorella22/status/1447981081358184461
- Make it so that any Patsy formula can be used to create a tabmat object -- I'm not sure how. Might be hard.
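To make the "very simple patsy-like formulas" option a bit more concrete, here is a rough sketch of how terms in a formula string could be sorted into dense, categorical, and sparse (categorical interaction) groups using only the standard library. Everything in it is an assumption for illustration: classify_terms and FormulaPlan are hypothetical names that exist in neither tabmat nor patsy, the C(...) marker is borrowed from Patsy's syntax, and only "+" and ":" are handled.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical helper, not part of tabmat, patsy, or formulaic.
# Matches a factor explicitly marked as categorical, e.g. "C(region)".
_CATEGORICAL = re.compile(r"^C\((\w+)\)$")


@dataclass
class FormulaPlan:
    """Which storage type each right-hand-side term would get."""

    response: Optional[str] = None
    dense: List[str] = field(default_factory=list)
    categorical: List[str] = field(default_factory=list)
    sparse: List[str] = field(default_factory=list)


def classify_terms(formula: str) -> FormulaPlan:
    """Classify the terms of a formula like 'y ~ a + C(b) + C(b):c'."""
    plan = FormulaPlan()
    lhs, sep, rhs = formula.partition("~")
    if sep:
        plan.response = lhs.strip() or None
    else:
        rhs = lhs  # no "~", so the whole string is the right-hand side

    for raw in rhs.split("+"):
        factors = [f.strip() for f in raw.split(":")]
        if not all(factors):
            raise ValueError(f"empty term or factor in formula: {formula!r}")
        term = ":".join(factors)
        has_categorical = any(_CATEGORICAL.match(f) for f in factors)

        if len(factors) > 1 and has_categorical:
            # Interaction with a categorical: mostly zeros, so store as sparse.
            plan.sparse.append(term)
        elif has_categorical:
            # A lone C(x) term: keep only the column name.
            plan.categorical.append(_CATEGORICAL.match(factors[0]).group(1))
        else:
            # Everything else is treated as dense.
            plan.dense.append(term)
    return plan


plan = classify_terms("y ~ age + C(region) + C(region):income")
print(plan)
```

A plan like this would only decide which matrix type each term should get; building the actual matrices, handling transformations, and choosing whether to drop a reference level would all still have to be done separately.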
I like the idea, but just want to add a word of caution from my previous experience using patsy. Patsy seems to be focused on non-regularized models. For instance, it's rather cumbersome to specify a one-hot-encoded variable in patsy without dropping a column. I'm sure we could adapt patsy to our needs though.
While thinking about this, I found this: https://github.com/matthewwardrop/formulaic, which seems to be fixing some of patsy's issues and would be easier to integrate into tabmat (since it has sparse matrix support built-in).
As info, patsy has issues with pickle, see https://github.com/pydata/patsy/issues/26.
PR #267 proposes a formulaic-based formula interface for tabmat, and Glum PR #670 does the same downstream in glum. Any comments or suggestions are much appreciated :)
Addressed by #286.