formulaic
Getting variables and lhs, rhs
Currently it is difficult to get the parts of a formula and to tell which data.frame variables (i.e. column names) it uses. It would be useful to add some facilities to enable further processing.
For example, if you have the formula `log(y) ~ x`, we would need to invert log to predict y. To be able to do so, we need to know the lhs and that the variable is y.
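To make the use case concrete, here is a minimal sketch of the back-transformation step, using only the Python standard library. It does not use Formulaic; the `INVERSES` table and the helpers `split_lhs` and `back_transform` are hypothetical names introduced for illustration, and the sketch assumes the left-hand side is either a bare variable or a single call such as `log(y)`.

```python
import math
import re

# Hypothetical lookup of transforms and their inverses; extend as needed.
INVERSES = {"log": math.exp, "exp": math.log, "sqrt": lambda v: v * v}


def split_lhs(formula):
    """Return (transform, variable) for the left-hand side of a formula.

    The transform is None when the lhs is a bare variable name.
    """
    lhs = formula.partition("~")[0].strip()
    match = re.fullmatch(r"(\w+)\((\w+)\)", lhs)
    if match:
        return match.group(1), match.group(2)
    return None, lhs


def back_transform(formula, predictions):
    """Map predictions on the transformed scale back to the original variable."""
    transform, variable = split_lhs(formula)
    if transform is None:
        return variable, list(predictions)
    inverse = INVERSES[transform]
    return variable, [inverse(p) for p in predictions]


# Predictions of log(y) are converted back to predictions of y.
variable, y_hat = back_transform("log(y) ~ x", [0.0, math.log(10.0)])
```

This only works because both the transform name and the underlying variable are known, which is the information the request asks the library to expose.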
Hi @teucer. Thanks for reaching out.
Extracting the different parts of the formula is pretty straightforward in Formulaic 0.2.x; e.g.: `y, X = model_matrix("y ~ x + y + z")`. The `~` operator acts like a separator.
As of the current main branch, with changes I merged a few minutes ago, the above will still work; but so will `mm = model_matrix("y ~ x + y"); mm.lhs; mm.rhs`.
As for extracting the variables used, it's an interesting idea, and I've thought about doing it in the past. The only tricky thing is that to formulaic, `log` and `y` are both just variables. We could do some sleuthing to figure out which one comes from the dataframe/etc, if that is useful.
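One way such sleuthing could look, sketched with the standard-library `ast` module rather than Formulaic's own parser: collect every name that appears in a factor's expression and keep only those that match a column of the dataset. The function `dataset_variables` is a hypothetical helper, and the sketch assumes each factor is a valid Python expression (so formula operators such as `:` must already have been split off).

```python
import ast


def dataset_variables(factor, columns):
    """Return the names used in `factor` that are also dataset columns.

    Both `log` and `y` show up as plain names in the expression; only the
    ones found in `columns` are treated as data variables.
    """
    tree = ast.parse(factor, mode="eval")
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(names & set(columns))


columns = ["x", "y", "z"]
used_in_log = dataset_variables("log(y)", columns)
used_in_sum = dataset_variables("np.log(x + z) / y", columns)
```

The result is context dependent: if the dataset had a column named `log`, it would be reported as well, so a real implementation would need a rule for resolving such clashes.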
Extracting factors/variables would be very useful for our use case. As I said, for prediction purposes we need to know them. Obviously they would be context dependent, but it should be doable (?)
I would also request the ability to extract variable names. The particular issue I would have is how factors are encoded as columns. Given `y ~ factor1 + factor2 + factor1:factor2`, it is not clear how to infer the columns used, and therefore how to extract the coefficient "names".
Hi @seanv507, thanks for reaching out. There is already support for mapping features in the formula to columns and vice versa; see for example `<output>.model_spec.[feature_names|feature_indices|structure]` etc. This API needs to be ratified and documented, but support does exist. What was new in this issue request was the ability to peek inside the string name of features to extract the columns used in the dataset (e.g. "log(y)" should indicate that "y" was used). This is not (yet) implemented.
@teucer Do you have any thoughts on PR #145, which addresses this issue? I will likely merge it before the end of the week (after cleanups).
This is really cool, thx!