Getting variables and lhs, rhs

Open teucer opened this issue 2 years ago • 4 comments

Currently, it is difficult to get the parts of a formula and to determine which data.frame variables (i.e. the column names) it uses. It would be useful to add some facilities to enable further processing.

For example, given the formula log(y) ~ x, we would need to invert the log to predict y. To be able to do so, we need to know the lhs and that its underlying variable is y.

teucer avatar Mar 10 '22 23:03 teucer
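To make the use case above concrete, here is a minimal sketch in plain Python of what back-transforming predictions could look like once the lhs and its variable are known. It does not use Formulaic; the helpers `split_formula`, `parse_lhs` and `back_transform` and the `INVERSES` table are hypothetical names introduced only for illustration, and the lhs is assumed to be either a bare name or a single call of the form f(var).

```python
import math
import re

# Hypothetical lookup of inverses for transforms that may wrap the response.
INVERSES = {"log": math.exp, "exp": math.log, "sqrt": lambda v: v * v}


def split_formula(formula):
    """Split 'lhs ~ rhs' on the first '~' and strip surrounding whitespace."""
    lhs, rhs = formula.split("~", 1)
    return lhs.strip(), rhs.strip()


def parse_lhs(lhs):
    """Return (transform, variable) for 'f(var)', or (None, var) for a bare name."""
    match = re.fullmatch(r"(\w+)\((\w+)\)", lhs)
    if match:
        return match.group(1), match.group(2)
    return None, lhs


def back_transform(formula, predictions):
    """Map predictions on the transformed scale back to the response variable."""
    lhs, _rhs = split_formula(formula)
    transform, variable = parse_lhs(lhs)
    if transform is None:
        return variable, list(predictions)
    inverse = INVERSES[transform]  # raises KeyError if no inverse is registered
    return variable, [inverse(p) for p in predictions]


# Predictions made on the log scale are mapped back to the scale of y.
variable, values = back_transform("log(y) ~ x", [0.0, math.log(10.0)])
```

The point of the sketch is that both pieces of information requested in this issue are needed: the transform applied on the lhs (to pick the inverse) and the name of the underlying column (to label the result).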

Hi @teucer. Thanks for reaching out.

Extracting the different parts of the formula is pretty straightforward in Formulaic 0.2.x; e.g.: y, X = model_matrix("y ~ x + y + z"). The ~ operator acts like a separator.

As of the current main branch, with changes I merged a few minutes ago, the above will still work; but so will mm = model_matrix("y ~ x + y"); mm.lhs ; mm.rhs.

As for extracting out the variables used, it's an interesting idea, and I've thought about doing it in the past. The only tricky thing is that to Formulaic, log and y are both just variables. We could do some sleuthing to figure out which one comes from the dataframe/etc, if that is useful.

matthewwardrop avatar Mar 14 '22 00:03 matthewwardrop
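One way to picture the "sleuthing" described above is sketched below using only the Python standard library. It parses a single factor such as log(y) as a Python expression, collects every bare identifier, and then separates the identifiers that match known column names from the rest. This is an illustration under stated assumptions, not how Formulaic implements it: `split_names` is a hypothetical helper, the column list is assumed to be supplied by the caller, and formula operators such as ":" are not valid Python, so the helper only applies to individual factors.

```python
import ast


def split_names(factor, columns):
    """Split identifiers in a factor like 'log(y)' into columns and other names."""
    tree = ast.parse(factor, mode="eval")
    # Every bare identifier in the expression, e.g. {'log', 'y'} for 'log(y)'.
    names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    known = set(columns)
    used_columns = sorted(names & known)  # identifiers that are dataset columns
    other_names = sorted(names - known)   # functions, modules, context variables
    return used_columns, other_names


columns = ["x", "y"]
print(split_names("log(y)", columns))
print(split_names("np.log(y + 1)", columns))
```

Because the split depends on which names happen to be columns, the result is context dependent: the same factor can resolve differently against a different dataset.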

Extracting factors/variables would be very useful for our use case. As mentioned, we need to know them for prediction purposes. Obviously they would be context-dependent, but it should be doable (?)

teucer avatar Mar 14 '22 19:03 teucer

I would also request the ability to extract variable names. The particular issue I have is how factors are encoded as columns. Given y ~ factor1 + factor2 + factor1:factor2, it is not clear how to infer the columns used, and therefore how to extract the coefficient "names".

seanv507 avatar Jun 13 '22 11:06 seanv507

Hi @seanv507, thanks for reaching out. There is already support for mapping features in the formula to columns and vice versa; see for example <output>.model_spec.[feature_names|feature_indices|structure] etc. This API needs to be ratified and documented, but support does exist. What was new in this issue request was the ability to peek inside the string name of features to extract the columns used in the dataset (e.g. "log(y)" should indicate that "y" was used). This is not (yet) implemented.

matthewwardrop avatar Jun 20 '22 17:06 matthewwardrop

@teucer Do you have any thoughts on PR #145, which addresses this issue? I will likely merge it before the end of the week (after cleanups).

matthewwardrop avatar Jun 22 '23 00:06 matthewwardrop

This is really cool, thx!

teucer avatar Jun 22 '23 19:06 teucer