formulaic
Add required variables to the `Formula` class
Apologies, I am struggling to articulate exactly what I want, but effectively I would like to be able to do the following.
Say I have the following formula.
apps ~ prior_apps + I(prior_apps^2) + factor + I(prior_apps:factor)
I am wondering if it is possible to extract the RHS terms from the formula. By terms I mean ['prior_apps', 'factor'].
I have tried doing the following.
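from formulaic.parser import DefaultFormulaParser
formula_str = "apps ~ prior_apps + I(prior_apps^2) + factor + I(prior_apps:factor)"
formula_parser = DefaultFormulaParser()
tokens = list(formula_parser.get_tokens(formula_str))  # materialize the token iterator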
But that gets me the individual parts of the string, not the terms.
I feel like it should be possible?
Hi @timpiperseek,
Does something like the following work?
from formulaic import Formula
f = Formula("apps ~ prior_apps + I(prior_apps**2) + factor + prior_apps:factor")
set(
    factor
    for term in f.rhs
    for factor in term.factors
)
# This would output all the factors: {1, I(prior_apps ** 2), factor, prior_apps}
(Note that interaction terms should not be enclosed in "I(...)", since that is a Python function call).
If you need to, you could parse the AST represented by the non-lookup factors (e.g. `I(prior_apps ** 2)`) to extract the variables used; `prior_apps` here.
If you are actually just looking for the terms, you can do: `list(f.rhs) == [1, prior_apps, I(prior_apps ** 2), factor, prior_apps:factor]`.
Does that help?
Yeah, that is really close to what I am after.
What do you mean by:
"If you need to, you could parse the AST represented by the non-lookup factors (e.g. I(prior_apps ** 2)) to extract the variables used; prior_apps here."
I ask because ideally it would also identify that prior_apps**2 is the same underlying metric as prior_apps.
Ah... Using some internal utility functions you can do:
from formulaic import Formula
from formulaic.utils.variables import get_expression_variables
f = Formula("apps ~ prior_apps + I(prior_apps**2) + factor + prior_apps:factor")
set(
    variable
    for term in f.rhs
    for factor in term.factors
    for variable in get_expression_variables(factor.expr, {})
    if "value" in variable.roles
)
# Outputs: {'factor', 'prior_apps'}
Note that `get_expression_variables` parses the AST associated with the Python expression, which is used internally to keep track of which variables have been used when generating the model matrix.
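For intuition about what "parsing the AST" means here, below is a minimal standard-library sketch of the idea. It is not formulaic's implementation: `expression_variables` is a hypothetical helper written for illustration, and it assumes each factor expression is valid Python, so formulaic-specific syntax is not handled. It collects every name in the expression except names that are directly called as functions (such as `I`), which roughly mirrors the `"value" in variable.roles` filter above.

```python
import ast


def expression_variables(expr: str) -> set[str]:
    """Return names used as values (not as called functions) in a Python expression.

    Hypothetical helper for illustration only; not part of formulaic.
    """
    tree = ast.parse(expr, mode="eval")
    # Identify Name nodes that are the direct target of a call, e.g. `I` in `I(x ** 2)`.
    called = {
        id(node.func)
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    # Every other Name node is treated as a data variable.
    return {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and id(node) not in called
    }


# Factor expressions from the formula discussed in this thread.
factors = ["prior_apps", "I(prior_apps ** 2)", "factor"]
variables = set().union(*(expression_variables(f) for f in factors))
print(sorted(variables))  # ['factor', 'prior_apps']
```

Because `I(prior_apps ** 2)` and `prior_apps` both resolve to the name `prior_apps`, the set deduplicates them, which is the behaviour asked for above. For real use, prefer the formulaic utility shown earlier, since it understands formulaic's own conventions.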
Oh that is absolutely awesome, thank you.
I'll consider adding this directly to the `Formula` class as something like `.required_variables`.
This would indeed be very handy, thx.