Add required variables to the `Formula` class

Open timpiperseek opened this issue 11 months ago • 6 comments

I would like to be able to do something like the following. Apologies, I am struggling to articulate exactly what I want, but effectively it is this.

Say I have the following formula: apps ~ prior_apps + I(prior_apps^2) + factor + I(prior_apps:factor)

I am wondering if it is possible to extract the rhs terms from the formula. By terms I mean ['prior_apps', 'factor'].

I have tried doing the following.

import formulaic.parser

formula_str = "apps ~ prior_apps + I(prior_apps^2) + factor + I(prior_apps:factor)"
formula_parser = formulaic.parser.DefaultFormulaParser()
tokens = list(formula_parser.get_tokens(formula_str))

But that gives me the individual tokens of the string, not the terms.

I feel like it should be possible?

timpiperseek avatar Mar 07 '24 04:03 timpiperseek

Hi @timpiperseek ,

Does something like the following work?

from formulaic import Formula
f = Formula("apps ~ prior_apps + I(prior_apps**2) + factor + prior_apps:factor")
set(
    factor
    for term in f.rhs
    for factor in term.factors
)
# This would output all the factors: {1, I(prior_apps ** 2), factor, prior_apps}

(Note that interaction terms should not be enclosed in "I(...)", since that is treated as a Python function call, and ":" inside it is not interpreted as the interaction operator.)

If you need to, you could parse the AST represented by the non-lookup factors (e.g. I(prior_apps ** 2)) to extract the variables used; prior_apps here.

If you are actually just looking for the terms, you can do: list(f.rhs) == [1, prior_apps, I(prior_apps ** 2), factor, prior_apps:factor].

Does that help?

matthewwardrop avatar Mar 08 '24 04:03 matthewwardrop

Yeah, that is really close to what I am after.

What do you mean by the following?

"If you need to, you could parse the AST represented by the non-lookup factors (e.g. I(prior_apps ** 2)) to extract the variables used; prior_apps here."

I ask because ideally it would also identify that prior_apps**2 uses the same underlying variable as prior_apps.

timpiperseek avatar Mar 08 '24 07:03 timpiperseek

Ah... Using some internal utility functions you can do:

from formulaic import Formula
from formulaic.utils.variables import get_expression_variables
f = Formula("apps ~ prior_apps + I(prior_apps**2) + factor + prior_apps:factor")
set(
    variable
    for term in f.rhs
    for factor in term.factors
    for variable in get_expression_variables(factor.expr, {})
    if "value" in variable.roles
)
# Outputs: {'factor', 'prior_apps'}

Note that get_expression_variables parses the AST of the Python expression behind each factor; formulaic uses it internally to keep track of which variables have been used when generating the model matrix.
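To illustrate the idea of walking a Python AST to find the variables an expression reads, here is a minimal sketch using only the standard-library ast module. It is not formulaic's implementation: the helper name expression_variables is hypothetical, and the list of factor expressions is written out by hand rather than taken from a Formula object. It assumes each factor is a valid Python expression, which is why it works on individual factors rather than on interaction terms such as prior_apps:factor.

```python
import ast


def expression_variables(expr: str) -> set:
    """Return the names read as values in a Python expression.

    Names used only in the function position of a call (for example
    ``I`` in ``I(x ** 2)``) are excluded. Hypothetical helper for
    illustration; not part of formulaic.
    """
    tree = ast.parse(expr, mode="eval")
    # Identify Name nodes that sit directly in the function position of a call.
    called = {
        id(node.func)
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    return {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and id(node) not in called
    }


# Factor expressions from the example formula, written out by hand.
factor_exprs = ["1", "prior_apps", "I(prior_apps ** 2)", "factor"]
variables = set().union(*(expression_variables(e) for e in factor_exprs))
print(sorted(variables))  # ['factor', 'prior_apps']
```

Because I(prior_apps ** 2) and prior_apps both reduce to the name prior_apps, the combined set contains it only once.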

matthewwardrop avatar Mar 09 '24 00:03 matthewwardrop

Oh that is absolutely awesome, thank you.

timpiperseek avatar Mar 10 '24 02:03 timpiperseek

I'll consider adding this directly to the `Formula` class as something like .required_variables.

matthewwardrop avatar Mar 11 '24 16:03 matthewwardrop

This would indeed be very handy, thx.

mayer79 avatar Jun 23 '24 09:06 mayer79