mlr3pipelines
mlr3pipelines copied to clipboard
pack_formula()
provide a helper that strips an environment of a formula of all variables that are not referred to by the formula.
useful for i.e., PipeOpMutate and #410
This is a good idea. One thing in particular I was worried about was things where the user constructs a formula where he refers to a certain variable, and then changes the variable value before executing the pipeop.
A few pointers:
- this should be a helper for the user, I don't think we should be doing this internally
- there is the
all.vars()function that determines what variables are used inside an expression. The problem is, it does not recognize if the reference is as a function call or as an ordinary variable. This makes trouble in case the user has a function and a variable with both the same name (typical candidate here isc). I think we won't get around having to walk the AST to do this well. - a possible problem is that even if we recognize a value is being used inside a formula, it could be that the value is "masked" by the task. So the user does
~ x == y, andyoccurs both in the environment of the formula, but also in the task, the task'sywill have precedence, but the function won't know that at the time of packing - there is the general problem that variables in a formula could be "surprise masked" by the task, e.g. if the user doesn't expect there to be a
ycolumn in the task. Or if the task changes and suddenly the user has to rename his variables just because of name clashes. Maybe we want to offer a..()function (likecd ..in a shell) that lets the user access things outside the formula. variables in..()get precedence before columns in a task. Variables that are not in a..()are dropped bypack_formula()and are only available if they are in a package (so function calls still work). I think a way to implement this would be to walk the AST, check for..()-calls, check if they are at the place of a function call or a value access, and get the respective value. The formula itself should not be changed, instead the..()function should be an actual function that gets a value from the environment above the task column environment. This way- the
pack_formula()doesn't actually modify the formula - it is not necessary to use
pack_formula(), e.g. when the user doesn't worry about formula environments - the
..()thing is still useful even ifpack_formula()is not used (i.e. to avoid surprise variable name clashes)
- the
- What do you think of the design?
- Should I write this or do you feel up to the task?