mlr3pipelines icon indicating copy to clipboard operation
mlr3pipelines copied to clipboard

pack_formula()

Open sumny opened this issue 5 years ago • 1 comments

provide a helper that strips an environment of a formula of all variables that are not referred to by the formula. useful for i.e., PipeOpMutate and #410

sumny avatar Oct 10 '20 16:10 sumny

This is a good idea. One thing in particular I was worried about was things where the user constructs a formula where he refers to a certain variable, and then changes the variable value before executing the pipeop. A few pointers:

  • this should be a helper for the user, I don't think we should be doing this internally
  • there is the all.vars() function that determines what variables are used inside an expression. The problem is, it does not recognize if the reference is as a function call or as an ordinary variable. This makes trouble in case the user has a function and a variable with both the same name (typical candidate here is c). I think we won't get around having to walk the AST to do this well.
  • a possible problem is that even if we recognize a value is being used inside a formula, it could be that the value is "masked" by the task. So the user does ~ x == y, and y occurs both in the environment of the formula, but also in the task, the task's y will have precedence, but the function won't know that at the time of packing
  • there is the general problem that variables in a formula could be "surprise masked" by the task, e.g. if the user doesn't expect there to be a y column in the task. Or if the task changes and suddenly the user has to rename his variables just because of name clashes. Maybe we want to offer a ..() function (like cd .. in a shell) that lets the user access things outside the formula. variables in ..() get precedence before columns in a task. Variables that are not in a ..() are dropped by pack_formula() and are only available if they are in a package (so function calls still work). I think a way to implement this would be to walk the AST, check for ..()-calls, check if they are at the place of a function call or a value access, and get the respective value. The formula itself should not be changed, instead the ..() function should be an actual function that gets a value from the environment above the task column environment. This way
    1. the pack_formula() doesn't actually modify the formula
    2. it is not necessary to use pack_formula(), e.g. when the user doesn't worry about formula environments
    3. the ..() thing is still useful even if pack_formula() is not used (i.e. to avoid surprise variable name clashes)
  1. What do you think of the design?
  2. Should I write this or do you feel up to the task?

mb706 avatar Oct 12 '20 16:10 mb706