formulaic
formulaic copied to clipboard
Add a "safe" mode that avoids using `eval`?
In some cases, particularly in production code, it is nice to have a safe mode that avoids using eval
, at the expense of some functionality. This issue is here so this line of thought doesn't get lost, but also to see how important this is to people.
To be clear, removing eval
in the current state would basically cripple Formulaic by preventing any Python transformations (including built-in transformations). There may be a middle ground whereby we introspect the generated Python code AST, and verify that the functions to be executed are sanctioned.
This is related to an issue on patsy https://github.com/pydata/patsy/issues/156, where @FerusAndBeyond points out that running:
from formulaic import model_matrix
model_matrix('sys.exit()', ...)
will kill the running python service. Is there a way to protect formulaic from code injection?
At the moment, formulae does not use eval()
(see here), but it is still vulnerable to sys.exit()
calls if you have sys
loaded in the environment where design_matrices()
is called.
I do think we could port something similar to Formulaic and attempt to ban people calling functions from certain modules.
EDIT
You could have the following scenarios
-
sys
module is not loaded
model_matrix("sys.exit()")
No problem here (I think?). It is going to raise an error because sys
is not going to be found.
-
sys
module is loaded
model_matrix("sys.exit()")
Python service is killed
-
sys
module is loaded but developer prevented it from being used
formulaic.disable_modules(["sys"])
model_matrix("sys.exit()")
Python service is not killed. An error is raised because you can't call anything from the sys
module.
Does it make sense?
Yeah, that is correct!
It should also be noted model_matrix
is a user-convenience function. Software libraries should use Formula(...).get_model_matrix
which requires you to explicitly nominate the evaluation context, and it does not add the current evaluation context by default. You can simulate that in model_matrix
by passing None
or the the evaluation context dictionary via model_matrix(context=...)
, which then replaces the automatically imputed context.
I don't see avoiding eval
here to be that useful, since we're not actually adding any security if we will call the functions represented in the string anyway. I think the bigger thing is getting users familiar with the security risks, and using Formula().get_model_matrix()
in library code.
I'm going to close this for now. With proper context isolation (which libraries can already control), the use of eval
seems pretty sensible. Avoiding it as suggested by @tomicapretto basically means manually parsing the syntax tree and calling the function anyway. In both cases, since Formulaic parses the AST anyway, we could deny use of particular functions... but I don't see this being especially valuable. If this ever comes up again we can refresh this issue.