bofire
bofire copied to clipboard
[Bug]: silent failure for model_type strings whose quadratic terms are not separated by {}'s
What happened?
Having non-deliminated-by {}'s quadratic terms just causes Formulaic to drop them. The hint i had that the preferred way of dealing with quadratics in Formulaic is via {}s was one line of one test, no comment. .
Request: do not just fail silently. Add a regex that either fixes the situation or error out.
- That is, the following should all be equivalent: {x**2}, {x*x}, {x * x}, x**2, x*x, x * x, x:x
- Currently, only these are equivalent {x**2}, {x*x}, {x * x}, {x : x}
- and these are equivalently not seen and treated as null: x**2, x*x, x * x, x:x
Moreover, Formula() applied to the output of get_formula_from_string where model_type is a shortcut (e.g. "fully-quadratic") should be idempotent. Using the domain below, we have
formula = get_formula_from_string(model_type="fully-quadratic", inputs=inputs)
produces the string "1 + x1 + x2 + x3 + x1 ** 2 + x2 ** 2 + x3 ** 2 + x1:x2 + x1:x3 + x2:x3"
and applying Formula() to this string yields "1 + x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3".
get_formula_from_string should instead return "1 + x1 + x2 + x3 + {x1 ** 2} + {x2 ** 2} + {x3 ** 2} + x1:x2 + x1:x3 + x2:x3", which remains fixed under application of Formula() (desired behavior).
Behavior demonstration:
import os
from bofire.data_models.api import Domain, Inputs, Outputs
from bofire.data_models.features.api import (
ContinuousInput,
ContinuousOutput,
)
from bofire.strategies.doe.utils import get_formula_from_string
models = {
"linear-and-quadratic": "linear-and-quadratic",
"model_1": "1 + x1 + x2 + x3 + x1 ** 2 + x2 ** 2 + x3 ** 2",
"model_1a": "1 + x1 + x2 + x3 + {x1 ** 2} + {x2 ** 2} + {x3 ** 2}",
"model_2": "1 + x1 + x2 + x3 + x1*x1 + x2*x2 + x3*x3",
"model_2a": "1 + x1 + x2 + x3 + {x1*x1} + {x2*x2} + {x3*x3}",
"model_3": "1 + x1 + x2 + x3 + x1 * x1 + x2 * x2 + x3 * x3",
"model_3a": "1 + x1 + x2 + x3 + {x1 * x1} + {x2 * x2} + {x3 * x3}",
}
for name, expr in models.items():
formula = get_formula_from_string(model_type=expr, inputs=inputs)
print(
f"{name}: {expr}.\n\tAs a Formula: {formula}.\n\tExpected experiments: {len(formula) + 3}"
)
Outputs:
See also
This is the functionality going into 'get_required_number_of_experiments" in DoEStrategy:
and inside https://github.com/experimental-design/bofire/blob/7fb596e0968e48c82bb4e64e30bbef9ed1477af4/bofire/strategies/doe/utils.py#L62
then
https://github.com/experimental-design/bofire/blob/7fb596e0968e48c82bb4e64e30bbef9ed1477af4/bofire/strategies/doe/utils.py#L141
This will also fail silently inside DoEOptimality Criterion here https://github.com/experimental-design/bofire/blob/7fb596e0968e48c82bb4e64e30bbef9ed1477af4/bofire/data_models/strategies/doe.py#L51
Please provide a minimal, reproducible example of the unexpected behavior.
see above
Please paste any relevant traceback/logs produced by the example provided.
BoFire Version
0.2.1
Python Version
3.10.12
Operating System
WSL: Ubuntu-22.04
len(Formula(model string)) is not used consistently everywhere, see https://github.com/experimental-design/bofire/blob/7fb596e0968e48c82bb4e64e30bbef9ed1477af4/bofire/strategies/doe/objective.py#L70
self.formula = deepcopy(formula)
self.model_terms_string_expression = convert_formula_to_string(
domain=domain, formula=formula
)
self.n_model_terms = len(list(np.array(formula, dtype=str)))
Also concerning https://github.com/experimental-design/bofire/blob/7fb596e0968e48c82bb4e64e30bbef9ed1477af4/bofire/strategies/doe/utils.py#L155-L184
This is more complex than I can handle. Need an assist from @dlinzner-bcs @R-M-Lee @bertiqwerty e.g.
Hi @rosonaeldred,
as you maybe noticed I don't really find time to work on bofire stuff at the moment anymore :/ But now I am sitting in the train for quite a while and I had time to look into this issue a bit. Here is what I can tell about this issue, hoping it helps:
- Regarding the point about
get_formula_from_string(): Here I do not see a problem since it returns not a string but a formula object, which indeed contains all intended terms and remains unchanged w.r.t. to repeated applications of Formula() (see code and output below). I think we do not have control over the string representation of Formula objects (since they are defined in formulaic). So we will have to will live with the fact that the wavy brackets will be lost in the string representations, the returned Formula objects are, however, still unchanged under repeated application of Formula().
import os
from bofire.data_models.api import Domain, Inputs, Outputs
from bofire.data_models.features.api import (
ContinuousInput,
ContinuousOutput,
)
from bofire.strategies.doe.utils import get_formula_from_string
from bofire.data_models.domain.api import Inputs
from formulaic import Formula
domain = Domain.from_lists(
inputs=[
ContinuousInput(
key=f"x{i + 1}",
bounds=(0, 1),
)
for i in range(3)
],
outputs=[ContinuousOutput(key="y")],
)
models = {
"linear-and-quadratic": "linear-and-quadratic",
"model_1": "1 + x1 + x2 + x3 + x1 ** 2 + x2 ** 2 + x3 ** 2",
"model_1a": "1 + x1 + x2 + x3 + {x1 ** 2} + {x2 ** 2} + {x3 ** 2}",
"model_2": "1 + x1 + x2 + x3 + x1*x1 + x2*x2 + x3*x3",
"model_2a": "1 + x1 + x2 + x3 + {x1*x1} + {x2*x2} + {x3*x3}",
"model_3": "1 + x1 + x2 + x3 + x1 * x1 + x2 * x2 + x3 * x3",
"model_3a": "1 + x1 + x2 + x3 + {x1 * x1} + {x2 * x2} + {x3 * x3}",
}
for name, expr in models.items():
formula = get_formula_from_string(model_type=expr, inputs=domain.inputs)
print(name)
print("expression: ", expr)
print("formula: ", formula, ", ", list(formula))
print("Formular(formula): ", Formula(formula), ", ", list(Formula(formula)))
print()
output:
linear-and-quadratic
expression: linear-and-quadratic
formula: 1 + x1 + x2 + x3 + x1 ** 2 + x2 ** 2 + x3 ** 2 , [1, x1, x2, x3, x1 ** 2, x2 ** 2, x3 ** 2]
Formular(formula): 1 + x1 + x2 + x3 + x1 ** 2 + x2 ** 2 + x3 ** 2 , [1, x1, x2, x3, x1 ** 2, x2 ** 2, x3 ** 2]
model_1
expression: 1 + x1 + x2 + x3 + x1 ** 2 + x2 ** 2 + x3 ** 2
formula: 1 + x1 + x2 + x3 , [1, x1, x2, x3]
Formular(formula): 1 + x1 + x2 + x3 , [1, x1, x2, x3]
model_1a
expression: 1 + x1 + x2 + x3 + {x1 ** 2} + {x2 ** 2} + {x3 ** 2}
formula: 1 + x1 + x2 + x3 + x1 ** 2 + x2 ** 2 + x3 ** 2 , [1, x1, x2, x3, x1 ** 2, x2 ** 2, x3 ** 2]
Formular(formula): 1 + x1 + x2 + x3 + x1 ** 2 + x2 ** 2 + x3 ** 2 , [1, x1, x2, x3, x1 ** 2, x2 ** 2, x3 ** 2]
model_2
expression: 1 + x1 + x2 + x3 + x1*x1 + x2*x2 + x3*x3
formula: 1 + x1 + x2 + x3 , [1, x1, x2, x3]
Formular(formula): 1 + x1 + x2 + x3 , [1, x1, x2, x3]
model_2a
expression: 1 + x1 + x2 + x3 + {x1*x1} + {x2*x2} + {x3*x3}
formula: 1 + x1 + x2 + x3 + x1 * x1 + x2 * x2 + x3 * x3 , [1, x1, x2, x3, x1 * x1, x2 * x2, x3 * x3]
Formular(formula): 1 + x1 + x2 + x3 + x1 * x1 + x2 * x2 + x3 * x3 , [1, x1, x2, x3, x1 * x1, x2 * x2, x3 * x3]
model_3
expression: 1 + x1 + x2 + x3 + x1 * x1 + x2 * x2 + x3 * x3
formula: 1 + x1 + x2 + x3 , [1, x1, x2, x3]
Formular(formula): 1 + x1 + x2 + x3 , [1, x1, x2, x3]
- The parsing works according to the Formula grammar (https://matthewwardrop.github.io/formulaic/latest/guides/grammar/) described in the formulaic documentation.
I think it will be a lot of work to adapt this grammar (i.e. "fix" formula expressions automatically) while still correctly parsing the most general model formulas with formulaic. But indeed the grammar can be counter-intuitive. Therefore I would think it is better to take the second approach that you suggested and throw a warning (better not an error imo) if we suspect that the parsing in formulaic will lead to a formula that doesn't correspond to the one intended by the user.
This could be implemented like this: We throw a warning based on the graph that is built in the background by formulaic. In detail, an warning (telling the user to double check if they adhere to the formulaic grammar) is shown, iff:
- An ASTNode has an operator which is not :,*,+, or - (catching loss of terms with power functions such as
x1 ** 2) - An ASTNode with the operator * or : has two equal arguments (catching loss of terms like
x1*x1) We do not catch cases with higher order terms likex1*x2*x1(which is equivalent tox1*x2instead of{x1*x2*x2}) with this method. But we could extend 2. by throwing a warning if an argument occurs twice when you follow the child nodes until the end of the tree. However this will lead to unnecessary warnings in some cases (e.g.x1*(0*x1 + x2)).
I do not know enough about the internal mechanisms of formulaic to foresee if this will lead to unnecessary warnings in additional, a bit more exotic cases. But for sure it should warn the user in the cases you mentioned above.
- We can add a comment with a reference to the Formulaic grammar in the documentation so that everyone, in principle, knows what they have to do.
What do you think about this? I think I can implement it somewhen soon, if you or @dlinzner-bcs etc agree.
Cheers, Aaron :)