bofire icon indicating copy to clipboard operation
bofire copied to clipboard

[Bug]: silent failure for model_type strings whose quadratic terms are not separated by {}'s

Open rosonaeldred opened this issue 4 months ago • 4 comments

What happened?

Having non-deliminated-by {}'s quadratic terms just causes Formulaic to drop them. The hint i had that the preferred way of dealing with quadratics in Formulaic is via {}s was one line of one test, no comment. .

Request: do not just fail silently. Add a regex that either fixes the situation or error out.

  • That is, the following should all be equivalent: {x**2}, {x*x}, {x * x}, x**2, x*x, x * x, x:x
  • Currently, only these are equivalent {x**2}, {x*x}, {x * x}, {x : x}
  • and these are equivalently not seen and treated as null: x**2, x*x, x * x, x:x

Moreover, Formula() applied to the output of get_formula_from_string where model_type is a shortcut (e.g. "fully-quadratic") should be idempotent. Using the domain below, we have formula = get_formula_from_string(model_type="fully-quadratic", inputs=inputs) produces the string "1 + x1 + x2 + x3 + x1 ** 2 + x2 ** 2 + x3 ** 2 + x1:x2 + x1:x3 + x2:x3" and applying Formula() to this string yields "1 + x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3". get_formula_from_string should instead return "1 + x1 + x2 + x3 + {x1 ** 2} + {x2 ** 2} + {x3 ** 2} + x1:x2 + x1:x3 + x2:x3", which remains fixed under application of Formula() (desired behavior).

Behavior demonstration:

import os

from bofire.data_models.api import Domain, Inputs, Outputs
from bofire.data_models.features.api import (
    ContinuousInput,
    ContinuousOutput,
)
from bofire.strategies.doe.utils import get_formula_from_string

models = {
    "linear-and-quadratic": "linear-and-quadratic",
    "model_1": "1 + x1 + x2 + x3 + x1 ** 2 + x2 ** 2 + x3 ** 2",
    "model_1a": "1 + x1 + x2 + x3 + {x1 ** 2} + {x2 ** 2} + {x3 ** 2}",
    "model_2": "1 + x1 + x2 + x3 + x1*x1 + x2*x2 + x3*x3",
    "model_2a": "1 + x1 + x2 + x3 + {x1*x1} + {x2*x2} + {x3*x3}",
    "model_3": "1 + x1 + x2 + x3 + x1 * x1 + x2 * x2 + x3 * x3",
    "model_3a": "1 + x1 + x2 + x3 + {x1 * x1} + {x2 * x2} + {x3 * x3}",
}

for name, expr in models.items():
    formula = get_formula_from_string(model_type=expr, inputs=inputs)
    print(
        f"{name}: {expr}.\n\tAs a Formula: {formula}.\n\tExpected experiments: {len(formula) + 3}"
    )

Outputs:

Image

See also

Image

This is the functionality going into 'get_required_number_of_experiments" in DoEStrategy:

Image

and inside https://github.com/experimental-design/bofire/blob/7fb596e0968e48c82bb4e64e30bbef9ed1477af4/bofire/strategies/doe/utils.py#L62

then

https://github.com/experimental-design/bofire/blob/7fb596e0968e48c82bb4e64e30bbef9ed1477af4/bofire/strategies/doe/utils.py#L141

This will also fail silently inside DoEOptimality Criterion here https://github.com/experimental-design/bofire/blob/7fb596e0968e48c82bb4e64e30bbef9ed1477af4/bofire/data_models/strategies/doe.py#L51

Please provide a minimal, reproducible example of the unexpected behavior.

see above

Please paste any relevant traceback/logs produced by the example provided.


BoFire Version

0.2.1

Python Version

3.10.12

Operating System

WSL: Ubuntu-22.04

rosonaeldred avatar Jul 17 '25 15:07 rosonaeldred

len(Formula(model string)) is not used consistently everywhere, see https://github.com/experimental-design/bofire/blob/7fb596e0968e48c82bb4e64e30bbef9ed1477af4/bofire/strategies/doe/objective.py#L70


        self.formula = deepcopy(formula)
        self.model_terms_string_expression = convert_formula_to_string(
            domain=domain, formula=formula
        )
        self.n_model_terms = len(list(np.array(formula, dtype=str)))

rosonaeldred avatar Jul 24 '25 06:07 rosonaeldred

Also concerning https://github.com/experimental-design/bofire/blob/7fb596e0968e48c82bb4e64e30bbef9ed1477af4/bofire/strategies/doe/utils.py#L155-L184

rosonaeldred avatar Jul 28 '25 12:07 rosonaeldred

This is more complex than I can handle. Need an assist from @dlinzner-bcs @R-M-Lee @bertiqwerty e.g.

rosonaeldred avatar Aug 04 '25 14:08 rosonaeldred

Hi @rosonaeldred,

as you maybe noticed I don't really find time to work on bofire stuff at the moment anymore :/ But now I am sitting in the train for quite a while and I had time to look into this issue a bit. Here is what I can tell about this issue, hoping it helps:

  1. Regarding the point about get_formula_from_string(): Here I do not see a problem since it returns not a string but a formula object, which indeed contains all intended terms and remains unchanged w.r.t. to repeated applications of Formula() (see code and output below). I think we do not have control over the string representation of Formula objects (since they are defined in formulaic). So we will have to will live with the fact that the wavy brackets will be lost in the string representations, the returned Formula objects are, however, still unchanged under repeated application of Formula().
import os

from bofire.data_models.api import Domain, Inputs, Outputs
from bofire.data_models.features.api import (
    ContinuousInput,
    ContinuousOutput,
)
from bofire.strategies.doe.utils import get_formula_from_string
from bofire.data_models.domain.api import Inputs
from formulaic import Formula

domain = Domain.from_lists(
    inputs=[
        ContinuousInput(
            key=f"x{i + 1}",
            bounds=(0, 1),
        )
        for i in range(3)
    ],
    outputs=[ContinuousOutput(key="y")],
)

models = {
    "linear-and-quadratic": "linear-and-quadratic",
    "model_1": "1 + x1 + x2 + x3 + x1 ** 2 + x2 ** 2 + x3 ** 2",
    "model_1a": "1 + x1 + x2 + x3 + {x1 ** 2} + {x2 ** 2} + {x3 ** 2}",
    "model_2": "1 + x1 + x2 + x3 + x1*x1 + x2*x2 + x3*x3",
    "model_2a": "1 + x1 + x2 + x3 + {x1*x1} + {x2*x2} + {x3*x3}",
    "model_3": "1 + x1 + x2 + x3 + x1 * x1 + x2 * x2 + x3 * x3",
    "model_3a": "1 + x1 + x2 + x3 + {x1 * x1} + {x2 * x2} + {x3 * x3}",
}

for name, expr in models.items():
    formula = get_formula_from_string(model_type=expr, inputs=domain.inputs)
    print(name)
    print("expression:        ", expr)
    print("formula:           ", formula, ", ", list(formula))
    print("Formular(formula): ", Formula(formula), ", ", list(Formula(formula)))
    print()

output:

linear-and-quadratic
expression:         linear-and-quadratic
formula:            1 + x1 + x2 + x3 + x1 ** 2 + x2 ** 2 + x3 ** 2 ,  [1, x1, x2, x3, x1 ** 2, x2 ** 2, x3 ** 2]
Formular(formula):  1 + x1 + x2 + x3 + x1 ** 2 + x2 ** 2 + x3 ** 2 ,  [1, x1, x2, x3, x1 ** 2, x2 ** 2, x3 ** 2]

model_1
expression:         1 + x1 + x2 + x3 + x1 ** 2 + x2 ** 2 + x3 ** 2
formula:            1 + x1 + x2 + x3 ,  [1, x1, x2, x3]
Formular(formula):  1 + x1 + x2 + x3 ,  [1, x1, x2, x3]

model_1a
expression:         1 + x1 + x2 + x3 + {x1 ** 2} + {x2 ** 2} + {x3 ** 2}
formula:            1 + x1 + x2 + x3 + x1 ** 2 + x2 ** 2 + x3 ** 2 ,  [1, x1, x2, x3, x1 ** 2, x2 ** 2, x3 ** 2]
Formular(formula):  1 + x1 + x2 + x3 + x1 ** 2 + x2 ** 2 + x3 ** 2 ,  [1, x1, x2, x3, x1 ** 2, x2 ** 2, x3 ** 2]

model_2
expression:         1 + x1 + x2 + x3 + x1*x1 + x2*x2 + x3*x3
formula:            1 + x1 + x2 + x3 ,  [1, x1, x2, x3]
Formular(formula):  1 + x1 + x2 + x3 ,  [1, x1, x2, x3]

model_2a
expression:         1 + x1 + x2 + x3 + {x1*x1} + {x2*x2} + {x3*x3}
formula:            1 + x1 + x2 + x3 + x1 * x1 + x2 * x2 + x3 * x3 ,  [1, x1, x2, x3, x1 * x1, x2 * x2, x3 * x3]
Formular(formula):  1 + x1 + x2 + x3 + x1 * x1 + x2 * x2 + x3 * x3 ,  [1, x1, x2, x3, x1 * x1, x2 * x2, x3 * x3]

model_3
expression:         1 + x1 + x2 + x3 + x1 * x1 + x2 * x2 + x3 * x3
formula:            1 + x1 + x2 + x3 ,  [1, x1, x2, x3]
Formular(formula):  1 + x1 + x2 + x3 ,  [1, x1, x2, x3]
  1. The parsing works according to the Formula grammar (https://matthewwardrop.github.io/formulaic/latest/guides/grammar/) described in the formulaic documentation.

I think it will be a lot of work to adapt this grammar (i.e. "fix" formula expressions automatically) while still correctly parsing the most general model formulas with formulaic. But indeed the grammar can be counter-intuitive. Therefore I would think it is better to take the second approach that you suggested and throw a warning (better not an error imo) if we suspect that the parsing in formulaic will lead to a formula that doesn't correspond to the one intended by the user.

This could be implemented like this: We throw a warning based on the graph that is built in the background by formulaic. In detail, an warning (telling the user to double check if they adhere to the formulaic grammar) is shown, iff:

  1. An ASTNode has an operator which is not :,*,+, or - (catching loss of terms with power functions such as x1 ** 2)
  2. An ASTNode with the operator * or : has two equal arguments (catching loss of terms like x1*x1) We do not catch cases with higher order terms like x1*x2*x1 (which is equivalent to x1*x2 instead of {x1*x2*x2}) with this method. But we could extend 2. by throwing a warning if an argument occurs twice when you follow the child nodes until the end of the tree. However this will lead to unnecessary warnings in some cases (e.g. x1*(0*x1 + x2)).

I do not know enough about the internal mechanisms of formulaic to foresee if this will lead to unnecessary warnings in additional, a bit more exotic cases. But for sure it should warn the user in the cases you mentioned above.

  1. We can add a comment with a reference to the Formulaic grammar in the documentation so that everyone, in principle, knows what they have to do.

What do you think about this? I think I can implement it somewhen soon, if you or @dlinzner-bcs etc agree.

Cheers, Aaron :)

Osburg avatar Aug 04 '25 22:08 Osburg