
[GENERAL SUPPORT]: Slow MultiObjective Small Space Parallel Get Trials

Open chanansh opened this issue 5 months ago • 12 comments

Question

I have a budget of batch_size=150 parallel runs on the black box. I have set up a strategy with a SOBOL step of batch_size trials followed by BOTORCH. Generating the N trials with ax.get_next_trials is so slow that it makes the optimisation useless. What am I doing wrong? The objective is multi-objective with 2 scalars (no noise), and the feature space is very small (see below, just 2500 possible values). It takes about 5-10 seconds per trial on an M3 MacBook Pro (only one CPU is used).

Please provide any relevant code snippet if applicable.

from ax.core.parameter import ChoiceParameter, ParameterType, RangeParameter
from ax.core.search_space import SearchSpace

search_space = SearchSpace(
    parameters=[
        # This is a discrete choice of 34 integer values, sorted in ascending order.
        ChoiceParameter(
            name="parameter_A",
            values=[128, 256, 384, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096, 6144, 8192],
            is_ordered=True,
            parameter_type=ParameterType.INT,
        ),
        # This represents a search over powers of 2, from 2^1 to 2^6.
        RangeParameter(
            name="parameter_B",
            parameter_type=ParameterType.INT,
            lower=1,
            upper=6,
        ),
        # This represents 2^0.
        ChoiceParameter(
            name="parameter_C",
            values=[0],
            is_ordered=True,
            parameter_type=ParameterType.INT,
        ),
        # This represents a search over powers of 2, from 2^0 to 2^5.
        RangeParameter(
            name="parameter_D",
            parameter_type=ParameterType.INT,
            lower=0,
            upper=5,
        ),
        # This represents a search over powers of 2, from 2^0 to 2^3.
        RangeParameter(
            name="parameter_E",
            parameter_type=ParameterType.INT,
            lower=0,
            upper=3,
        ),
        # This represents a search over powers of 2, from 2^0 to 2^6.
        RangeParameter(
            name="parameter_F",
            parameter_type=ParameterType.INT,
            lower=0,
            upper=6,
        ),
        # This represents a search over powers of 2, from 2^0 to 2^5.
        RangeParameter(
            name="parameter_G",
            parameter_type=ParameterType.INT,
            lower=0,
            upper=5,
        ),
    ]
)

linear_constraints=[
    (["parameter_D", "parameter_E", "parameter_F", "parameter_G"], "==", 6)
]


# generation strategy

gs = GenerationStrategy(steps=[
    GenerationStep(model=Generators.SOBOL, num_trials=sobol_trials, max_parallelism=batch_size),
    GenerationStep(
        model=Generators.BOTORCH_MODULAR,
        num_trials=-1,
        max_parallelism=batch_size,
    ),
])

Code of Conduct

  • [x] I agree to follow Ax's Code of Conduct

chanansh avatar Jul 16 '25 19:07 chanansh

This is a very high parallelism setting. The hypervolume-based multi-objective algorithms we use by default in the BOTORCH_MODULAR strategy are fairly computationally intensive, so I'm not surprised this takes a long time.

With this batch size, it could make sense to use a faster strategy such as qParEGO (see this BoTorch tutorial for more details).

To use this, you need to configure your BOTORCH_MODULAR step as follows:

from botorch.acquisition.multi_objective.parego import qLogNParEGO

GenerationStep(
    model=Generators.BOTORCH_MODULAR, 
    model_kwargs={"botorch_acqf_class": qLogNParEGO},
    num_trials=-1, 
    max_parallelism=batch_size,
),

It takes about 5-10 seconds per trial on an M3 MacBook Pro (only one CPU is used).

So 5-10 seconds to generate 150 points? Or to generate one point?

Taking a step back: How many batches are you able to run? If the total cardinality of the search space is 2500 and you can run 150 evaluations per batch, you could fully enumerate the search space in 17 batches. So I assume you can run fewer than that? How long does the evaluation of one batch take?
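The enumeration arithmetic can be sketched directly in pure Python (toy parameter ranges here, not the actual values above):

```python
from itertools import product
import math

# Hypothetical toy ranges standing in for the discrete parameters above.
values = {
    "parameter_A": [128, 256, 384],  # a few choice values
    "parameter_D": range(0, 3),      # small integer ranges
    "parameter_E": range(0, 3),
}

# Every configuration is a point in the cross-product of per-parameter values.
grid = list(product(*values.values()))
print(len(grid))  # 3 * 3 * 3 = 27 configurations

# With 150 evaluations per batch, full enumeration needs ceil(|grid| / 150) batches.
print(math.ceil(len(grid) / 150))  # 1 batch for this toy grid
```

For a 2500-point space and 150-point batches the same arithmetic gives ceil(2500 / 150) = 17 batches.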

linear_constraints=[(["parameter_D", "parameter_E", "parameter_F", "parameter_G"], "==", 6)]

This doesn't seem to be used, but just wanted to point out that linear equality constraints are currently not supported (only inequality constraints).
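For what it's worth, an equality constraint is mathematically equivalent to a pair of opposing inequalities. A pure-Python sketch (illustrative only, not the Ax API; whether a given backend accepts both directions is a separate question) on the actual D/E/F/G bounds above:

```python
from itertools import product

# All (D, E, F, G) points with D in 0..5, E in 0..3, F in 0..6, G in 0..5.
grid = list(product(range(6), range(4), range(7), range(6)))

# "sum == 6" selects exactly the same points as the pair of
# inequalities "sum <= 6" and "sum >= 6".
eq = [p for p in grid if sum(p) == 6]
ineq = [p for p in grid if sum(p) <= 6 and sum(p) >= 6]

assert eq == ineq
print(len(eq))  # → 72 feasible (D, E, F, G) combinations
```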

Balandat avatar Jul 17 '25 13:07 Balandat

Hi,

  • 10 seconds for a single point.
  • The linear constraint is a typo; I meant inequality.
  • I meant to run 2-3 batches.

chanansh avatar Jul 27 '25 09:07 chanansh

@Balandat - wouldn't a for loop of suggestions (i.e. repeated get_next_trial calls) result in the same suggestion over and over? If not, why not? Isn't it just looking for the max of the acquisition function?

chanansh avatar Aug 05 '25 10:08 chanansh

wouldn't a for loop of suggestions (i.e. repeated get_next_trial calls) result in the same suggestion over and over? If not, why not? Isn't it just looking for the max of the acquisition function?

It would not; the API adds the features of trials in the CANDIDATE stage (which new trials are in by default) as "pending points" (see here), so the acquisition function can take these points into account when generating the next suggestion. In other words, with pending points the acquisition function will have a different maximum than when first called.
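The mechanism can be sketched in pure Python (illustrative only, not the actual Ax/BoTorch code): maximizing an acquisition surrogate that is suppressed near pending points yields a different argmax on each call:

```python
import math

# Toy 1-D acquisition function with two bumps, peaking near 0.3 and 0.7.
def base_acq(x: float) -> float:
    return math.exp(-(x - 0.3) ** 2 / 0.01) + 0.8 * math.exp(-(x - 0.7) ** 2 / 0.01)

# Pending points reduce the acquisition value nearby, so the next
# maximization lands somewhere else (a local-penalization-style sketch).
def penalized_acq(x: float, pending: list[float]) -> float:
    penalty = sum(math.exp(-(x - p) ** 2 / 0.01) for p in pending)
    return base_acq(x) - penalty

grid = [i / 1000 for i in range(1001)]
pending: list[float] = []
for _ in range(2):
    best = max(grid, key=lambda x: penalized_acq(x, pending))
    pending.append(best)

print(pending)  # two distinct suggestions: near 0.3 first, then near 0.7
```

The real implementation conditions the acquisition function on pending points in a more principled way, but the effect is the same: sequential calls do not repeat the same candidate.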

Balandat avatar Aug 05 '25 12:08 Balandat

@Balandat I think I found an issue, please correct me if I am wrong:

ax.get_next_trials is naive and just runs a for loop (albeit, as you said, it keeps state in the form of candidate trials):

        for _ in range(max_trials):
            try:
                params, trial_index = self.get_next_trial(
                    ttl_seconds=ttl_seconds, fixed_features=fixed_features
                )
                trials_dict[trial_index] = params
            except OptimizationComplete as err:
                logger.info(
                    f"Encountered exception indicating optimization completion: {err}"
                )
                return trials_dict, True

However, the qEHVI family of algorithms takes q=batch_size as an input and handles the batch in a "smart" way. For example, in the BoTorch docs:

    # optimize
    candidates, _ = optimize_acqf(
        acq_function=acq_func,
        bounds=standard_bounds,
        q=BATCH_SIZE,
        num_restarts=NUM_RESTARTS,
        raw_samples=RAW_SAMPLES,  # used for initialization heuristic
        options={"batch_limit": 5, "maxiter": 200},
        sequential=True,
    )

Notice how q is passed in order to get q candidates.

Could this be the issue with batch MOO in AxDev?

chanansh avatar Aug 05 '25 16:08 chanansh

@chanansh which version of Ax are you referring to? The current version has an updated API and doesn't define a get_next_trial() method (only a get_next_trials() one).

Balandat avatar Aug 05 '25 16:08 Balandat

ax-platform 1.0.0 Adaptive Experimentation

strange. let me check.

I can see the signature

    def get_next_trial(
        self,
        ttl_seconds: int | None = None,
        force: bool = False,
        fixed_features: FixedFeatures | None = None,
    ) -> tuple[TParameterization, int]:

and I can confirm that version.py contains:

__version__ = version = '1.0.0'
__version_tuple__ = version_tuple = (1, 0, 0)

get_next_trials seems to just be a for loop inside (with some constraints).

chanansh avatar Aug 05 '25 18:08 chanansh

I think you are looking at the legacy AxClient, not the new Client API (showcased on Ax website, e.g. here: https://ax.dev/docs/tutorials/quickstart/), @chanansh ?

lena-kashtelyan avatar Sep 19 '25 18:09 lena-kashtelyan

I was on 1.0.0, I will check. P.S. There is no updated documentation for MOBO - https://ax.dev/docs/0.5.0/tutorials/multiobjective_optimization/ - if you choose a newer version, the page no longer exists.

chanansh avatar Sep 21 '25 10:09 chanansh

Hi @chanansh , sorry we'd lost track of this issue! Multi-objective optimization is built into the Client API natively: please check out this recipe: https://ax.dev/docs/recipes/multi-objective-optimization.

lena-kashtelyan avatar Oct 12 '25 21:10 lena-kashtelyan

Hi @lena-kashtelyan , I still have issues, can you please help? I have put together a reproducible example. The first batch of the optimization is SOBOL and I get some measurements. Then, when I run the second batch, the code hangs (many processes open) and does not finish (even if I ask for a single trial).

⚠️ Now attempting to request a new trial... (This is where the process hangs and spawns many processes)

[INFO 10-20 17:40:10] ax.api.client: Generated new trial 173 with parameters {'param_a': 2, 'param_c': 0, 'param_d': 0, 'param_f': 6, 'param_h': 1, 'param_k': 0, 'param_b': 7} using GenerationNode BoTorch. ✓ Got 1 trial(s) in 164.18 seconds

Ax Bug Report: Process Hangs and Spawns Many Processes on get_next_trials()

Description

When loading an Ax experiment with 173 trials (147 completed, 26 failed) from JSON and calling client.get_next_trials(), the process hangs indefinitely and spawns many child processes, eventually consuming all system resources.

Environment

  • Ax version: 1.1.2
  • BoTorch version: >=0.15.1 (dependency of Ax)
  • Python version: 3.12
  • OS: Linux 6.8.0-60-generic

Hardware

  • CPU: AMD EPYC-Genoa Processor (96 CPUs, 24 cores per socket)
  • Memory: 62 GB RAM
  • GPU: None

Experiment Configuration

  • Optimization type: Multi-objective (2 objectives)
  • Parameters: 7 parameters
    • 1 ChoiceParameter (ordered, integer type, 17 choices)
    • 6 RangeParameters (integer type)
  • Parameter constraints: 2 linear inequality constraints
  • Generation strategy:
    • Initial: Sobol (173 trials threshold)
    • Subsequent: BoTorch with the qLogNParEGO acquisition function (I was told it is faster; the problem persists even with the default acquisition function).

Reproduction Steps

  1. Download the attached files:

    • obscured_experiment.json - The experiment snapshot
    • load_and_ask_trial.py - Minimal reproduction script
  2. Install Ax:

    pip install ax-platform
    
  3. Run the reproduction script:

    python load_and_ask_trial.py
    
  4. Observe:

    • The script successfully loads the experiment
    • Prints experiment summary showing 173 trials (147 completed)
    • Hangs indefinitely when calling client.get_next_trials(max_trials=1)
    • Spawns many Python child processes (visible with ps aux | grep python)

Expected Behavior

client.get_next_trials() should:

  1. Transition from Sobol to BoTorch generation strategy (threshold met)
  2. Fit a model on the existing 147 completed trials
  3. Propose a new trial using the qLogNParEGO acquisition function
  4. Return within a reasonable time (seconds to minutes)

Actual Behavior

  • The process hangs indefinitely
  • Spawns many child processes continuously
  • No error message is displayed
  • Process must be killed manually
  • System resources (CPU/memory) get exhausted

Additional Notes

  • The experiment was created and trials were run successfully using the same configuration
  • Only loading from JSON and requesting the next trial triggers the issue
  • The experiment has valid completed trials with data
  • Parameter constraints are satisfied by all completed trials

Files Attached

  1. obscured_experiment.json - Anonymized experiment snapshot
  2. load_and_ask_trial.py - Minimal reproduction script

Workarounds Attempted

None found so far.

obscured_experiment.json load_and_ask_trial.py

chanansh avatar Oct 20 '25 14:10 chanansh

kind reminder @lena-kashtelyan

chanansh avatar Nov 25 '25 08:11 chanansh