[GENERAL SUPPORT]: Slow MultiObjective Small Space Parallel Get Trials
Question
I have a budget of batch_size=150 parallel runs on the blackbox.
I have set up a strategy that uses SOBOL with batch_size, followed by BOTORCH.
The generation of the N trials by ax.get_next_trials is so slow that it makes the optimisation useless.
What am I doing wrong?
The objective is multi-objective, with 2 scalars (no noise).
The feature space is very small (see below; just 2500 possible values).
It takes about 5-10 seconds per trial on an M3 MacBook Pro (only one CPU is used).
Please provide any relevant code snippet if applicable.
search_space = SearchSpace(
    parameters=[
        # This is a discrete choice of 34 integer values, sorted in ascending order.
        ChoiceParameter(
            name="parameter_A",
            values=[128, 256, 384, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096, 6144, 8192],
            is_ordered=True,
            value_type="int",
        ),
        # This represents a search over powers of 2, from 2^1 to 2^6.
        RangeParameter(
            name="parameter_B",
            lower=1,
            upper=6,
            value_type="int",
        ),
        # This represents 2^0.
        ChoiceParameter(
            name="parameter_C",
            values=[0],
            is_ordered=True,
            value_type="int",
        ),
        # This represents a search over powers of 2, from 2^0 to 2^5.
        RangeParameter(
            name="parameter_D",
            lower=0,
            upper=5,
            value_type="int",
        ),
        # This represents a search over powers of 2, from 2^0 to 2^3.
        RangeParameter(
            name="parameter_E",
            lower=0,
            upper=3,
            value_type="int",
        ),
        # This represents a search over powers of 2, from 2^0 to 2^6.
        RangeParameter(
            name="parameter_F",
            lower=0,
            upper=6,
            value_type="int",
        ),
        # This represents a search over powers of 2, from 2^0 to 2^5.
        RangeParameter(
            name="parameter_G",
            lower=0,
            upper=5,
            value_type="int",
        ),
    ]
)
linear_constraints = [
    (["parameter_D", "parameter_E", "parameter_F", "parameter_G"], "==", 6)
]
# generation strategy
gs = GenerationStrategy(steps=[
    GenerationStep(
        model=Generators.SOBOL,
        num_trials=sobol_trials,
        max_parallelism=batch_size,
    ),
    GenerationStep(
        model=Generators.BOTORCH_MODULAR,
        num_trials=-1,
        max_parallelism=batch_size,
    ),
])
This is a very high parallelism setting. The hypervolume-based multi-objective algorithms we use by default in the BOTORCH_MODULAR strategy are somewhat computationally intensive, so I'm not surprised this takes a long time.
With this batch size, it could make sense to use a faster strategy such as qParEGO (see this BoTorch tutorial for more details).
To use this, you need to configure your BOTORCH_MODULAR step as follows:
from botorch.acquisition.multi_objective.parego import qLogNParEGO

GenerationStep(
    model=Generators.BOTORCH_MODULAR,
    model_kwargs={"botorch_acqf_class": qLogNParEGO},
    num_trials=-1,
    max_parallelism=batch_size,
),
> It takes about 5-10 seconds per trial on an M3 MacBook Pro (only one CPU is used).
So 5-10 seconds to generate 150 points? Or to generate one point?
Taking a step back: How many batches are you able to run? If the total cardinality of the search space is 2500 and you can run 150 evaluations per batch, you could fully enumerate the search space in 17 batches. So I assume you can run fewer than that? How long does the evaluation of one batch take?
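As a quick sanity check on the arithmetic above, here is a small sketch using only the numbers quoted in this thread (150 points per batch, 2500 configurations, 5-10 s of generation time per point):

```python
import math

batch_size = 150          # parallel evaluations per batch, as stated above
total_configs = 2500      # search-space cardinality reported by the poster
secs_per_point = (5, 10)  # reported generation time per point

# Batches needed to fully enumerate the search space:
batches_to_enumerate = math.ceil(total_configs / batch_size)
print(batches_to_enumerate)  # 17

# Candidate-generation time alone for one full batch at the reported rate:
low = secs_per_point[0] * batch_size / 60
high = secs_per_point[1] * batch_size / 60
print(f"{low:.1f}-{high:.1f} minutes per batch")  # 12.5-25.0 minutes
```

So even before evaluating the blackbox, each batch would spend on the order of a quarter to half an hour just generating candidates sequentially.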
> linear_constraints=[(["parameter_D", "parameter_E", "parameter_F", "parameter_G"], "==", 6)]
This doesn't appear to be used, but I just wanted to point out that linear equality constraints are currently not supported (only inequality constraints are).
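Since only inequality constraints are supported, one common generic workaround (a sketch of the idea, not Ax-specific API) is to encode an equality as a pair of opposing inequalities: `sum == 6` becomes `sum <= 6` together with `-sum <= -6`. Whether the resulting zero-volume feasible region behaves well in a given optimizer is a separate question.

```python
# Generic illustration (not Ax API): an equality constraint on a sum of
# parameters is logically equivalent to two opposing inequality constraints.

def satisfies_equality(values, bound=6):
    return sum(values) == bound

def satisfies_inequality_pair(values, bound=6):
    s = sum(values)
    # upper bound (s <= 6) plus lower bound (-s <= -6, i.e. s >= 6)
    return s <= bound and -s <= -bound

configs = [(1, 2, 2, 1), (0, 0, 3, 3), (1, 1, 1, 1), (2, 2, 2, 2)]
for c in configs:
    assert satisfies_equality(c) == satisfies_inequality_pair(c)
```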
Hi,
- 10 seconds for a single point.
- The linear constraint is a typo; I meant inequality.
- I meant to run 2-3 batches.
@Balandat - wouldn't a for loop over suggestions (aka get_next_trial) result in the same suggestion over and over? If not, why not? Isn't it just looking for the max of the acquisition function?
> wouldn't a for loop over suggestions (aka get_next_trial) result in the same suggestion over and over? If not, why not? Isn't it just looking for the max of the acquisition function?
It would not; the API adds the features of trials in the CANDIDATE stage (which new trials are in by default) as "pending points" (see here), and by doing so the acquisition function can take these points into account when generating the next suggestion. In other words, with pending points the acquisition function will have a different maximum than when first called.
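To see why pending points prevent repeated suggestions, here is a toy illustration (deliberately not Ax/BoTorch code; the "acquisition" is a stand-in that scores a candidate by its distance to the nearest already-known point):

```python
# Toy illustration: why tracking "pending points" changes the acquisition
# argmax between successive get_next_trial() calls. The acquisition here is
# pure exploration -- a candidate scores higher the farther it is from every
# observed or pending point.

def acq(x, taken):
    # distance to the closest point we already know about
    return min(abs(x - t) for t in taken)

candidates = [i / 10 for i in range(11)]  # grid 0.0, 0.1, ..., 1.0
observed = [0.0, 1.0]

# Naive loop: ignore earlier suggestions -> the same argmax every time.
naive = [max(candidates, key=lambda x: acq(x, observed)) for _ in range(3)]

# With pending points: each suggestion is appended before the next argmax,
# so subsequent calls see a different acquisition landscape.
pending = []
for _ in range(3):
    best = max(candidates, key=lambda x: acq(x, observed + pending))
    pending.append(best)

print(naive)    # [0.5, 0.5, 0.5] -- duplicates
print(pending)  # [0.5, 0.2, 0.7] -- three distinct points
```

The same mechanism is what makes Ax's one-at-a-time generation loop behave like sequential greedy batch selection rather than returning the same point q times.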
@Balandat I think I found an issue, please correct me if I am wrong:
ax.get_next_trials is naive and just does a for-loop (although, as you said, it keeps state for the candidates):
for _ in range(max_trials):
    try:
        params, trial_index = self.get_next_trial(
            ttl_seconds=ttl_seconds, fixed_features=fixed_features
        )
        trials_dict[trial_index] = params
    except OptimizationComplete as err:
        logger.info(
            f"Encountered exception indicating optimization completion: {err}"
        )
        return trials_dict, True
However, the qEHVI family of algorithms takes q=batch_size as an input in order to do this in a "smart" way.
For example, in the BoTorch docs:
# optimize
candidates, _ = optimize_acqf(
    acq_function=acq_func,
    bounds=standard_bounds,
    q=BATCH_SIZE,
    num_restarts=NUM_RESTARTS,
    raw_samples=RAW_SAMPLES,  # used for initialization heuristic
    options={"batch_limit": 5, "maxiter": 200},
    sequential=True,
)
Notice how q is passed in so that q candidates are generated at once.
Could this be the issue with batch MOO in AxDev?
@chanansh which version of Ax are you referring to? The current version has an updated API and doesn't define a get_next_trial() method (only a get_next_trials() one).
ax-platform 1.0.0
strange. let me check.
I can see the signature
def get_next_trial(
    self,
    ttl_seconds: int | None = None,
    force: bool = False,
    fixed_features: FixedFeatures | None = None,
) -> tuple[TParameterization, int]:
and I can confirm that version.py contains:
__version__ = version = '1.0.0'
__version_tuple__ = version_tuple = (1, 0, 0)
get_next_trials seems to be just a for loop inside (with some constraints).
I think you are looking at the legacy AxClient, not the new Client API (showcased on Ax website, e.g. here: https://ax.dev/docs/tutorials/quickstart/), @chanansh ?
I was on 1.0.0; I will check. P.S. there is no updated documentation for MOBO - https://ax.dev/docs/0.5.0/tutorials/multiobjective_optimization/ - if you choose a newer version, the page no longer exists.
Hi @chanansh , sorry we'd lost track of this issue! Multi-objective optimization is built into the Client API natively: please check out this recipe: https://ax.dev/docs/recipes/multi-objective-optimization.
Hi @lena-kashtelyan , I still have issues - can you please help? I have put together reproducible code. The first batch of the optimization is SOBOL and I get some measurements. Then, when I run the second batch, the code hangs (many processes open) and does not finish (even if I ask for a single trial).
⚠️ Now attempting to request a new trial... (This is where the process hangs and spawns many processes)
[INFO 10-20 17:40:10] ax.api.client: Generated new trial 173 with parameters {'param_a': 2, 'param_c': 0, 'param_d': 0, 'param_f': 6, 'param_h': 1, 'param_k': 0, 'param_b': 7} using GenerationNode BoTorch. ✓ Got 1 trial(s) in 164.18 seconds
Ax Bug Report: Process Hangs and Spawns Many Processes on get_next_trials()
Description
When loading an Ax experiment with 173 trials (147 completed, 26 failed) from JSON and calling client.get_next_trials(), the process hangs indefinitely and spawns many child processes, eventually consuming all system resources.
Environment
- Ax version: 1.1.2
- BoTorch version: >=0.15.1 (dependency of Ax)
- Python version: 3.12
- OS: Linux 6.8.0-60-generic
Hardware
- CPU: AMD EPYC-Genoa Processor (96 CPUs, 24 cores per socket)
- Memory: 62 GB RAM
- GPU: None
Experiment Configuration
- Optimization type: Multi-objective (2 objectives)
- Parameters: 7 parameters
- 1 ChoiceParameter (ordered, integer type, 17 choices)
- 6 RangeParameters (integer type)
- Parameter constraints: 2 linear inequality constraints
- Generation strategy:
- Initial: Sobol (173 trials threshold)
- Subsequent: BoTorch with the qLogNParEGO acquisition function (I was told it is faster; the problem persists even with the default acquisition function).
Reproduction Steps
1. Download the attached files:
   - `obscured_experiment.json` - the experiment snapshot
   - `load_and_ask_trial.py` - minimal reproduction script
2. Install Ax: `pip install ax-platform`
3. Run the reproduction script: `python load_and_ask_trial.py`
4. Observe:
   - The script successfully loads the experiment
   - It prints an experiment summary showing 173 trials (147 completed)
   - It hangs indefinitely when calling `client.get_next_trials(max_trials=1)`
   - It spawns many Python child processes (visible with `ps aux | grep python`)
Expected Behavior
client.get_next_trials() should:
- Transition from Sobol to BoTorch generation strategy (threshold met)
- Fit a model on the existing 147 completed trials
- Propose a new trial using the qLogNParEGO acquisition function
- Return within a reasonable time (seconds to minutes)
Actual Behavior
- The process hangs indefinitely
- Spawns many child processes continuously
- No error message is displayed
- Process must be killed manually
- System resources (CPU/memory) get exhausted
Additional Notes
- The experiment was created and trials were run successfully using the same configuration
- Only loading from JSON and requesting the next trial triggers the issue
- The experiment has valid completed trials with data
- Parameter constraints are satisfied by all completed trials
Files Attached
- `obscured_experiment.json` - anonymized experiment snapshot
- `load_and_ask_trial.py` - minimal reproduction script
Workarounds Attempted
None found so far.
Kind reminder, @lena-kashtelyan.