[Bug]: Slow MOBO
What happened?
Hi @lena-kashtelyan, I am still having issues; could you please help? I have put together a reproducible example. The first batch of the optimization uses Sobol, and I collect measurements for it. When I then request the second batch, the code hangs with many processes open and does not finish, even if I ask for a single trial.
⚠️ Now attempting to request a new trial... (This is where the process hangs and spawns many processes)
[INFO 10-20 17:40:10] ax.api.client: Generated new trial 173 with parameters {'param_a': 2, 'param_c': 0, 'param_d': 0, 'param_f': 6, 'param_h': 1, 'param_k': 0, 'param_b': 7} using GenerationNode BoTorch.
✓ Got 1 trial(s) in 164.18 seconds
Ax Bug Report: Process Hangs and Spawns Many Processes on get_next_trials()
Description
When loading an Ax experiment with 173 trials (147 completed, 26 failed) from JSON and calling client.get_next_trials(), the process hangs indefinitely and spawns many child processes, eventually consuming all system resources.
Environment
- Ax version: 1.1.2
- BoTorch version: >=0.15.1 (dependency of Ax)
- Python version: 3.12
- OS: Linux 6.8.0-60-generic
Hardware
- CPU: AMD EPYC-Genoa Processor (96 CPUs, 24 cores per socket)
- Memory: 62 GB RAM
- GPU: None
Experiment Configuration
- Optimization type: Multi-objective (2 objectives)
- Parameters: 7 parameters
  - 1 ChoiceParameter (ordered, integer type, 17 choices)
  - 6 RangeParameters (integer type)
- Parameter constraints: 2 linear inequality constraints
- Generation strategy:
  - Initial: Sobol (173 trials threshold)
  - Subsequent: BoTorch with the qLogNParEGO acquisition function (I was told it is faster; the problem persists even with the default acquisition function)
Reproduction Steps
- Download the attached files:
  - obscured_experiment.json - The experiment snapshot
  - load_and_ask_trial.py - Minimal reproduction script
- Install Ax: pip install ax-platform
- Run the reproduction script: python load_and_ask_trial.py
- Observe:
  - The script successfully loads the experiment
  - Prints experiment summary showing 173 trials (147 completed)
  - Hangs indefinitely when calling client.get_next_trials(max_trials=1)
  - Spawns many Python child processes (visible with ps aux | grep python)
Expected Behavior
client.get_next_trials() should:
- Transition from Sobol to BoTorch generation strategy (threshold met)
- Fit a model on the existing 147 completed trials
- Propose a new trial using the qLogNParEGO acquisition function
- Return within a reasonable time (seconds to minutes)
Actual Behavior
- The process hangs indefinitely
- Spawns many child processes continuously
- No error message is displayed
- Process must be killed manually
- System resources (CPU/memory) get exhausted
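Because the hung run has to be killed by hand, a bounded reproduction run may be easier to work with. The following is a minimal sketch, not part of the attached files: it starts a command in its own process group and, on timeout, kills the whole group so that any spawned children are also terminated. The helper name run_with_timeout and the 600-second limit are illustrative assumptions, and the approach is POSIX-only (it relies on start_new_session and os.killpg).

```python
import os
import signal
import subprocess
import sys
import time


def run_with_timeout(cmd, timeout_s):
    """Run cmd in its own process group and kill the group on timeout.

    Returns (finished, elapsed_seconds, return_code); return_code is None
    when the run was killed because it exceeded timeout_s.
    """
    start = time.monotonic()
    # start_new_session=True makes the child the leader of a new process
    # group, so its process group id equals its pid (POSIX only).
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return_code = proc.wait(timeout=timeout_s)
        finished = True
    except subprocess.TimeoutExpired:
        # Kill the child and everything it spawned in the same group.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the killed child
        return_code = None
        finished = False
    return finished, time.monotonic() - start, return_code


if __name__ == "__main__":
    # Hypothetical usage with the attached script; 600 s is an arbitrary cap.
    finished, elapsed, code = run_with_timeout(
        [sys.executable, "load_and_ask_trial.py"], timeout_s=600
    )
    print(f"finished={finished} elapsed={elapsed:.1f}s return_code={code}")
```

Children that move themselves into a different session would escape the group kill; this sketch does not handle that case.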
Additional Notes
- The experiment was created and trials were run successfully using the same configuration
- Only loading from JSON and requesting the next trial triggers the issue
- The experiment has valid completed trials with data
- Parameter constraints are satisfied by all completed trials
Files Attached
- obscured_experiment.json - Anonymized experiment snapshot
- load_and_ask_trial.py - Minimal reproduction script
Workarounds Attempted
None found so far.
obscured_experiment.json load_and_ask_trial.py
Please provide a minimal, reproducible example of the unexpected behavior.
Reproduction Steps
- Download the attached files:
  - obscured_experiment.json - The experiment snapshot
  - load_and_ask_trial.py - Minimal reproduction script
- Install Ax: pip install ax-platform
- Run the reproduction script: python load_and_ask_trial.py
- Observe:
  - The script successfully loads the experiment
  - Prints experiment summary showing 173 trials (147 completed)
  - Hangs indefinitely when calling client.get_next_trials(max_trials=1)
  - Spawns many Python child processes (visible with ps aux | grep python)
Please paste any relevant traceback/logs produced by the example provided.
Ax Version
latest 2.1.2
Python Version
3.12
Operating System
ubuntu
(Optional) Describe any potential fixes you've considered to the issue outlined above.
No response
Pull Request
None
Code of Conduct
- [x] I agree to follow Ax's Code of Conduct
Hello there! I ran your provided code, but the issue doesn't reproduce for me (on Ax 1.2.1). I am able to get a trial (or even multiple) without issue.
I recommend:
- Upgrade Ax to 1.2.1
- Set logger level to DEBUG to better understand where the code is hanging
- Run a debugger and step through the code to understand where the code is hanging
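As a rough sketch of the last two suggestions, the snippet below uses only the Python standard library. It assumes Ax's loggers live under the "ax" namespace (the "ax.api.client" prefix in the log above suggests this) and that it is called after Ax has been imported, so that any handlers Ax installs already exist. The function names and the 120-second interval are illustrative, not part of Ax. The faulthandler timer is a lightweight alternative to attaching a debugger: it periodically prints the stack of every thread, which usually shows where the call is stuck.

```python
import faulthandler
import logging
import sys


def enable_hang_diagnostics(logger_name="ax", dump_after_s=120.0, dump_file=None):
    """Enable DEBUG logging for one logger namespace and arm a stack-dump timer.

    dump_file must be a real file with a file descriptor; defaults to stderr.
    """
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    # Handlers already attached to this logger may filter at a higher
    # level, so lower them as well.
    for handler in logger.handlers:
        handler.setLevel(logging.DEBUG)
    if not logger.handlers:
        # No handler yet: add a plain stderr handler so DEBUG records show up.
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("[%(levelname)s %(asctime)s] %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    # Dump the tracebacks of all threads every dump_after_s seconds until
    # the timer is cancelled.
    faulthandler.dump_traceback_later(
        dump_after_s,
        repeat=True,
        file=dump_file if dump_file is not None else sys.stderr,
    )
    return logger


def disable_hang_diagnostics():
    """Cancel the periodic stack dump armed by enable_hang_diagnostics."""
    faulthandler.cancel_dump_traceback_later()
```

A possible usage is to call enable_hang_diagnostics() right before client.get_next_trials(max_trials=1) and disable_hang_diagnostics() right after it returns; the frames that repeat across successive dumps are the likely location of the hang.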