[Bug]: Slow MOBO

Open chanansh opened this issue 1 month ago • 1 comment

What happened?

Hi @lena-kashtelyan, I still have issues; can you please help? I have put together a reproducible example. The first batch of the optimization uses Sobol and I get some measurements. Then, when I run the second batch, the code hangs with many processes open and does not finish, even if I ask for a single trial.

⚠️ Now attempting to request a new trial... (This is where the process hangs and spawns many processes)

[INFO 10-20 17:40:10] ax.api.client: Generated new trial 173 with parameters {'param_a': 2, 'param_c': 0, 'param_d': 0, 'param_f': 6, 'param_h': 1, 'param_k': 0, 'param_b': 7} using GenerationNode BoTorch.

✓ Got 1 trial(s) in 164.18 seconds

Ax Bug Report: Process Hangs and Spawns Many Processes on get_next_trials()

Description

When loading an Ax experiment with 173 trials (147 completed, 26 failed) from JSON and calling client.get_next_trials(), the process hangs indefinitely and spawns many child processes, eventually consuming all system resources.

Environment

  • Ax version: 1.1.2
  • BoTorch version: >=0.15.1 (dependency of Ax)
  • Python version: 3.12
  • OS: Linux 6.8.0-60-generic

Hardware

  • CPU: AMD EPYC-Genoa Processor (96 CPUs, 24 cores per socket)
  • Memory: 62 GB RAM
  • GPU: None

Experiment Configuration

  • Optimization type: Multi-objective (2 objectives)
  • Parameters: 7 parameters
    • 1 ChoiceParameter (ordered, integer type, 17 choices)
    • 6 RangeParameters (integer type)
  • Parameter constraints: 2 linear inequality constraints
  • Generation strategy:
    • Initial: Sobol (173 trials threshold)
    • Subsequent: BoTorch with the qLogNParEGO acquisition function (I was told it is faster; the problem persists even with the default acquisition function).

Reproduction Steps

  1. Download the attached files:

    • obscured_experiment.json - The experiment snapshot
    • load_and_ask_trial.py - Minimal reproduction script
  2. Install Ax:

    pip install ax-platform
    
  3. Run the reproduction script:

    python load_and_ask_trial.py
    
  4. Observe:

    • The script successfully loads the experiment
    • Prints experiment summary showing 173 trials (147 completed)
    • Hangs indefinitely when calling client.get_next_trials(max_trials=1)
    • Spawns many Python child processes (visible with ps aux | grep python)

Expected Behavior

client.get_next_trials() should:

  1. Transition from Sobol to BoTorch generation strategy (threshold met)
  2. Fit a model on the existing 147 completed trials
  3. Propose a new trial using the qLogNParEGO acquisition function
  4. Return within a reasonable time (seconds to minutes)

Actual Behavior

  • The process hangs indefinitely
  • Spawns many child processes continuously
  • No error message is displayed
  • Process must be killed manually
  • System resources (CPU/memory) get exhausted
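
Because the process has to be killed manually, it may help to run the reproduction script under a hard time limit that also cleans up any children it spawned. The following is a minimal sketch, not part of the attached files: `run_with_timeout` is a hypothetical helper, it is POSIX-only, and it assumes the spawned children stay in the script's process group.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd in its own process group; kill the whole group on timeout.

    Returns the exit code, or None if the time limit was hit. POSIX only.
    """
    # start_new_session=True makes the child a process-group leader, so
    # processes it spawns share its group unless they detach themselves.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        try:
            # The group id equals the pid of the group leader.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited between the timeout and the kill
        proc.wait()  # reap the child so it does not linger as a zombie
        return None
```

For example, `run_with_timeout(["python", "load_and_ask_trial.py"], 600)` would return `None` if the script is still running after ten minutes; the 600-second limit is an arbitrary choice.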

Additional Notes

  • The experiment was created and trials were run successfully using the same configuration
  • Only loading from JSON and requesting the next trial triggers the issue
  • The experiment has valid completed trials with data
  • Parameter constraints are satisfied by all completed trials

Files Attached

  1. obscured_experiment.json - Anonymized experiment snapshot
  2. load_and_ask_trial.py - Minimal reproduction script

Workarounds Attempted

None found so far.

obscured_experiment.json load_and_ask_trial.py

Please provide a minimal, reproducible example of the unexpected behavior.

Please paste any relevant traceback/logs produced by the example provided.


Ax Version

latest (1.1.2)

Python Version

3.12

Operating System

ubuntu

(Optional) Describe any potential fixes you've considered to the issue outlined above.

No response

Pull Request

None

Code of Conduct

  • [x] I agree to follow Ax's Code of Conduct

chanansh, Nov 25 '25 08:11

Hello there! I ran the code you provided, but the issue doesn't reproduce for me (on Ax 1.2.1). I am able to get a trial (or even multiple) without issue.

I recommend:

  1. Upgrade Ax to 1.2.1
  2. Set the logger level to DEBUG to get more detail on what runs before the hang
  3. Run a debugger and step through the code to find where it is hanging
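
Steps 2 and 3 can be approximated without an interactive debugger. The sketch below uses only the standard library: it sets the `ax` logger (the parent of names such as `ax.api.client` in the log line above) to DEBUG and uses `faulthandler` to write every thread's stack to a file if the call runs long. `call_with_hang_watchdog` and the `hang_traceback.log` path are hypothetical names, and it is an assumption that lowering the logger level alone is enough; if Ax attaches handlers with their own level, those may need adjusting too.

```python
import faulthandler
import logging


def call_with_hang_watchdog(fn, timeout_s=120.0, dump_path="hang_traceback.log",
                            logger_name="ax"):
    """Call fn() with DEBUG logging; dump all thread stacks if it runs long."""
    logging.getLogger(logger_name).setLevel(logging.DEBUG)
    with open(dump_path, "w") as dump_file:
        # After timeout_s seconds, write the traceback of every thread to
        # dump_file, and again every timeout_s seconds until cancelled.
        # This only reports where the call is stuck; it does not stop it.
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=dump_file)
        try:
            return fn()
        finally:
            # Cancel before the file is closed by the with-block.
            faulthandler.cancel_dump_traceback_later()
```

With the `client` from the reproduction script, this would be called as `call_with_hang_watchdog(lambda: client.get_next_trials(max_trials=1))`. The stacks written to `hang_traceback.log` show which frames each thread of the main process is in while the call is stuck; child processes are not covered.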

Cesar-Cardoso, Nov 25 '25 22:11