
Fully Bayesian Multi-Objective Optimization using qNEHVI + SAASBO sample code saasbo_nehvi.py crashes

Ted12345678 opened this issue 3 years ago

I set N_BATCH = 100.

After "Iteration: 22, HV: 0.3957313411292245", the HV shows no further improvement.

It crashed at Iteration 58 with the following error: "RuntimeError: mean shape torch.Size([16, 258, 2]) is incompatible with covariance shape torch.Size([16, 8256, 8256])".

Actually, if I run other MO functions like ZDT1 with this code, I see similar issues: the HV does not improve, and it eventually crashes.

I am running Python 3.9.6 and Ax 0.2.5.1 on RHEL 7.4.

Any ideas?

Ted12345678 avatar Jun 02 '22 05:06 Ted12345678

What exactly is the "saasbo_nehvi.py" code you are running? Can you share a reproducible example?

@dme65, @sdaulton

Balandat avatar Jun 02 '22 12:06 Balandat

I downloaded the code from https://ax.dev/files//saasbo_nehvi.py on the Ax tutorials page.

I increased N_BATCH to 100, hoping the HV could improve with more iterations.

saasbo_nehvi.zip

Thanks.

Ted12345678 avatar Jun 02 '22 16:06 Ted12345678

I tried this but unfortunately the GPU ran out of memory after ~45 iterations. SAASBO + MOO can be quite expensive, especially if the Pareto Frontier of the function consists of many points.

Actually, if I run other MO functions like ZDT1 with this code, I see similar issues: the HV does not improve, and it eventually crashes.

Are these crashes due to the same shape mismatch error that you pointed to above?

Balandat avatar Jun 05 '22 15:06 Balandat

Thanks, Balandat, for trying it. My 8 GB GPU is not enough either, so I run it on the CPU.

Yes, so far the crashes I encountered were all from the shape mismatch error.

Another issue is that the HV usually doesn't improve after the first iteration.

Ted12345678 avatar Jun 05 '22 23:06 Ted12345678

I ran 5 trials in a row with 50 iterations per trial, but the HV indicator showed no improvement after the first iteration. For example, the result from one of the 5 trials:

Iteration: 0, HV: 119.90011221307591
Iteration: 1, HV: 119.90011221307591
Iteration: 2, HV: 119.90011221307591
Iteration: 3, HV: 119.90011221307591
Iteration: 4, HV: 119.90011221307591
Iteration: 5, HV: 119.90011221307591
...
Iteration: 49, HV: 119.90011221307591

Do you have any tips on how to make the HV indicator move in SAASMOO? For this ZDT1 case, the max_hv is 120.666.

Thanks.

Ted12345678 avatar Jun 29 '22 00:06 Ted12345678

This bug was addressed; @dme65 will provide more details and close this.

lena-kashtelyan avatar Dec 06 '22 20:12 lena-kashtelyan

It crashed at Iteration 58, with the following error: "RuntimeError: mean shape torch.Size([16, 258, 2]) is incompatible with covariance shape torch.Size([16, 8256, 8256])"

The issue isn't fixed, but we know why it happens! It is caused by a very complicated permutation bug in GPyTorch's lazy tensors that affects our fully Bayesian models when GPyTorch switches from eager mode to approximate computations. The easiest solution for now is to force GPyTorch to always use eager mode in your optimization loop, which can be accomplished by doing something like this:

import gpytorch

# MAX_EAGER_KERNEL_SIZE should be large enough that GPyTorch stays in
# eager mode for your kernels.
with gpytorch.settings.fast_computations(
    log_prob=False, covar_root_decomposition=False, solves=False
), gpytorch.settings.max_eager_kernel_size(
    MAX_EAGER_KERNEL_SIZE
):
    ...  # generate new trials here
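
For context, a minimal sketch of how this would wrap the candidate-generation step of the tutorial's optimization loop (fit_and_generate is a hypothetical placeholder for the model-fitting + gen call in saasbo_nehvi.py, not a real Ax function):

import gpytorch

for _ in range(N_BATCH):
    # Keep GPyTorch in eager mode while the fully Bayesian model is fit
    # and new candidates are generated.
    with gpytorch.settings.fast_computations(
        log_prob=False, covar_root_decomposition=False, solves=False
    ), gpytorch.settings.max_eager_kernel_size(MAX_EAGER_KERNEL_SIZE):
        generator_run = fit_and_generate()  # hypothetical model fit + gen step
    # evaluating the new trials can happen outside the settings context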

dme65 avatar Dec 06 '22 22:12 dme65

Thank you for chiming in, @dme65! I'll keep this open for now then until the upstream issue is resolved. Do you have a sense of when that might be?

lena-kashtelyan avatar Dec 07 '22 17:12 lena-kashtelyan

I think @Balandat has spent a lot of time trying to fix the underlying GPyTorch bug. I personally think we should turn off the approximate computations in GPyTorch since this isn't really something we rely on in Ax and doing so will fix this issue. I think @saitcakmak feels similarly.

dme65 avatar Dec 07 '22 17:12 dme65

Thanks @lena-kashtelyan and @dme65 for the feedback. I tried the workaround from @dme65 and put the optimization loop under the following code:

with gpytorch.settings.fast_computations(log_prob=False, covar_root_decomposition=False, solves=False), gpytorch.settings.max_eager_kernel_size(512):

I still got the same crash: RuntimeError: mean shape torch.Size([16, 258, 2]) is incompatible with covariance shape torch.Size([16, 8256, 8256])

Ted12345678 avatar Dec 09 '22 19:12 Ted12345678

Hi @Ted12345678. You should use a larger max_eager_kernel_size (I tend to just use float("inf") :) ). Your current kernel size is 258 x 2 = 516 > 512, so the setting doesn't really do anything.
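
Concretely, a sketch of the same wrapper with an unbounded eager kernel size (assumes the context managers from the earlier comment):

import gpytorch

# float("inf") means GPyTorch never switches to approximate computations,
# no matter how large the joint kernel grows.
with gpytorch.settings.fast_computations(
    log_prob=False, covar_root_decomposition=False, solves=False
), gpytorch.settings.max_eager_kernel_size(float("inf")):
    ...  # optimization loop / candidate generation here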

saitcakmak avatar Dec 09 '22 20:12 saitcakmak

For a broader fix, https://github.com/pytorch/botorch/pull/1547 proposes to turn off fast computations and increase the max_cholesky_size & max_eager_kernel_size to 4096 by default in BoTorch (thus also in Ax).
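
Until that lands, a sketch of roughly equivalent manual settings (the 4096 values come from the PR description above; where to apply the context managers is up to you):

import gpytorch

# Rough manual equivalent of the proposed BoTorch defaults: disable fast
# (approximate) computations and use exact/eager paths up to size 4096.
with gpytorch.settings.fast_computations(
    log_prob=False, covar_root_decomposition=False, solves=False
), gpytorch.settings.max_cholesky_size(4096), gpytorch.settings.max_eager_kernel_size(4096):
    ...  # run model fitting and candidate generation here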

saitcakmak avatar Dec 09 '22 20:12 saitcakmak

Thanks, @saitcakmak, the crash is gone using float('inf').

Ted12345678 avatar Dec 12 '22 18:12 Ted12345678

This was fixed in https://github.com/pytorch/botorch/pull/1547

saitcakmak avatar Jul 25 '23 19:07 saitcakmak