
[Bug]: qKnowledgeGradient shape mismatch when used with SaasFullyBayesianSingleTaskGP

Open ayesharoxx opened this issue 5 months ago • 22 comments

What happened?

Hi, I'm trying to use qKnowledgeGradient with a fully Bayesian SAAS GP (SaasFullyBayesianSingleTaskGP) in BoTorch. I do this by writing a new class that inherits from both SaasFullyBayesianSingleTaskGP and FantasizeMixin and overriding its fantasize() method to define how fantasy data is generated for this model. I am using num_samples=256, warmup_steps=512, thinning=16, and num_fantasies=2. However, when I run the code I keep getting a shape mismatch error, even with raw_samples=1 and num_restarts=1:

RuntimeError: shape '[2, 1, 16, 1]' is invalid for input of size 64

I don't understand why I keep getting this error and where it is coming from. Any guidance on what might be causing this and how to properly structure the fantasy model in this context would be greatly appreciated!

Thanks in advance.

Please provide a minimal, reproducible example of the unexpected behavior.

The error occurs whenever I combine a SaasFullyBayesianSingleTaskGP with qKnowledgeGradient. I fit the SAAS model via fit_fully_bayesian_model_nuts and then construct qKnowledgeGradient(model=model, num_fantasies=2). Calling optimize_acqf with q=1, num_restarts=1, and raw_samples=1 leads to the runtime error above. Swapping out KG (e.g. using qExpectedImprovement instead) works without issue, so the problem seems isolated to the combination of qKnowledgeGradient and the fully Bayesian SAAS model.

Please paste any relevant traceback/logs produced by the example provided.

RuntimeError: shape '[2, 1, 16, 1]' is invalid for input of size 64

BoTorch Version

v0.14.0

Python Version

No response

Operating System

No response

(Optional) Describe any potential fixes you've considered to the issue outlined above.

No response

Pull Request

None

Code of Conduct

  • [x] I agree to follow BoTorch's Code of Conduct

ayesharoxx avatar Jul 24 '25 19:07 ayesharoxx

Hi @ayesharoxx, thanks for reaching out. fantasize and the underlying condition_on_observations are not well tested for all possible functionality in the fully Bayesian case, such as with KG. Could you perhaps provide a snippet that reproduces the issue you are encountering?

Notably, KG has 64 fantasies by default, so are you sure the attribute is passed correctly?
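For reference, you can rule out a mismatch by passing the argument explicitly and checking the stored value (a quick sketch, assuming `model` is your fitted SAAS GP):

from botorch.acquisition import qKnowledgeGradient

# num_fantasies defaults to 64 if not specified.
qkg = qKnowledgeGradient(model=model, num_fantasies=2)
print(qkg.num_fantasies)  # should print 2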

hvarfner avatar Jul 25 '25 14:07 hvarfner

Hi @hvarfner, Thanks so much for your reply. Sure, below is a minimal version of my code to reproduce the issue.

--- Objective function: Branin 2d (embedded in 100d) ---

import math
import numpy as np

# Lower and upper bounds for the 100-dimensional input domain
lb = np.hstack((-5 * np.ones(50), 0 * np.ones(50)))
ub = np.hstack((10 * np.ones(50), 15 * np.ones(50)))

def branin100(x):
    assert (x <= ub).all() and (x >= lb).all()
    # Only dimensions 19 and 64 affect the value of the function
    x1, x2 = x[19], x[64]
    t1 = x2 - 5.1 / (4 * math.pi ** 2) * x1 ** 2 + 5 / math.pi * x1 - 6
    t2 = 10 * (1 - 1 / (8 * math.pi)) * np.cos(x1)
    return t1 ** 2 + t2 + 10

--- SAAS GP model with fantasize override ---

class SaasFullyBayesianSingleTaskGPWithFantasy(
    SaasFullyBayesianSingleTaskGP, FantasizeMixin
):
    def fantasize(
        self,
        X: torch.Tensor,
        sampler: Optional[MCSampler] = None,
        num_fantasies: int = 2,
        **kwargs,
    ) -> Model:
        if sampler is None:
            sampler = SobolQMCNormalSampler(
                sample_shape=torch.Size([num_fantasies]),
                collapse_batch_dims=True,
            )
        X = torch.as_tensor(
            X,
            dtype=self.train_inputs[0].dtype,
            device=self.train_inputs[0].device,
        )
        return FantasizeMixin.fantasize(
            self,
            X,
            sampler=sampler,
            **kwargs,
        )

--- Run SAASBO with KG (minimal code) ---

def run_saasbo_botorch():
    torch.manual_seed(0)
    device = "cpu"
    dtype = torch.double
    d = 100
    n_init = 10  # number of initial Sobol points
    # Optimize over the unit cube; the module-level lb/ub map back to the Branin domain
    bounds = torch.stack([torch.zeros(d), torch.ones(d)]).to(dtype=dtype)

    # Generate initial data from scrambled Sobol samples
    sobol = SobolEngine(d, scramble=True, seed=0)
    X = sobol.draw(n_init).to(dtype=dtype, device=device)
    Y = torch.tensor(
        [branin100(lb + (ub - lb) * x.cpu().numpy()) for x in X],
        dtype=dtype,
        device=device,
    ).unsqueeze(-1)

    # Normalize Y
    train_Y = (Y - Y.mean()) / Y.std()

    # Fit SAAS GP
    model = SaasFullyBayesianSingleTaskGPWithFantasy(X, Y)
    fit_fully_bayesian_model_nuts(
        model, warmup_steps=512, num_samples=256, thinning=16
    )

    # Define posterior transform to average MCMC samples
    weights = torch.ones(2, dtype=dtype) / 2
    post_tf = ScalarizedPosteriorTransform(weights=weights)

    # Define KG
    qkg = qKnowledgeGradient(
        model=model,
        num_fantasies=2,
        current_value=train_Y.min(),
        posterior_transform=post_tf,
    )

    # Optimize acquisition
    candidate, _ = optimize_acqf(
        acq_function=qkg,
        bounds=bounds,
        q=1,
        raw_samples=1,
        num_restarts=1,
    )

run_saasbo_botorch()

ayesharoxx avatar Jul 25 '25 14:07 ayesharoxx

Thanks. On my end, I ran

class FantasizeSaasFullyBayesianSingleTaskGP(
    SaasFullyBayesianSingleTaskGP, FantasizeMixin
):
    pass

and set up KG the same way.

It seems like the issue is in AbstractFullyBayesianSingleTaskGP.condition_on_observations, which doesn't handle the three batch dimensions (fantasy_batch x raw_samples_batch x acq_batch) correctly. Having looked into this a while back, I don't think this is a quick fix, but that is where I would start, since fantasize() calls condition_on_observations() on a num_fantasies batch of samples.

If I were you, I would override condition_on_observations() and check what is going on in there.
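For instance, a minimal debugging override (a sketch only, reusing your SaasFullyBayesianSingleTaskGPWithFantasy class from above) could just log the incoming shapes before delegating:

class ShapeLoggingSaasGP(SaasFullyBayesianSingleTaskGPWithFantasy):
    """Hypothetical subclass that only logs shapes, then calls the parent."""

    def condition_on_observations(self, X, Y, **kwargs):
        print(
            "condition_on_observations -> X:", tuple(X.shape),
            "Y:", tuple(Y.shape),
            "model batch_shape:", tuple(self.batch_shape),
        )
        return super().condition_on_observations(X, Y, **kwargs)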

hvarfner avatar Jul 25 '25 14:07 hvarfner

Thank you for the suggestion! I overrode condition_on_observations() as you advised (code below), but I'm now running into the following error:

Output shape not equal to that of weights. Output shape is 1 and weights are torch.Size([64])

Here's my current implementation of condition_on_observations():

def condition_on_observations(self, X: torch.Tensor, Y: torch.Tensor, **kwargs):
    model_batch_shape = self.batch_shape
    model_batch_ndim = len(model_batch_shape)

    # Move the model (MCMC) batch dimensions of Y to the front, keeping the
    # remaining batch dimensions and the trailing (q, m) dimensions in order.
    start_idx = Y.ndim - (2 + model_batch_ndim)
    model_indices = list(range(start_idx, start_idx + model_batch_ndim))
    extra_indices = list(range(0, start_idx))
    remaining_indices = list(range(start_idx + model_batch_ndim, Y.ndim - 2))
    permute_order = (
        model_indices + extra_indices + remaining_indices + [Y.ndim - 2, Y.ndim - 1]
    )
    Y_perm = Y.permute(*permute_order).contiguous()

    # Prepend a batch dimension to X if it does not already carry the model batch.
    if X.ndim == model_batch_ndim + 2 and X.shape[0] != model_batch_shape[0]:
        X = X.unsqueeze(0)
    if X.ndim == 2 and model_batch_ndim == 1:
        X = X.unsqueeze(0)  # -> [1, q, d]

    # Insert singleton dimensions so X can broadcast against Y's extra batch dims.
    num_extra_dims = len(extra_indices) + len(remaining_indices)
    for _ in range(num_extra_dims - 1):
        X = X.unsqueeze(model_batch_ndim)

    # Expand X across all of Y's batch dimensions.
    target_shape = model_batch_shape + tuple(Y_perm.shape[1:-2]) + X.shape[-2:]
    X_expanded = X.expand(*target_shape).contiguous()
    print(X_expanded.shape)

    # Flatten the non-model batch dimensions into a single n' dimension.
    X_flat = X_expanded.reshape(*model_batch_shape, -1, X_expanded.shape[-1])
    Y_flat = Y_perm.reshape(*model_batch_shape, -1, Y_perm.shape[-1])
    print(X_flat.shape, Y_flat.shape)

    return super().condition_on_observations(X_flat, Y_flat, **kwargs)

Also, if I set raw_samples > 1, I get the error:

Expected the output shape to match either the t-batch shape of X, or the model.batch_shape in the case of acquisition functions using batch models; but got output with shape torch.Size([]) for X with shape torch.Size([raw_samples, (q + num_fantasies), d])

ayesharoxx avatar Jul 25 '25 22:07 ayesharoxx

Hi, any help or guidance on fixing the above issue would be greatly appreciated!

ayesharoxx avatar Jul 26 '25 21:07 ayesharoxx

Hi @ayesharoxx ,

It is difficult to say from the code that has been provided, but I would generally advise against permutes, since I don't find them to be needed in most cases. It is better to preserve the order of the batch dimensions and perform the required computation on each batch.

This may ultimately require a more comprehensive change to work long term across a broad range of use cases.

hvarfner avatar Jul 28 '25 08:07 hvarfner

Hi @hvarfner, thank you for the suggestion, I'll look into it. Just one more thing I wanted to ask: I managed to get my qKnowledgeGradient code working, but I'm hitting a memory error when I increase raw_samples beyond 5. It works fine with raw_samples=5 and num_fantasies=64, but as soon as I increase raw_samples (e.g. to 10 or more), I get a "Cannot allocate memory" error.

Do you think this might be an issue with my code/setup, or is it an inherent limitation of how qKnowledgeGradient scales? Please advise.

ayesharoxx avatar Jul 29 '25 11:07 ayesharoxx

To be precise, with raw_samples=20, num_fantasies=32, q=1 (in optimize_acqf), num_samples=256, thinning=16, and num_restarts=3, I get the following error (with some print statements for debugging):

X shape: torch.Size([10, 100]), train_Y shape: torch.Size([10, 1])
[KG] Input X shape: torch.Size([20, 33, 100])
[KG] Memory (MB): 554.7265625
[KG] X_actual shape: torch.Size([20, 1, 100])
[KG] X_fantasies (before reshape) shape: torch.Size([20, 32, 100])
[KG] X_fantasies reshaped: torch.Size([32, 20, 1, 100])
[KG] Fantasy model batch shape: torch.Size([16])
[KG] Memory after fantasize (MB): 650.95703125
!!! Error during value_function() X_fantasies shape: torch.Size([32, 20, 1, 100])

WARNING: [enforce fail at alloc_cpu.cpp:119] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 34611200000 bytes. Error code 12 (Cannot allocate memory).

I can provide my full code if needed.

ayesharoxx avatar Jul 29 '25 12:07 ayesharoxx

Hi @ayesharoxx ,

That is almost certainly an issue with the code! A ~35 GB allocation seems excessive.

hvarfner avatar Jul 30 '25 13:07 hvarfner

The memory issue might be related to https://github.com/pytorch/botorch/issues/2310. Due to some odd broadcasting within PyTorch / GPyTorch, the ensemble models can consume excessive amounts of memory when evaluated with large batches (which you end up with due to fantasies).

saitcakmak avatar Jul 30 '25 14:07 saitcakmak

Thank you so much @saitcakmak for clearing up my confusion; I got some useful insights from that post. However, I am now encountering another memory error in matern_kernel.py (when increasing raw_samples to 50) at this line:

x2_ = (x2 - mean).div(self.lengthscale)

Here are my x1, mean, and x2 shapes at this point:

x1.shape = torch.Size([32, 50, 16, 1, 100])
mean.shape = torch.Size([32, 50, 16, 1, 100])
x2.shape = torch.Size([32, 50, 16, 1611, 100])

I am using qKnowledgeGradient in optimize_acqf with raw_samples=50, q=1, num_samples=256 (because I am using SaasFullyBayesianSingleTaskGP), and thinning=16.

The error I get is:

line 101, in forward
    x2_ = (x2 - mean).div(self.lengthscale)
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 32993280000 bytes.

Is there any way to fix this? Thanks!

ayesharoxx avatar Aug 01 '25 01:08 ayesharoxx

The tensor shapes in that operation, particularly the shape of x2, are pretty large. Do you know why the -2 / q-batch dimension is 1611? I am guessing it is the same dimension as the X you're evaluating the acquisition function with, plus the number of training points the model has. If you have a lot of training data, a fully Bayesian model may not be appropriate, since it will consume a lot more memory than a standard GP.
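For reference, that single x2-sized intermediate accounts for the whole reported allocation (a quick back-of-the-envelope check, assuming double precision):

# fantasies x raw_samples x mcmc x (train + flattened fantasies + candidate) x d
numel = 32 * 50 * 16 * 1611 * 100
print(numel * 8)  # 32993280000 bytes in float64 -- the exact figure in your error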

You can profile the offending operation like this:

import torch
from torch.profiler import profile, ProfilerActivity

mean = torch.rand(32, 50, 16, 1, 100)
x2 = torch.rand(32, 50, 16, 1611, 100, requires_grad=True)
lengthscale = torch.rand(16, 1, 100)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True, profile_memory=True, with_stack=True) as prof:
    x_ = (x2 - mean).div(lengthscale)

print(prof.key_averages(group_by_input_shape=True).table())

which outputs

-------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------------------------------------------------  
         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls                                             Input Shapes  
-------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------------------------------------------------  
    aten::sub        31.25%     223.996ms        31.25%     223.996ms     223.996ms      15.36 GB      15.36 GB             1      [[32, 50, 16, 1611, 100], [32, 50, 16, 1, 100], []]  
    aten::div        68.75%     492.718ms        68.75%     492.718ms     492.718ms      15.36 GB      15.36 GB             1                  [[32, 50, 16, 1611, 100], [16, 1, 100]]  
     [memory]         0.00%       0.000us         0.00%       0.000us       0.000us     -30.73 GB     -30.73 GB             2                                                       []  
-------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------------------------------------------------  
Self CPU time total: 716.714ms

Is there any way to fix this?

There isn't a way to "fix" this necessarily, as in nothing is broken in the code. It is just trying to do a very expensive operation. The way to reduce the memory usage would be to use smaller input tensors while evaluating the model or use a cheaper model.

saitcakmak avatar Aug 01 '25 14:08 saitcakmak

Hi @saitcakmak , thank you for your reply. The dimension 1611 arises because I consider 32 fantasy scenarios for each of the 50 candidate points (i.e. raw samples), which results in 32 × 50 = 1600 fantasy points. In my implementation of condition_on_observations, I flatten these fantasy points into a single training set dimension. Then, these 1600 fantasy points are combined with the original 10 training points and the single candidate point. This results in a total of 1600 + 10 + 1 = 1611 training points.

ayesharoxx avatar Aug 01 '25 19:08 ayesharoxx

32 × 50 = 1600 fantasy points

I am confused about why these 1600 points end up in the q-batch (-2) dimension. When we look at the tensor shape [32, 50, 16, 1611, 100], the 32 is the number of fantasies, 50 is the 50 candidate points, 16 is the model batch dimension, and 100 is the problem dimension. The number of fantasies x candidate points (32 x 50) is already present in the tensor as added batch dimensions. These batch dimensions get added during the model.fantasize operation; you shouldn't be repeating them again in the q-batch dimension.
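For comparison, this is how those batch dimensions come out for a plain SingleTaskGP (a small sketch with made-up data; a fully Bayesian model would additionally carry the 16-sample MCMC batch dimension):

import torch
from botorch.models import SingleTaskGP
from botorch.sampling.normal import SobolQMCNormalSampler

train_X = torch.rand(10, 100, dtype=torch.double)
train_Y = torch.rand(10, 1, dtype=torch.double)
model = SingleTaskGP(train_X, train_Y)

X_cand = torch.rand(50, 1, 100, dtype=torch.double)  # raw_samples x q x d
sampler = SobolQMCNormalSampler(sample_shape=torch.Size([32]))  # 32 fantasies
fmodel = model.fantasize(X_cand, sampler=sampler)

# The fantasy and candidate counts become *batch* dimensions of the fantasy model;
# they are not flattened into the training-point (-2) dimension.
print(fmodel.batch_shape)            # torch.Size([32, 50])
print(fmodel.train_inputs[0].shape)  # torch.Size([32, 50, 11, 100])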

saitcakmak avatar Aug 04 '25 15:08 saitcakmak

@saitcakmak Yes, I realised I was mishandling the shapes in condition_on_observations. The initial shapes I receive in condition_on_observations are:

X.shape = [raw_samples, 1, d]
Y.shape = [num_fantasies, raw_samples, mcmc, q, m]

I flattened 32 x 50 into the single n' dimension (making it 1600) to match the shape [batch_shape x n' x d], so I reshaped X to [mcmc_samples, num_fantasies x raw_samples, d]. I understand this is wrong, and I have instead changed X to the shape [num_fantasies, raw_samples, mcmc, q, d].

The code, however, only runs if I override the forward method of qKnowledgeGradient and remove the t_batch_mode_transform decorator; otherwise I get the following error from the acquisition function's shape checking:

Expected the output shape to match either the t-batch shape of X, or the model.batch_shape in the case of acquisition functions using batch models; but got output with shape torch.Size([]) for X with shape torch.Size([50 (raw_samples), 33 (q_aug = 32 fantasies + 1 q), 100 (d)])

This happens because in the forward method of qKnowledgeGradient, when the value function is called on X_fantasies, the values tensor has shape [32 (fantasies), 50 (raw_samples)], and forward returns values.mean(dim=0), making it [50]. So inside the t_batch_mode_transform decorator the output shape before the mean is [50], which becomes a scalar (torch.Size([])) after the mean, and therefore matches neither the batch shape of X (50) nor model.batch_shape (16). I'm not sure what to do about this, or whether removing the decorator is safe or could break the underlying logic.

ayesharoxx avatar Aug 05 '25 14:08 ayesharoxx

After writing this, I noticed that I was looking at the latest code rather than the stable 0.14.0. @average_over_ensemble_models used to be part of @t_batch_mode_transform; it is now a separate decorator.

This sounds like an issue with the value function here. Rather than [fantasies, raw_samples], the value function needs to return [fantasies, raw_samples, mcmc] for the rest of the qKnowledgeGradient code to produce the expected [raw_samples]-shaped output. The right move here would be to use a value function that does not apply the @average_over_ensemble_models decorator to its forward method. Alternatively, you could just remove the @average_over_ensemble_models decorator from qKnowledgeGradient, as you suggested. The code seems like it would produce equivalent results in either case.
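As a rough illustration of the first option (an untested sketch only; it assumes the fully Bayesian posterior mean carries trailing mcmc x q x m dimensions with q = m = 1, and you would still have to wire this class into qKnowledgeGradient in place of the default PosteriorMean):

import torch
from botorch.acquisition.analytic import PosteriorMean


class EnsemblePosteriorMean(PosteriorMean):
    """Hypothetical PosteriorMean variant whose forward is not wrapped in an
    ensemble-averaging decorator, so the per-MCMC-sample dimension survives."""

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        posterior = self.model.posterior(
            X=X, posterior_transform=self.posterior_transform
        )
        mean = posterior.mean  # ... x mcmc x q x m for the fully Bayesian model
        return mean.squeeze(-1).squeeze(-1)  # -> ... x mcmc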

saitcakmak avatar Aug 05 '25 17:08 saitcakmak

@saitcakmak Thank you for the suggestions. I think the value function, which is PosteriorMean in my case, does return [fantasies, raw_samples, mcmc], but it also has the @t_batch_mode_transform decorator, which does output.mean(dim=-1) and reduces it to [fantasies, raw_samples]. So the value shape received in qKnowledgeGradient becomes [fantasies, raw_samples], which then returns [raw_samples] before calling the decorator.

I am using BoTorch version 0.14.0, and I can only see the t_batch decorator in the respective files, not @average_over_ensemble_models as you mentioned, so is it okay to just remove the t_batch decorator?

ayesharoxx avatar Aug 05 '25 19:08 ayesharoxx

Yes, it should be ok to remove the t-batch decorator

saitcakmak avatar Aug 05 '25 19:08 saitcakmak

Okay, thank you so much for your help!

ayesharoxx avatar Aug 05 '25 19:08 ayesharoxx

Hi, thank you again for your help, I really appreciate it. I had another quick question regarding input_batch_shape in get_fantasy_model() (inside ExactGP, as called by condition_on_observations).

In my case:

inputs[0].shape = torch.Size([num_fantasies, raw_samples, mcmc_samples, q, d])

train_inputs[0].shape = torch.Size([num_fantasies, raw_samples, mcmc_samples, n, d])

Since input_batch_shape = inputs[0].shape[:-2], the input_batch_shape becomes: torch.Size([num_fantasies, raw_samples, mcmc_samples])

Wouldn't this batch shape be too large? Or is this actually the expected behavior? I ask because if I try a higher number of raw_samples and q (e.g. 5000 and 5), I run into memory issues again.

ayesharoxx avatar Aug 07 '25 01:08 ayesharoxx

Yes, that'd be the correct input batch shape. You're adding additional batch dimensions (due to fantasies) to an already batched model (fully Bayesian), so it is natural to run into memory issues when evaluating it on large inputs. The way to get around this is to use batch limits in the optimizer: in optimize_acqf, you can pass something like options={"init_batch_limit": 128, "batch_limit": 20} to limit the batch size of the tensors used to evaluate the acquisition function at any given time (it will loop over the full tensor and evaluate it in chunks). init_batch_limit applies to raw_samples and batch_limit applies to num_restarts. The num_restarts part is more memory intensive, since it also needs to compute gradients, which is not necessary when evaluating the raw_samples.
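Concretely, something like this (a sketch reusing the qkg and bounds names from your earlier snippet; tune the limits to whatever fits in memory):

from botorch.optim import optimize_acqf

candidate, value = optimize_acqf(
    acq_function=qkg,
    bounds=bounds,
    q=5,
    num_restarts=3,
    raw_samples=5000,
    # Evaluate the raw samples in chunks of 128 and the restarts in chunks of 20.
    options={"init_batch_limit": 128, "batch_limit": 20},
)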

saitcakmak avatar Aug 08 '25 13:08 saitcakmak

That works, thank you!

ayesharoxx avatar Aug 09 '25 18:08 ayesharoxx