[Bug] Different slicing behavior of the mvns calculation in the gpytorch.posterior method when the model's training data size is 200 vs. 600

Open Leon924 opened this issue 2 years ago • 6 comments

🐛 Bug

To reproduce

** Code snippet to reproduce **

Sorry, the code repo is too large to share.

I am using a DKL-GP model, which just adds a few NN layers before the GP layer. The DKL-GP model inherits from BatchedMultiOutputGPyTorchModel and ExactGP.
The error below occurs when I feed training data of size 600x19 into the model, but when I change the training data size to 200x19, it disappears. I don't know what's going on here.
The only difference between the two runs is the training data size: one is 200x19 and the other is 600x19.

** Stack trace/error message **

Traceback (most recent call last):
  File "/export1/Workspace/liqiang/tool/anaconda3/envs/socgen-nextorch/lib/python3.7/site-packages/botorch/optim/optimize.py", line 613, in optimize_acqf_discrete
    max_batch_size=max_batch_size,
  File "/export1/Workspace/liqiang/tool/anaconda3/envs/socgen-nextorch/lib/python3.7/site-packages/botorch/optim/optimize.py", line 646, in _split_batch_eval_acqf
    return torch.cat([acq_function(X_) for X_ in X.split(max_batch_size)])
  File "/export1/Workspace/liqiang/tool/anaconda3/envs/socgen-nextorch/lib/python3.7/site-packages/botorch/optim/optimize.py", line 646, in <listcomp>
    return torch.cat([acq_function(X_) for X_ in X.split(max_batch_size)])
  File "/export1/Workspace/liqiang/tool/anaconda3/envs/socgen-nextorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/export1/Workspace/liqiang/tool/anaconda3/envs/socgen-nextorch/lib/python3.7/site-packages/botorch/utils/transforms.py", line 301, in decorated
    return method(cls, X, **kwargs)
  File "/export1/Workspace/liqiang/tool/anaconda3/envs/socgen-nextorch/lib/python3.7/site-packages/botorch/utils/transforms.py", line 258, in decorated
    output = method(acqf, X, *args, **kwargs)
  File "/export1/Workspace/liqiang/tool/anaconda3/envs/socgen-nextorch/lib/python3.7/site-packages/botorch/acquisition/multi_objective/monte_carlo.py", line 336, in forward
    posterior = self.model.posterior(X)
  File "/export1/Workspace/liqiang/tool/anaconda3/envs/socgen-nextorch/lib/python3.7/site-packages/botorch/models/gpytorch.py", line 365, in posterior
    for t in output_indices
  File "/export1/Workspace/liqiang/tool/anaconda3/envs/socgen-nextorch/lib/python3.7/site-packages/botorch/models/gpytorch.py", line 365, in <listcomp>
    for t in output_indices
  File "/export1/Workspace/liqiang/tool/anaconda3/envs/socgen-nextorch/lib/python3.7/site-packages/gpytorch/lazy/lazy_tensor.py", line 2201, in __getitem__
    res = self._getitem(row_index, col_index, *batch_indices)
  File "/export1/Workspace/liqiang/tool/anaconda3/envs/socgen-nextorch/lib/python3.7/site-packages/gpytorch/lazy/sum_lazy_tensor.py", line 36, in _getitem
    results = [lazy_tensor._getitem(row_index, col_index, *batch_indices) for lazy_tensor in self.lazy_tensors]
  File "/export1/Workspace/liqiang/tool/anaconda3/envs/socgen-nextorch/lib/python3.7/site-packages/gpytorch/lazy/sum_lazy_tensor.py", line 36, in <listcomp>
    results = [lazy_tensor._getitem(row_index, col_index, *batch_indices) for lazy_tensor in self.lazy_tensors]
  File "/export1/Workspace/liqiang/tool/anaconda3/envs/socgen-nextorch/lib/python3.7/site-packages/gpytorch/lazy/lazy_evaluated_kernel_tensor.py", line 140, in _getitem
    new_kernel = self.kernel.__getitem__(batch_indices)
  File "/export1/Workspace/liqiang/tool/anaconda3/envs/socgen-nextorch/lib/python3.7/site-packages/gpytorch/kernels/kernel.py", line 434, in __getitem__
    new_kernel._parameters[param_name].data = param.__getitem__(index)
IndexError: too many indices for tensor of dimension 1

Expected Behavior

System information

Please complete the following information:

  • BoTorch Version 0.6.4
  • GPyTorch Version 1.6.0
  • PyTorch Version 1.11.0+cpu
  • Computer OS: RHEL 6.10

Additional context

In the 600x19 training-data case, I use optimize_acqf_discrete to find the next optimal point, and execution reaches the posterior calculation in the forward method here: https://github.com/pytorch/botorch/blob/a675968d1a64849938ec935e4a619bf984a33637/botorch/acquisition/multi_objective/monte_carlo.py#L335-L338 When it reaches L360, I found that the lazy_covariance_matrix of the mvn is a NonLazyTensor in the 200x19 case but a SumLazyTensor in the 600x19 case. I don't know the difference between these two classes. Is this what causes the error? And why does the IndexError happen only in the 600x19 case?

https://github.com/pytorch/botorch/blob/a675968d1a64849938ec935e4a619bf984a33637/botorch/models/gpytorch.py#L356-L366
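
For reference, the error message in the last frame of the stack trace can be reproduced in isolation. This is only a sketch of how the message arises, not a reproduction of the GPyTorch bug: `param` and `index` are hypothetical stand-ins, and the shapes inside the kernel are assumptions.

```python
import torch

# Hypothetical stand-in for a kernel hyperparameter with a single dimension.
param = torch.nn.Parameter(torch.zeros(2))

# Hypothetical index with two entries, one more than param has dimensions.
index = (slice(None), 0)

try:
    param[index]
    message = None
except IndexError as err:
    # Indexing a 1-D tensor with two indices raises IndexError.
    message = str(err)

print(message)
```

Indexing a parameter with more indices than it has dimensions is what `param.__getitem__(index)` in the trace complains about.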

Leon924 avatar Jul 07 '22 13:07 Leon924

Thanks for flagging this. I believe this could be an indexing bug on the gpytorch side of things, related to lazy kernel evaluation. To check this, can you try wrapping your code in the following context:

with gpytorch.settings.max_eager_kernel_size(1000):
    ...

Here 1000 is the number of data points up to which kernels are evaluated eagerly. If this works in the 600x19 case, then we know that this is indeed the issue.

Also, what botorch and gpytorch versions are you using?

Balandat avatar Jul 07 '22 14:07 Balandat

Thanks for the quick response! I will try changing this setting and report the results later. Can I just change it directly here? https://github.com/cornellius-gp/gpytorch/blob/45e560c3417cb970c3a402f8a1a92f87b733e470/gpytorch/settings.py#L401-L409

BoTorch Version 0.6.4 GPyTorch Version 1.6.0

Leon924 avatar Jul 07 '22 14:07 Leon924

It works! I directly changed https://github.com/cornellius-gp/gpytorch/blob/45e560c3417cb970c3a402f8a1a92f87b733e470/gpytorch/settings.py#L409 from 512 to 1000, and the IndexError disappeared.

Leon924 avatar Jul 07 '22 14:07 Leon924

Yeah, so it's an indexing issue with lazy tensors then; we'll need to investigate this on the gpytorch end.

If you change https://github.com/cornellius-gp/gpytorch/blob/45e560c3417cb970c3a402f8a1a92f87b733e470/gpytorch/settings.py#L409 you're just modifying the defaults. With the context manager, you can make the same change in a local context without touching the gpytorch code.

Balandat avatar Jul 09 '22 04:07 Balandat

Got it! Thank you so much.

Closing this issue now.

Leon924 avatar Jul 09 '22 06:07 Leon924

Reopening this until the issue is fixed upstream.

Balandat avatar Aug 07 '22 14:08 Balandat

This should be fixed. See https://github.com/pytorch/botorch/issues/1291

esantorella avatar May 08 '23 14:05 esantorella

Not sure this is "fixed" rather than being avoided - I think the underlying issue is still the long-standing gpytorch issue https://github.com/cornellius-gp/gpytorch/issues/1853. But should be ok to close this BoTorch issue here.

Balandat avatar May 08 '23 14:05 Balandat