
[Bug] GP model raises an error when fitting to my input data

Open ovr4 opened this issue 3 years ago • 6 comments

🐛 Bug

To reproduce

**Code snippet to reproduce**

fit_gpytorch_model(mll_ei)

**Stack trace/error message**

gpytorch.utils.errors.NotPSDError: Matrix not positive definite after repeatedly adding jitter up to 1e-5

Expected Behavior

Fitting the GPyTorch model to the data shouldn't raise an error.

Additional context

Again, this may not be a bug; I just want to understand why my model isn't fitting.

ovr4 avatar Sep 27 '22 23:09 ovr4

This can happen if the training data results in ill-conditioned covariances. Do you have a lot of repeated (or almost repeated) points? Did you normalize / standardize the training inputs / observations? Are you using a torch.double datatype?

It's hard to say what's going on without a repro / code sample.

Balandat avatar Sep 27 '22 23:09 Balandat
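
For reference, a minimal sketch of the kind of setup described above (illustrative data and names, not from this issue), using input normalization, outcome standardization, and torch.double:

```python
import torch
from botorch.models import SingleTaskGP
from botorch.models.transforms.input import Normalize
from botorch.models.transforms.outcome import Standardize
from botorch.fit import fit_gpytorch_model
from gpytorch.mlls import ExactMarginalLogLikelihood

# Illustrative training data; torch.double helps keep the covariance
# matrix well conditioned.
train_X = torch.rand(20, 3, dtype=torch.double)
train_Y = train_X.sum(dim=-1, keepdim=True) + 0.1 * torch.randn(20, 1, dtype=torch.double)

# Normalize inputs to the unit cube and standardize observations to
# zero mean / unit variance.
model = SingleTaskGP(
    train_X,
    train_Y,
    input_transform=Normalize(d=train_X.shape[-1]),
    outcome_transform=Standardize(m=1),
)
mll = ExactMarginalLogLikelihood(model.likelihood, model)
fit_gpytorch_model(mll)
```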

Hi!

Yes, I have almost repeated points after standardizing the data. I am not using the torch.double dtype. Can this also happen if the tensors we are training on are sparse?

ovr4 avatar Sep 28 '22 01:09 ovr4

> I am not using the torch.double dtype.

Using torch.double may take care of the issue; I would try that first. In general it's very hard to get GPs to work reliably with torch.float precision, since the covariance matrices often end up being quite poorly conditioned. Unless you have massive amounts of data, the slowdown from float -> double on the GPU won't be too bad (and they're pretty much the same on the CPU anyway).

> Can this also happen if the tensors we are training on are sparse?

What exactly do you mean by "sparse"? In the sense of only a few elements being nonzero? That shouldn't matter. If you mean in the sense that the training inputs differ only in a few features, then yes, this could be an issue (basically this would cause many points to be almost the same in most dimensions, so the distances between the different observations would be small).

Balandat avatar Sep 28 '22 01:09 Balandat
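
One rough way to check for near-duplicate points (an illustrative snippet, not from the thread) is to cast the inputs to torch.double and inspect the smallest pairwise distance:

```python
import torch

# Illustrative training inputs containing an almost-duplicated row.
train_X = torch.tensor(
    [[0.10, 0.20], [0.10, 0.20001], [0.90, 0.50]], dtype=torch.double
)

# Pairwise Euclidean distances between training points.
dists = torch.cdist(train_X, train_X)
dists.fill_diagonal_(float("inf"))  # ignore the zero diagonal
print("smallest pairwise distance:", dists.min().item())
# Very small values indicate (almost) repeated points, which tend to make
# the covariance matrix ill conditioned.
```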

Hi! Sorry for the late reply.

We tried switching to torch.double, but it didn't fix the issue. Perhaps that's because we are running our optimization on a CPU, so float dtypes are equivalent to double dtypes, as you said?

Also, since we are doing our BO on a categorical space, we do have a lot of training inputs that differ only in a few features. Is this an issue that can be addressed?

ovr4 avatar Oct 07 '22 19:10 ovr4

> Perhaps that's because we are running our optimization on a CPU, so float dtypes are equivalent to double dtypes, as you said?

Oh no, that is not the case. What I meant is that on a CPU the wall time of doing things in float is typically similar to doing them in double; the precision is very much different.

> Also, since we are doing our BO on a categorical space, we do have a lot of training inputs that differ only in a few features. Is this an issue that can be addressed?

Yeah that's likely the underlying issue causing the failures. Can you tell me more about the search space? If it's of decently high cardinality in the categorical variables you may want to use some specialized non-standard methods to handle this.

Balandat avatar Oct 07 '22 20:10 Balandat
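
To illustrate the precision gap (a generic example, not from the thread): float32 carries roughly 7 decimal digits versus roughly 16 for float64, which is exactly what decides whether a barely positive definite covariance survives a Cholesky factorization:

```python
import torch

# Machine epsilon differs by about nine orders of magnitude.
print(torch.finfo(torch.float32).eps)  # ~1.2e-07
print(torch.finfo(torch.float64).eps)  # ~2.2e-16

# A barely positive definite 2x2 "covariance": in float32 the off-diagonal
# rounds to exactly 1.0, the matrix becomes singular, and the Cholesky fails;
# in float64 it still succeeds.
A = torch.tensor([[1.0, 1.0 - 1e-8], [1.0 - 1e-8, 1.0]], dtype=torch.float64)
print(torch.linalg.cholesky_ex(A).info.item())                    # 0 -> factorization succeeded
print(torch.linalg.cholesky_ex(A.to(torch.float32)).info.item())  # > 0 -> "not positive definite"
```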

Got it. Ultimately, changing the dtype to double didn't resolve the issue. In fact, we got a different error:

RuntimeError: torch.linalg.eigh: the algorithm failed to converge; 2 off-diagonal elements of an intermediate tridiagonal form did not converge to zero.

Our search space is strictly categorical, with a cardinality of 3-8 for each categorical variable. We are using one-hot encoding (OHE) or one-dimensional numerical representations to describe each categorical variable's design choice.

ovr4 avatar Oct 07 '22 21:10 ovr4
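
For concreteness, a hypothetical example of the one-hot representation described above (the cardinalities and chosen indices here are made up):

```python
import torch
import torch.nn.functional as F

# Hypothetical categorical design: three variables with cardinalities 3, 4, and 8.
cardinalities = [3, 4, 8]
# One design point, given as the chosen category index for each variable.
choice = torch.tensor([2, 0, 5])

# One-hot encode each variable and concatenate into a single feature vector.
x = torch.cat(
    [F.one_hot(choice[i], num_classes=c).to(torch.double) for i, c in enumerate(cardinalities)]
)
print(x)  # length 3 + 4 + 8 = 15, entries in {0, 1}
```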

Just following up on this issue in case you guys have any ideas. I'd love to hear your thoughts.

ovr4 avatar Oct 20 '22 17:10 ovr4

If I understand this correctly, you will have several parameters that can only be 0 or 1 due to the use of one-hot encoding. I have seen this cause some numerical issues in the past because of lengthscales being pushed to very small values.

Two things that may be worth trying:

  1. Use an isotropic kernel: I assume you are using a Matern or maybe an RBF kernel, in which case you can specify ard_num_dims=None when creating your kernel.
  2. Place constraints on the lengthscales: You can try passing in a lengthscale constraint to your Matern/RBF kernel and see if that helps. You can for example pass in lengthscale_constraint=Interval(0.1, 100) or a similar constraint when creating your kernel.

dme65 avatar Oct 20 '22 19:10 dme65
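
A minimal sketch combining both suggestions (illustrative data; an isotropic Matern kernel with a lengthscale constraint, passed to a SingleTaskGP):

```python
import torch
from botorch.models import SingleTaskGP
from gpytorch.constraints import Interval
from gpytorch.kernels import MaternKernel, ScaleKernel

# Illustrative training data, e.g. one-hot encoded categorical inputs.
train_X = torch.rand(20, 15, dtype=torch.double)
train_Y = torch.randn(20, 1, dtype=torch.double)

# Isotropic Matern kernel (ard_num_dims=None -> a single shared lengthscale)
# with the lengthscale constrained to [0.1, 100].
covar_module = ScaleKernel(
    MaternKernel(
        nu=2.5,
        ard_num_dims=None,
        lengthscale_constraint=Interval(0.1, 100.0),
    )
)
model = SingleTaskGP(train_X, train_Y, covar_module=covar_module)
```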

Closing as inactive / explained, but please reopen with any further bugs, questions, or suggestions!

esantorella avatar Jan 30 '23 17:01 esantorella