NotPSDError: Matrix not positive definite after repeatedly adding jitter up to 1.0e-06 when running on GPU
I get the following error:

NotPSDError: Matrix not positive definite after repeatedly adding jitter up to 1.0e-06

The code below works when running on CPU, but not when I switch to GPU. Why does it only work on CPU and not on GPU?
I am training a DGP for regression using PyTorch Lightning, constructed as shown below. The input dimension to the first DGP layer is 256:
import torch
import gpytorch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, random_split
from pytorch_lightning.loggers import TensorBoardLogger
from gpytorch.mlls import DeepApproximateMLL, VariationalELBO

# DeepGP is my own two-layer DGP model class and load_dataset returns my
# regression dataset (both defined elsewhere, not shown here).

class PL_model(pl.LightningModule):
    def __init__(self,
                 batch_size,
                 lr,
                 betas,
                 num_samples,
                 num_output_dims
                 ):
        super().__init__()
        # Training parameters
        self.batch_size = batch_size
        self.lr = lr
        self.betas = betas
        self.num_samples = num_samples
        self.num_output_dims = num_output_dims
        self.gpmodel = DeepGP(256, self.num_output_dims)
        self.mll = DeepApproximateMLL(VariationalELBO(self.gpmodel.likelihood, self.gpmodel, self.batch_size))

    def forward(self, x):
        # compute prediction
        return self.gpmodel(x)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr, betas=self.betas, weight_decay=1e-3)

    def training_step(self, batch, batch_idx):
        x, y = batch
        with gpytorch.settings.num_likelihood_samples(self.num_samples):
            output = self(x)
            loss = -self.mll(output, y)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        with torch.no_grad():
            lls = self.gpmodel.likelihood.log_marginal(y, self(x))
        return -lls

    def setup(self, stage=None):
        dataset = load_dataset()
        train_split = int(0.8 * len(dataset))
        val_split = len(dataset) - train_split
        self.train_set, self.val_set = random_split(dataset, [train_split, val_split])

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=self.batch_size, shuffle=False)


model = PL_model(
    batch_size=32,
    lr=0.1,
    betas=(0.85, 0.89),
    num_samples=3,
    num_output_dims=2
)

trainer = pl.Trainer(
    min_epochs=5,
    max_epochs=8,
    gpus=1,
    logger=TensorBoardLogger("lightning_logs/", name="DGP")
)

trainer.fit(model)
I got the same error, did you fix it? For me it was actually the other way around: the model worked fine on GPU but not on CPU...
No, not yet unfortunately. I think I might just skip PL for now.
There are slight differences in GPU versus CPU routines, and it's possible that the roundoff errors go in "different directions" for different hardware. CUDA even has some stochastic routines which could be causing the difference.
The real issue here is that your model/data is likely not very stable, and so you are on the cusp of numerical errors on either hardware. Try z-scoring your data, and see if that makes the model more stable.
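For reference, a minimal sketch of what z-scoring the data could look like here (train_x, train_y, and val_x are placeholder tensor names, not from the original post; statistics are computed on the training split only and reused for validation/test data):

import torch

def zscore(x, mean=None, std=None):
    # Compute per-feature statistics if none are supplied (i.e. on the training set).
    if mean is None:
        mean = x.mean(dim=0, keepdim=True)
        std = x.std(dim=0, keepdim=True).clamp_min(1e-8)  # guard against zero variance
    return (x - mean) / std, mean, std

# Standardize inputs and targets using training-set statistics only.
train_x_n, x_mean, x_std = zscore(train_x)
train_y_n, y_mean, y_std = zscore(train_y)

# Validation/test inputs must be scaled with the *training* statistics.
val_x_n, _, _ = zscore(val_x, x_mean, x_std)

Keeping the inputs and targets roughly zero-mean and unit-variance tends to keep the kernel matrices better conditioned, which is what the jitter failure is complaining about.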
@gpleiss I got the error while using the VNNGP at prediction time. My data is already normalized.
@mbelalsh it happens - it's a known stability issue with Gaussian processes. It is a property of your data, and the fact that all computations are done in single precision.
Try switching to double precision, or using smaller learning rates on your optimizer.
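As a rough sketch of those two suggestions, reusing the PL_model from the post above (lr=0.01 is just an example of a smaller learning rate; precision=64 is Lightning's double-precision flag, assuming your Lightning version supports it):

model = PL_model(
    batch_size=32,
    lr=0.01,              # smaller learning rate than the original 0.1
    betas=(0.85, 0.89),
    num_samples=3,
    num_output_dims=2
)

trainer = pl.Trainer(
    min_epochs=5,
    max_epochs=8,
    gpus=1,
    precision=64,         # run the whole model in float64
    logger=TensorBoardLogger("lightning_logs/", name="DGP")
)
trainer.fit(model)

Outside Lightning, the equivalent is to cast both the model and the data manually, e.g. gpmodel = gpmodel.double() and x, y = x.double(), y.double(), so that GPyTorch's Cholesky factorizations run in double precision.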