
NotPSDError: Matrix not positive definite after repeatedly adding jitter up to 1.0e-06 when running on GPU

Open · AugustEhl opened this issue

I get the following error:

NotPSDError: Matrix not positive definite after repeatedly adding jitter up to 1.0e-06

The code below works when running on the CPU, but fails when I switch to the GPU. Why does it only work on the CPU and not on the GPU?

I am training a deep GP (DGP) for regression using PyTorch Lightning, which I have constructed as shown below. The input dimension to the first DGP layer is 256:

import torch
import gpytorch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, random_split
from pytorch_lightning.loggers import TensorBoardLogger
from gpytorch.mlls import DeepApproximateMLL, VariationalELBO
# DeepGP (my deep GP model) and load_dataset() are defined elsewhere.

class PL_model(pl.LightningModule):
    def __init__(self,
                 batch_size,
                 lr,
                 betas,
                 num_samples,
                 num_output_dims
                ):
        super().__init__()
        # Training parameters
        self.batch_size = batch_size
        self.lr = lr
        self.betas = betas
        self.num_samples = num_samples
        self.num_output_dims = num_output_dims

        # Deep GP with 256-dimensional inputs and its approximate marginal log likelihood
        self.gpmodel = DeepGP(256, self.num_output_dims)
        self.mll = DeepApproximateMLL(VariationalELBO(self.gpmodel.likelihood, self.gpmodel, self.batch_size))

    def forward(self, x):
        # Compute the predictive distribution of the deep GP
        return self.gpmodel(x)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr, betas=self.betas, weight_decay=1e-3)

    def training_step(self, batch, batch_idx):
        x, y = batch
        # Negative ELBO as the training loss, averaged over num_samples likelihood samples
        with gpytorch.settings.num_likelihood_samples(self.num_samples):
            output = self(x)
            loss = -self.mll(output, y)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        # Negative log marginal likelihood on the validation batch
        with torch.no_grad():
            lls = self.gpmodel.likelihood.log_marginal(y, self(x))
        return -lls

    def setup(self, stage=None):
        # 80/20 train/validation split
        dataset = load_dataset()
        train_split = int(0.8 * len(dataset))
        val_split = len(dataset) - train_split
        self.train_set, self.val_set = random_split(dataset, [train_split, val_split])

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=self.batch_size, shuffle=False)

model = PL_model(
    batch_size=32,
    lr=0.1,
    betas=(0.85, 0.89),
    num_samples=3,
    num_output_dims=2
)

trainer = pl.Trainer(
    min_epochs=5,
    max_epochs=8,
    gpus=1,
    logger=TensorBoardLogger("lightning_logs/", name="DGP")
)
trainer.fit(model)

AugustEhl · Jun 23 '22 12:06

I got the same error. Did you fix it? I actually have it the other way around: mine was working fine on GPU but not on CPU...

Songloading · Jun 24 '22 20:06

No, not yet, unfortunately. I think I might just skip PL for now.

AugustEhl · Jun 25 '22 09:06

There are slight differences in GPU versus CPU routines, and it's possible that the roundoff errors go in "different directions" for different hardware. CUDA even has some stochastic routines which could be causing the difference.
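
As a side note, if you want to rule out the non-deterministic CUDA kernels as the source of run-to-run variation, the standard PyTorch switches below are a minimal sketch (they make GPU runs reproducible, but they will not make GPU results match CPU results):

# Sketch: force deterministic CUDA kernels (standard PyTorch flags, not GPyTorch-specific).
import os
import torch

os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # required by some cuBLAS ops
torch.use_deterministic_algorithms(True)           # raise an error on non-deterministic kernels
torch.backends.cudnn.benchmark = False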

The real issue here is that your model/data is likely not very stable, and so you are on the cusp of numerical errors on either hardware. Try z-scoring your data, and see if that makes the model more stable.
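
For example, a minimal sketch of z-scoring, assuming train_x / train_y are your raw training tensors (illustrative names, not taken from the code above):

# Sketch: z-score inputs and targets using statistics from the training split only.
x_mean, x_std = train_x.mean(dim=0), train_x.std(dim=0)
y_mean, y_std = train_y.mean(), train_y.std()

train_x = (train_x - x_mean) / (x_std + 1e-8)
train_y = (train_y - y_mean) / (y_std + 1e-8)

# Apply the same training statistics to validation/test data, and undo the target
# scaling when reporting predictions: pred = pred_normalized * y_std + y_mean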

gpleiss · Jun 29 '22 17:06

@gpleiss I got this error while using the VNNGP at prediction time. My data is already normalized.

mbelalsh · Dec 08 '22 08:12

@mbelalsh it happens; it's a known stability issue with Gaussian processes. It is a property of your data, combined with the fact that all computations are done in single precision.

Try switching to double precision, or using smaller learning rates on your optimizer.
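
For example, a minimal sketch of both suggestions, reusing the PL_model class from the original post (the exact values are illustrative):

# Sketch: double precision plus a smaller learning rate, reusing PL_model from above.
torch.set_default_dtype(torch.float64)  # new tensors/parameters are created in float64

model = PL_model(
    batch_size=32,
    lr=0.01,              # smaller learning rate than the original 0.1
    betas=(0.85, 0.89),
    num_samples=3,
    num_output_dims=2
)
model = model.double()    # cast any remaining float32 parameters/buffers to float64

# Make sure the data loaders also yield float64 tensors (e.g. x.double(), y.double());
# with PyTorch Lightning you can alternatively pass precision=64 to the Trainer.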

gpleiss · Dec 09 '22 23:12