Nonstationary kernels
What's the best approach to creating a kernel that locally adapts to varying smoothness in the data? I'm specifically looking at this paper:
Plagemann 2008 Nonstationary Gaussian Process Regression using Point Estimates of Local Smoothness: http://ais.informatik.uni-freiburg.de/publications/papers/plagemann08ecml.pdf
The idea is essentially to run a K-means-type clustering algorithm on the training points to find 'inducing'-type locations, fit a local GP to each of those areas, and have a parent latent GP govern the local length-scales.
Is this something that would be best suited as a kernel wrapper class like the ScaleKernel or InducingPointKernel classes? All of the GPs need to be optimized together. Any guidance would be appreciated.
So if I understand that paper correctly, they basically derive a kernel (Equation 4) that allows you to define arbitrary functions over lengthscales and then when you compute k(x_i, x_j) you also compute \ell(x_i) and \ell(x_j).
How set are you on the "use a GP prior over \ell" plan / how little data do you have? Implementing \ell(x) using, say, a two-layer neural network (plus a ReLU at the end to make the output positive) would be really trivial, since you could just register the neural network in this hypothetical NonstationaryKernel and then, when forward hands you x1 and x2, pass both through the NN to get \ell_i and \ell_j from their Equation 4. After that, training through the exact MLL should do the right thing. Maybe.
Using a GP won't be that much harder, but you'll probably have to define the loss in Equation 5 by hand rather than using our ExactMarginalLogLikelihood.
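For reference, Equation 4 as I read it is the isotropic Paciorek and Schervish form, with \ell_i = \ell(x_i) playing the role of a local squared lengthscale (double-check my transcription against the paper):

k(x_i, x_j) = \frac{\ell_i^{1/4} \, \ell_j^{1/4}}{\left( \frac{\ell_i + \ell_j}{2} \right)^{1/2}} \exp\left( - \frac{\lVert x_i - x_j \rVert^2}{\frac{\ell_i + \ell_j}{2}} \right)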
Yes, I was worried it would require a different loss. Your first option sounds reasonable. In practice I wonder how much the GP prior would help overall, or whether it would make gradient optimization more difficult.
Right now I'm just working with a small dataset of about 50 rows that has 'bumpy' patterns, hence the need for multiple GPs to best explain the field. With the neural network approach it should scale reasonably for ExactGP. Thoughts on how that structure would look?
@stanbiryukov Yeah, so basically what I'd do is define a new kernel:
class NonstationaryKernel(Kernel):
    def __init__(self, neural_net):
        # Register base modules, call super init, etc.

    def forward(self, x1, x2, diag=False, **params):
        ls1 = self.neural_net(x1)  # Assume NN maps n x d -> n x 1, or b x n x d -> b x n x 1 for batch.
        ls2 = self.neural_net(x2)  # Same here
        # Compute rest of the kernel from the paper where ls1[i] corresponds to \ell_i, e.g. can do x1.div(ls1)
        # and ls2[j] corresponds to \ell_j.
        # This also assumes that your NN has a ReLU at the end.
Does that make sense? In terms of the NN architecture, I'd probably start with something pretty small, like a 2 layer net with smoother (e.g. like tanh) nonlinearities.
Thanks for the direction, @jacobrgardner .
This makes sense. I've drafted a more fleshed-out example below. If I train with the default exact GP framework from Simple_GP_Regression.ipynb, I mostly get singular covariance matrix errors. However, when the model does run for 50 iterations, model(test_x).mean.cpu() produces something along the lines of "The size of tensor a (51) must match the size of tensor b (151) at non-singleton dimension 0". Anything look immediately wrong?
import torch
import gpytorch

class NSNet(torch.nn.Sequential):
    def __init__(self, data_dim):
        super(NSNet, self).__init__()
        self.data_dim = data_dim
        self.add_module("linear1", torch.nn.Linear(data_dim, 256, bias=True))
        self.add_module("act1", torch.nn.Tanh())
        self.add_module("linear2", torch.nn.Linear(256, 1, bias=True))
        self.add_module("actout", torch.nn.ReLU())
class NonstationaryKernel(gpytorch.kernels.Kernel):
    def __init__(self, data_dim):
        super(NonstationaryKernel, self).__init__(data_dim)
        # Register base modules, call super init etc
        self.neural_net = NSNet(data_dim=data_dim)

    def forward(self, x1, x2, diag=False, **params):
        ls1 = self.neural_net(x1)  # Assume NN maps n x d -> n x 1, or b x n x d -> b x n x 1 for batch.
        ls2 = self.neural_net(x2)  # Same here
        # Compute rest of the kernel from the paper where ls1[i] corresponds to \ell_i, e.g. can do x1.div(ls1)
        # and ls2[j] corresponds to \ell_j.
        # This also assumes that your NN has a ReLU at the end.
        x1_ = x1.div(self.lengthscale)
        x2_ = x2.div(self.lengthscale)
        # Distance matrix
        X1 = x1_.pow(2).sum(dim=1, keepdim=True)
        Z2 = x2_.pow(2).sum(dim=1, keepdim=True)
        XZ = x1_.matmul(x2_.t())
        r2 = X1 - 2 * XZ + Z2.t()
        d_ = torch.sqrt(r2.clamp(min=1e-12))
        # Equation 4
        k_ = torch.exp(-d_.t() * ((ls1.sum(dim=1).add(ls2.sum(dim=1)).div(2)).pow(-1)) * d_)
        n = torch.abs(ls1.sum(dim=1)).pow(.25).mul(torch.abs(ls2.sum(dim=1)).pow(.25))
        d = torch.abs((ls1.sum(dim=1).add(ls2.sum(dim=1))).div(2)).pow(.5).clamp(min=1e-12)
        rs = n.div(d)
        k = rs.mul(k_)
        # print(rs)
        return k
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        if train_x.dim() > 1:
            # wrap with gpytorch.kernels.ScaleKernel?
            self.covar_module = gpytorch.kernels.ScaleKernel(NonstationaryKernel(data_dim=train_x.shape[1]))
        else:
            self.covar_module = gpytorch.kernels.ScaleKernel(NonstationaryKernel(data_dim=1))
            # self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
Here's a better working example where it appears the parameters are being learned with SGD. However, I'm puzzled by what looks like x2 changing shape in the forward function when I set the model to eval mode and attempt to predict.
import numpy as np
import torch
import gpytorch

class NSNet(torch.nn.Sequential):
    def __init__(self, data_dim):
        super(NSNet, self).__init__()
        self.data_dim = data_dim
        self.add_module("linear1", torch.nn.Linear(data_dim, 256, bias=True))
        self.add_module("act1", torch.nn.ReLU())
        self.add_module("linear2", torch.nn.Linear(256, 128, bias=True))
        self.add_module("act2", torch.nn.ReLU())
        self.add_module("linearout", torch.nn.Linear(128, 1, bias=True))

class NonstationaryKernel(gpytorch.kernels.Kernel):
    def __init__(self, data_dim, **kwargs):
        super(NonstationaryKernel, self).__init__(data_dim, **kwargs)
        # Register base modules, call super init etc
        self.neural_net = NSNet(data_dim=data_dim)

    def forward(self, x1, x2, diag=False, **params):
        print(x1.shape, x2.shape)
        ls1 = torch.exp(self.neural_net(x1))  # Assume NN maps n x d -> n x 1, or b x n x d -> b x n x 1 for batch.
        ls2 = torch.exp(self.neural_net(x2))  # Same here
        # clone because inplace operations will mess with what's saved for backward
        ls1_ = ls1.clone()
        ls2_ = ls2.clone()
        print(ls1.shape, ls2.shape)
        # Compute rest of the kernel from the paper where ls1[i] corresponds to \ell_i, e.g. can do x1.div(ls1)
        # and ls2[j] corresponds to \ell_j.
        # (Here the exp above ensures positivity rather than a ReLU at the end of the NN.)
        x1_ = x1.div(self.lengthscale)
        x2_ = x2.div(self.lengthscale)
        # Distance matrix
        X1 = x1_.pow(2).sum(dim=1, keepdim=True)
        X2 = x2_.pow(2).sum(dim=1, keepdim=True)
        XX = x1_.matmul(x2_.t())
        r2 = X1 - 2 * XX + X2.t()
        d = torch.sqrt(r2.clamp(min=1e-12))
        d_ = d.clone()
        # Equation 4
        lx = ((ls1_.sum(dim=1).add(ls2_.sum(dim=1)).div(2)).pow(-1))
        k_ = torch.exp(-d_.t() * lx * d_)
        print(lx.shape, d_.shape)
        n = torch.abs(ls1_.sum(dim=1)).pow(.25).mul(torch.abs(ls2_.sum(dim=1)).pow(.25))
        dn = torch.abs((ls1_.sum(dim=1).add(ls2_.sum(dim=1))).div(2)).pow(.5).clamp(min=1e-12)
        kout = n.div(dn).mul(k_)
        print(kout.shape)
        return kout
train_x = torch.linspace(0, 1, 100).reshape(-1, 1)
train_y = torch.sin(train_x * (2 * np.pi)) + torch.randn(train_x.size()) * 0.2
train_y = train_y.reshape(-1,)
test_x = torch.linspace(0, 1, 51).reshape(-1, 1)

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(NonstationaryKernel(data_dim=train_x.shape[1]))

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# initialize likelihood and model
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use SGD
optimizer = torch.optim.SGD([
    {'params': model.parameters()},  # Includes GaussianLikelihood parameters
], lr=.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

training_iter = 10
for i in range(training_iter):
    # Zero gradients from previous iteration
    optimizer.zero_grad()
    # Output from model
    output = model(train_x)
    # Calc loss and backprop gradients
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f lengthscale: %.3f noise: %.3f' % (
        i + 1, training_iter, loss.item(),
        model.covar_module.base_kernel.lengthscale.item(),
        model.likelihood.noise.item()
    ))
    optimizer.step()
torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100]) torch.Size([100, 100]) torch.Size([100, 100])
Iter 1/10 - Loss: 0.902 lengthscale: 0.693 noise: 0.693
torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100]) torch.Size([100, 100]) torch.Size([100, 100])
Iter 2/10 - Loss: 0.887 lengthscale: 0.689 noise: 0.679
torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100]) torch.Size([100, 100]) torch.Size([100, 100])
Iter 3/10 - Loss: 0.873 lengthscale: 0.684 noise: 0.665
torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100]) torch.Size([100, 100]) torch.Size([100, 100])
Iter 4/10 - Loss: 0.857 lengthscale: 0.680 noise: 0.651
torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100]) torch.Size([100, 100]) torch.Size([100, 100])
Iter 5/10 - Loss: 0.841 lengthscale: 0.675 noise: 0.636
torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100]) torch.Size([100, 100]) torch.Size([100, 100])
Iter 6/10 - Loss: 0.825 lengthscale: 0.670 noise: 0.622
torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100]) torch.Size([100, 100]) torch.Size([100, 100])
Iter 7/10 - Loss: 0.810 lengthscale: 0.666 noise: 0.607
torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100]) torch.Size([100, 100]) torch.Size([100, 100])
Iter 8/10 - Loss: 0.795 lengthscale: 0.662 noise: 0.592
torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100]) torch.Size([100, 100]) torch.Size([100, 100])
Iter 9/10 - Loss: 0.780 lengthscale: 0.658 noise: 0.578
torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100, 1]) torch.Size([100]) torch.Size([100, 100]) torch.Size([100, 100])
Iter 10/10 - Loss: 0.767 lengthscale: 0.655 noise: 0.564
model.eval()
likelihood.eval()
with torch.no_grad():
    fposterior = model(test_x)
torch.Size([51, 1]) torch.Size([151, 1])
torch.Size([51, 1]) torch.Size([151, 1])
RuntimeError: The size of tensor a (51) must match the size of tensor b (151) at non-singleton dimension 0
Realized I can default to the built-in self.covar_dist instead of building the distance matrix myself. Nonetheless, @jacobrgardner, I'm not sure what changes in eval mode to give x2 a non-conforming shape. Any suggestions?
It must be related to this concatenation, given that my training data has 100 rows and my test set has 51 (100 + 51 = 151): https://github.com/cornellius-gp/gpytorch/blob/90489c5b41ec79f646cae26332ecbad2ce1fac5d/gpytorch/models/exact_gp.py#L297
@jacobrgardner It seems that embedding a neural network within a gpytorch.kernels.Kernel is leading to issues with posterior inference. Are there components that I'm missing in the above example or is this a separate issue?
Hey @stanbiryukov
Sorry this took so long to look into; things are pretty busy right before ICML.
Your issue is just batch shaping stuff that only ends up showing up when we try to compute cross covariances between training data and test data. For example:
lx = ((ls1_.sum(dim=1).add(ls2_.sum(dim=1)).div(2)).pow(-1))
doesn't quite do what you want -- I think what you probably want lx to be is the matrix whose ijth entry is 0.5*\ell_i^2 + 0.5*\ell_j^2. To accomplish this, you can add unsqueezes (although the sum then unsqueeze for ls1_ is redundant):
lx = ((ls1_.sum(dim=1).unsqueeze(-1).add(ls2_.sum(dim=1).unsqueeze(-2)).div(2)).pow(-1))
The way this works is it takes ls1_ to be 51 x 1 and ls2_ to be 1 x 151 so that ls1_ + ls2_ broadcasts to give a 51 x 151 matrix.
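To see the broadcasting in isolation, here is a standalone example with made-up shapes matching the ones above:

import torch

ls1_ = torch.rand(51, 1)   # stand-in for the NN output at the 51 test points
ls2_ = torch.rand(151, 1)  # stand-in for the NN output at the 151 concatenated train + test points

a = ls1_.sum(dim=1).unsqueeze(-1)  # 51 x 1
b = ls2_.sum(dim=1).unsqueeze(-2)  # 1 x 151
lx = (a + b).div(2).pow(-1)        # broadcasts to 51 x 151; entry ij is ((\ell_i + \ell_j) / 2)^-1
print(lx.shape)                    # torch.Size([51, 151])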
I'm also not clear why you want:
k_ = torch.exp(- d_.t() * lx * d_)
since r2 was already the squared distances and you then took the square root to get d_. This should probably be something like k_ = torch.exp(-d_ * lx), where d_ is the squared distance (i.e., skip the d_.sqrt())?
Replacing some of your code with the following works for me in the sense that it at least runs through, but I'd recommend double checking the math:
r2 = X1 - 2 * XX + X2.t()
d_ = r2
# Equation 4
lx = ((ls1_.sum(dim=1).unsqueeze(-1).add(ls2_.sum(dim=1).unsqueeze(-2)).div(2)).pow(-1))
k_ = torch.exp(- d_ * lx)
print (lx.shape, d_.shape)
n = torch.abs(ls1_.sum(dim=1).unsqueeze(-1)).pow(.25).mul(torch.abs(ls2_.sum(dim=1).unsqueeze(-2)).pow(.25))
dn = torch.abs( (ls1_.sum(dim=1).unsqueeze(-1).add(ls2_.sum(dim=1).unsqueeze(-2))).div(2)).pow(.5).clamp(min=1e-12)
kout = n.div(dn).mul(k_)
Hi @jacobrgardner totally understand - thanks for the help as always. I'll see if I can make it to ICML this year.
The unsqueeze makes sense, thanks for guiding me through the broadcasting fix. I'll replace the distance calc with self.covar_dist and work a bit more on this.
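For reference, a rough sketch of that covar_dist version, as a drop-in replacement for the forward above (this assumes covar_dist accepts square_dist=True, which it does in the gpytorch versions used here, and treats the NN output as a positive local squared lengthscale; the math should still be checked against the paper):

def forward(self, x1, x2, diag=False, **params):
    ls1 = torch.exp(self.neural_net(x1)).squeeze(-1)  # n
    ls2 = torch.exp(self.neural_net(x2)).squeeze(-1)  # m
    if not diag:
        ls1 = ls1.unsqueeze(-1)  # n x 1
        ls2 = ls2.unsqueeze(-2)  # 1 x m
    avg = (ls1 + ls2).div(2)  # (\ell_i + \ell_j) / 2, broadcast to n x m
    # squared distances; covar_dist handles the rectangular train/test case for us
    r2 = self.covar_dist(x1.div(self.lengthscale), x2.div(self.lengthscale),
                         square_dist=True, diag=diag, **params)
    prefactor = (ls1 * ls2).pow(0.25).div(avg.sqrt().clamp(min=1e-12))
    return prefactor * torch.exp(-r2 / avg)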
Hi @jacobrgardner - with version 1.0.1 of gpytorch, I'm seeing a lazy evaluation error that I'm having trouble tracing back to the root cause. Any suggestions on how to now properly evaluate the neural network and kernel parameters in the forward definition?
torch.Size([100, 1]) torch.Size([100, 1])
TypeError: div(): argument 'other' (position 1) must be Tensor, not NoneType
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-13-21f16b3dd7b7> in <module>
78 output = model(train_x)
79 # Calc loss and backprop gradients
---> 80 loss = -mll(output, train_y)
81 loss.backward()
82 print('Iter %d/%d - Loss: %.3f lengthscale: %.3f noise: %.3f' % (
/opt/conda/lib/python3.6/site-packages/gpytorch/module.py in __call__(self, *inputs, **kwargs)
26
27 def __call__(self, *inputs, **kwargs):
---> 28 outputs = self.forward(*inputs, **kwargs)
29 if isinstance(outputs, list):
30 return [_validate_module_outputs(output) for output in outputs]
/opt/conda/lib/python3.6/site-packages/gpytorch/mlls/exact_marginal_log_likelihood.py in forward(self, function_dist, target, *params)
49 # Get the log prob of the marginal distribution
50 output = self.likelihood(function_dist, *params)
---> 51 res = output.log_prob(target)
52
53 # Add additional terms (SGPR / learned inducing points, heteroskedastic likelihood models)
/opt/conda/lib/python3.6/site-packages/gpytorch/distributions/multivariate_normal.py in log_prob(self, value)
133
134 # Get log determininat and first part of quadratic form
--> 135 inv_quad, logdet = covar.inv_quad_logdet(inv_quad_rhs=diff.unsqueeze(-1), logdet=True)
136
137 res = -0.5 * sum([inv_quad, logdet, diff.size(-1) * math.log(2 * math.pi)])
/opt/conda/lib/python3.6/site-packages/gpytorch/lazy/lazy_tensor.py in inv_quad_logdet(self, inv_quad_rhs, logdet, reduce_inv_quad)
1000 from .chol_lazy_tensor import CholLazyTensor
1001
-> 1002 cholesky = CholLazyTensor(self.cholesky())
1003 return cholesky.inv_quad_logdet(inv_quad_rhs=inv_quad_rhs, logdet=logdet, reduce_inv_quad=reduce_inv_quad)
1004
/opt/conda/lib/python3.6/site-packages/gpytorch/lazy/lazy_tensor.py in cholesky(self, upper)
737 (LazyTensor) Cholesky factor (lower triangular)
738 """
--> 739 res = self._cholesky()
740 if upper:
741 res = res.transpose(-1, -2)
/opt/conda/lib/python3.6/site-packages/gpytorch/utils/memoize.py in g(self, *args, **kwargs)
32 cache_name = name if name is not None else method
33 if not is_in_cache(self, cache_name):
---> 34 add_to_cache(self, cache_name, method(self, *args, **kwargs))
35 return get_from_cache(self, cache_name)
36
/opt/conda/lib/python3.6/site-packages/gpytorch/lazy/lazy_tensor.py in _cholesky(self)
400 from .keops_lazy_tensor import KeOpsLazyTensor
401
--> 402 evaluated_kern_mat = self.evaluate_kernel()
403
404 if any(isinstance(sub_mat, KeOpsLazyTensor) for sub_mat in evaluated_kern_mat._args):
/opt/conda/lib/python3.6/site-packages/gpytorch/lazy/lazy_tensor.py in evaluate_kernel(self)
883 all lazily evaluated kernels actually evaluated.
884 """
--> 885 return self.representation_tree()(*self.representation())
886
887 def inv_matmul(self, right_tensor, left_tensor=None):
/opt/conda/lib/python3.6/site-packages/gpytorch/lazy/lazy_tensor.py in representation_tree(self)
1273 including all subobjects. This is used internally.
1274 """
-> 1275 return LazyTensorRepresentationTree(self)
1276
1277 @property
/opt/conda/lib/python3.6/site-packages/gpytorch/lazy/lazy_tensor_representation_tree.py in __init__(self, lazy_tsr)
11 for arg in lazy_tsr._args:
12 if hasattr(arg, "representation") and callable(arg.representation): # Is it a lazy tensor?
---> 13 representation_size = len(arg.representation())
14 self.children.append((slice(counter, counter + representation_size, None), arg.representation_tree()))
15 counter += representation_size
/opt/conda/lib/python3.6/site-packages/gpytorch/lazy/lazy_evaluated_kernel_tensor.py in representation(self)
311 # representation
312 else:
--> 313 return self.evaluate_kernel().representation()
314
315 def representation_tree(self):
/opt/conda/lib/python3.6/site-packages/gpytorch/utils/memoize.py in g(self, *args, **kwargs)
32 cache_name = name if name is not None else method
33 if not is_in_cache(self, cache_name):
---> 34 add_to_cache(self, cache_name, method(self, *args, **kwargs))
35 return get_from_cache(self, cache_name)
36
/opt/conda/lib/python3.6/site-packages/gpytorch/lazy/lazy_evaluated_kernel_tensor.py in evaluate_kernel(self)
278 temp_active_dims = self.kernel.active_dims
279 self.kernel.active_dims = None
--> 280 res = self.kernel(x1, x2, diag=False, last_dim_is_batch=self.last_dim_is_batch, **self.params)
281 self.kernel.active_dims = temp_active_dims
282
/opt/conda/lib/python3.6/site-packages/gpytorch/kernels/kernel.py in __call__(self, x1, x2, diag, last_dim_is_batch, **params)
394 res = LazyEvaluatedKernelTensor(x1_, x2_, kernel=self, last_dim_is_batch=last_dim_is_batch, **params)
395 else:
--> 396 res = lazify(super(Kernel, self).__call__(x1_, x2_, last_dim_is_batch=last_dim_is_batch, **params))
397 return res
398
/opt/conda/lib/python3.6/site-packages/gpytorch/module.py in __call__(self, *inputs, **kwargs)
26
27 def __call__(self, *inputs, **kwargs):
---> 28 outputs = self.forward(*inputs, **kwargs)
29 if isinstance(outputs, list):
30 return [_validate_module_outputs(output) for output in outputs]
<ipython-input-13-21f16b3dd7b7> in forward(self, x1, x2, diag, **params)
20 def forward(self, x1, x2, diag=False, **params):
21 print(x1.shape, x2.shape)
---> 22 x1_ = x1.div(self.lengthscale)
23 x2_ = x2.div(self.lengthscale)
24 ls1 = torch.exp(
TypeError: div(): argument 'other' (position 1) must be Tensor, not NoneType
@stanbiryukov looks like this might have to do with a change we made to how we mark kernels as having lengthscales, which we didn't realize would be breaking.
Try adding this: https://github.com/cornellius-gp/gpytorch/blob/5ea8483114813655c792321b334a1f8e15788d74/gpytorch/kernels/rbf_kernel.py#L70
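Concretely, assuming that link points at the has_lengthscale = True class attribute (as in RBFKernel), the change would look roughly like:

class NonstationaryKernel(gpytorch.kernels.Kernel):
    # In gpytorch 1.0+, kernels opt in to a lengthscale via this class attribute
    has_lengthscale = True

    def __init__(self, data_dim, **kwargs):
        super(NonstationaryKernel, self).__init__(**kwargs)
        self.neural_net = NSNet(data_dim=data_dim)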
I was curious if/how those flags were used. Unfortunately, adding that didn't seem to work; I received the exact same error.
@stanbiryukov Could you provide an updated sample of code to run that produces the error?
@jacobrgardner Sure - here's a colab notebook with a MWE: https://colab.research.google.com/drive/1ymAidoiNmK73pa5xzG4BMnUAxX9wjZDD
Hi @stanbiryukov!
I am currently working on a problem where the solution might be a kernel that locally adapts to varying smoothness, as you describe here. I was wondering whether you had any success with your approach based on the Plagemann paper?
Kind regards Kasper
Hi @KasperNiklas - in practice, I found that an anisotropic Matérn kernel with additional feature engineering worked quite well for a breadth of problems. Specifically, for time series interpolation and forecasting, I added covariates by decomposing timestamps into cyclical continuous features (i.e., sine/cosine transforms of day and month), as sketched below. This allowed the kernel to learn a unique lengthscale for each of those degrees of freedom and provided the adaptive local smoothing I was looking for.
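As a rough sketch of that encoding (the feature extraction here is made up for illustration; the point is the sine/cosine transform plus an ARD kernel):

import math

import torch
import gpytorch

# Suppose day-of-year (1-365) has been extracted from the timestamps
day_of_year = torch.arange(1, 366, dtype=torch.float32)
month = ((day_of_year - 1) / 31).floor() + 1  # crude month approximation, just for the example

features = torch.stack([
    torch.sin(2 * math.pi * day_of_year / 365), torch.cos(2 * math.pi * day_of_year / 365),
    torch.sin(2 * math.pi * month / 12), torch.cos(2 * math.pi * month / 12),
], dim=-1)

# An anisotropic (ARD) Matern kernel then learns a separate lengthscale per feature column
covar_module = gpytorch.kernels.ScaleKernel(
    gpytorch.kernels.MaternKernel(nu=2.5, ard_num_dims=features.shape[-1])
)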
That said, looking at the problem you referenced, a multivariate Gaussian structure may not be the best candidate given the sharpness in the data. I recommend modeling with a normalizing flow, which can capture more complex PDFs than GPs. See Neural Spline Flows for an overview. I've found pzflow to be an excellent library for modeling the joint probability distribution of data that is not as amenable to the Gaussian likelihood of GPs.
Thank you @stanbiryukov for taking the time to answer my question! I like your idea with decomposition of the timestamps, I will definitely remember that one for future projects :)
We will look into your suggestion of modeling the problem with a normalising flow.