GPflow
Multi-class classification with softmax likelihood
Hi there,
Since this is a "how do I do" question, I first posted it on Stack Overflow, but having received no response there, I am copying the question here.
The GPflow docs provide an example for multi-class classification with the robust-max function. I am trying to train a multi-class classifier with the softmax likelihood instead, which is also implemented in GPflow but I can't find any documentation or examples on how to use it correctly.
Please find an example of what I have tried below. During training, the loss decreases smoothly.
The robust-max example mentioned above uses categorical labels, i.e., the values 0, 1, 2, but simply replacing the robust-max with the softmax likelihood raises an IndexError in the quadrature method. Therefore, I assume the model with the softmax likelihood requires one-hot encoded labels. At test time, however, I noticed that for this three-class toy example the model never predicts the third class. Upon closer inspection, the softmax likelihood has the following method:
def _log_prob(self, F, Y):
    return -tf.nn.sparse_softmax_cross_entropy_with_logits(logits=F, labels=Y[:, 0])
which looks like it expects an array of categorical labels of shape [num_samples, 1].
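To illustrate my suspicion with a tiny stand-alone snippet (not GPflow-specific, just NumPy): slicing one-hot labels with Y[:, 0] only ever yields 0s and 1s, so the third class can never appear as a target:
import numpy as np

Y_hot = np.eye(3, dtype=int)[[0, 1, 2, 1]]  # one-hot labels for classes 0, 1, 2, 1
print(Y_hot[:, 0])  # [1 0 0 0] -- only 0s and 1s; class 2 never shows up as a label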
What is the right way to use the softmax likelihood for GP multi-class classification?
import numpy as np
import tensorflow as tf
import gpflow
from gpflow.likelihoods.multiclass import Softmax
from tqdm.auto import tqdm

np.random.seed(0)
tf.random.set_seed(123)

# Number of functions and number of data points
num_classes = 3
N = 100

# Create training data
# Jitter
jitter_eye = np.eye(N) * 1e-6
# Input
X = np.random.rand(N, 1)
# SquaredExponential kernel matrix
kernel_se = gpflow.kernels.SquaredExponential(lengthscales=0.1)
K = kernel_se(X) + jitter_eye
# Latents prior sample
f = np.random.multivariate_normal(mean=np.zeros(N), cov=K, size=(num_classes)).T
# Hard max observation
Y = np.argmax(f, 1).reshape(-1,).astype(int)
# One-hot encoding
Y_hot = np.zeros((N, num_classes), dtype=int)
Y_hot[np.arange(N), Y] = 1
data = (X, Y_hot)

# Sum kernel: Matern32 + White
kernel = gpflow.kernels.Matern32() + gpflow.kernels.White(variance=0.01)
likelihood = Softmax(num_classes)
m = gpflow.models.VGP(
    data=data,
    kernel=kernel,
    likelihood=likelihood,
    num_latent_gps=num_classes,
)


def run_adam(model, iterations):
    """
    Utility function running the Adam optimizer
    """
    losses = []
    training_loss = model.training_loss
    # Create an Adam optimizer
    optimizer = tf.optimizers.Adam()

    @tf.function
    def optimization_step():
        optimizer.minimize(training_loss, model.trainable_variables)

    for step in tqdm(range(iterations), total=iterations):
        optimization_step()
        if step % 10 == 0:
            elbo = -training_loss().numpy()
            losses.append(elbo)
    return losses


run_adam(model=m, iterations=10000)

y_pred = m.predict_y(X)[0]
print("Training accuracy: {:3.2f}".format(np.mean(Y == np.argmax(y_pred, axis=1))))
It seems that the inference procedure is not "stochastic": in each iteration, the whole dataset is used to calculate the ELBO. Therefore, Adam does not seem necessary here.
Softmax uses Monte Carlo to estimate the expected log likelihood, so you do always need Adam. The nice thing about RobustMax is that, for the full dataset, you can use BFGS. My experience is that RobustMax isn't that great from a statistical perspective, though...
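To make that concrete, a rough sketch of the RobustMax/MultiClass route (untested, reusing the X, Y, num_classes, and kernel from your script above) might look like this:
import gpflow

# RobustMax inverse link + MultiClass likelihood (instead of Softmax)
invlink = gpflow.likelihoods.RobustMax(num_classes)
likelihood_rm = gpflow.likelihoods.MultiClass(num_classes, invlink=invlink)

m_rm = gpflow.models.VGP(
    data=(X, Y.reshape(-1, 1).astype(float)),  # categorical labels of shape [N, 1]
    kernel=gpflow.kernels.Matern32() + gpflow.kernels.White(variance=0.01),
    likelihood=likelihood_rm,
    num_latent_gps=num_classes,
)

# The ELBO is deterministic here, so a second-order optimizer (L-BFGS-B) works:
gpflow.optimizers.Scipy().minimize(
    m_rm.training_loss, m_rm.trainable_variables, options=dict(maxiter=1000)
)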
I can never remember what form the labels should be passed in, and it's very possible that there is a difference between RobustMax and Softmax. You found the correct method, and I would refer to the TensorFlow documentation of tf.nn.sparse_softmax_cross_entropy_with_logits to determine what the right format is.
Hope this helps.
Thanks for the hints. The method tf.nn.sparse_softmax_cross_entropy_with_logits expects categorical labels, i.e., integer values in the range [0, num_classes).
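For reference, here is a minimal stand-alone illustration of the two label formats (plain TensorFlow, nothing GPflow-specific; the variable names are just for this example):
import tensorflow as tf

logits = tf.random.normal((4, 3))   # [num_samples, num_classes]
y_int = tf.constant([0, 2, 1, 2])   # categorical labels in [0, num_classes)
y_hot = tf.one_hot(y_int, depth=3)  # the same labels, one-hot encoded

# The sparse variant takes integer class indices ...
sparse = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y_int)
# ... while the dense variant takes one-hot (or soft) labels
dense = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_hot)

tf.debugging.assert_near(sparse, dense)  # same per-sample cross entropy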
If I interpret the line D_out = Y.shape[1] correctly, the quadrature method expects one-hot encoded labels, though.
The code example above seems to work as expected when I replace labels=Y[:, 0] with labels=tf.argmax(Y, axis=1) in the softmax's _log_prob method. To apply this workaround, you can subclass the softmax likelihood and override the _log_prob method.
class MySoftmax(Softmax):
    def _log_prob(self, F, Y):
        return -tf.nn.sparse_softmax_cross_entropy_with_logits(logits=F, labels=tf.argmax(Y, axis=1))

...

likelihood = MySoftmax(num_classes)
With this workaround, the loss decreases smoothly and the accuracy on the training data approaches 1.0.
Unfortunately, I'm not very familiar with the GPflow codebase, so I don't know what this change might break.
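As a quick sanity check (only a minimal sketch, assuming the Softmax, MySoftmax, and num_classes defined above), the subclassed likelihood should give the same per-sample log-probabilities for one-hot labels as the original Softmax gives for the corresponding categorical labels:
import numpy as np
import tensorflow as tf

F = tf.random.normal((5, num_classes), dtype=tf.float64)  # dummy latent function values
y_int = np.random.randint(0, num_classes, size=(5, 1))    # categorical labels, shape [N, 1]
y_hot = np.eye(num_classes)[y_int[:, 0]]                  # the same labels, one-hot encoded

lp_onehot = MySoftmax(num_classes)._log_prob(F, y_hot)
lp_sparse = Softmax(num_classes)._log_prob(F, y_int)
np.testing.assert_allclose(lp_onehot.numpy(), lp_sparse.numpy())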
I just tried the solution provided by @btlorch and it seems to work well. I think it would be useful to document this somewhere in the docs. @st-- I saw the "good first issue" tag; happy to contribute if you think this is a sensible solution. Thank you!