
Multi-class classification with softmax likelihood

Open explicat opened this issue 4 years ago • 4 comments

Hi there,

Since this is a "how do I do" question, I first posted it on Stack Overflow, but having received no response there, I am copying it here.

The GPflow docs provide an example for multi-class classification with the robust-max function. I am trying to train a multi-class classifier with the softmax likelihood instead, which is also implemented in GPflow, but I can't find any documentation or examples on how to use it correctly.

Please find an example of what I have tried below. During training, the loss decreases smoothly.

The robust-max example mentioned above uses categorical labels, i.e., the values 0, 1, 2, but simply replacing the robust-max with the softmax likelihood raises an IndexError in the quadrature method. Therefore, I assume the model with the softmax likelihood requires one-hot encoded labels. At test time, however, I noticed that for this three-class toy example the model never predicts the third class. Upon closer inspection, the softmax likelihood has the following method

def _log_prob(self, F, Y):
    return -tf.nn.sparse_softmax_cross_entropy_with_logits(logits=F, labels=Y[:, 0])

which looks like it expects an array of categorical labels of shape [num_samples, 1].
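
To illustrate, here is a small standalone check of what that line sees when Y is one-hot encoded (the arrays below are made up just for this demonstration and are separate from my model code further down):

import numpy as np
import tensorflow as tf

num_classes = 3
Y_int = np.array([0, 1, 2, 1])                       # true class indices
Y_hot = np.eye(num_classes)[Y_int].astype(np.int64)  # one-hot labels, shape [4, 3]
F = tf.random.normal([4, num_classes])               # made-up latent values (logits)

# Y_hot[:, 0] is only the indicator of class 0, i.e. values in {0, 1}; class index 2
# never appears, which would be consistent with the third class never being predicted.
print(Y_hot[:, 0])  # [1 0 0 0]
print(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=F, labels=Y_hot[:, 0]))

# What the function actually expects: integer class indices of shape [num_samples]
print(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=F, labels=Y_int))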

What is the right way to use the softmax likelihood for GP multi-class classification?

import numpy as np
import tensorflow as tf
import gpflow
from gpflow.likelihoods.multiclass import Softmax
from tqdm.auto import tqdm


np.random.seed(0)
tf.random.set_seed(123)

# Number of functions and number of data points
num_classes = 3
N = 100

# Create training data
# Jitter
jitter_eye = np.eye(N) * 1e-6

# Input
X = np.random.rand(N, 1)

# SquaredExponential kernel matrix
kernel_se = gpflow.kernels.SquaredExponential(lengthscales=0.1)
K = kernel_se(X) + jitter_eye

# Latents prior sample
f = np.random.multivariate_normal(mean=np.zeros(N), cov=K, size=(num_classes)).T

# Hard max observation
Y = np.argmax(f, 1).reshape(-1,).astype(int)

# One-hot encoding
Y_hot = np.zeros((N, num_classes), dtype=int)
Y_hot[np.arange(N), Y] = 1

data = (X, Y_hot)

# sum kernel: Matern32 + White
kernel = gpflow.kernels.Matern32() + gpflow.kernels.White(variance=0.01)
likelihood = Softmax(num_classes)
m = gpflow.models.VGP(
    data=data,
    kernel=kernel,
    likelihood=likelihood,
    num_latent_gps=num_classes,
)


def run_adam(model, iterations):
    """
    Utility function running the Adam optimizer
    """

    # Create the Adam optimizer and a list for logging the ELBO
    losses = []
    training_loss = model.training_loss
    optimizer = tf.optimizers.Adam()

    @tf.function
    def optimization_step():
        optimizer.minimize(training_loss, model.trainable_variables)

    for step in tqdm(range(iterations), total=iterations):
        optimization_step()
        if step % 10 == 0:
            elbo = -training_loss().numpy()
            losses.append(elbo)
    return losses


run_adam(model=m, iterations=10000)
y_pred = m.predict_y(X)[0]
print("Training accuracy: {:3.2f}".format(np.mean(Y == np.argmax(y_pred, axis=1))))

explicat · Oct 06 '20

It seems that the inference procedure is not "stochastic": in each iteration, the whole dataset is used to calculate the ELBO. Therefore, Adam does not seem necessary here.

ltbyron · Dec 03 '20

Softmax uses Monte Carlo to estimate the expected log likelihood, so you do always need Adam. The nice thing about RobustMax is that for the full dataset, you can use BFGS. My experience is that RobustMax isn't that great from a statistical perspective, though...
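
For reference, a rough sketch (from memory, untested as written, with made-up toy data) of what I mean by the RobustMax route with a full-batch quasi-Newton optimizer:

import numpy as np
import gpflow

num_classes = 3
N = 100
X = np.random.rand(N, 1)
Y = np.random.randint(0, num_classes, size=(N, 1)).astype(float)  # class indices, shape [N, 1]

likelihood = gpflow.likelihoods.MultiClass(
    num_classes, invlink=gpflow.likelihoods.RobustMax(num_classes)
)
model = gpflow.models.VGP(
    data=(X, Y),
    kernel=gpflow.kernels.Matern32(),
    likelihood=likelihood,
    num_latent_gps=num_classes,
)

# The ELBO is deterministic here (no Monte Carlo), so a quasi-Newton optimizer works;
# gpflow.optimizers.Scipy uses L-BFGS-B by default.
gpflow.optimizers.Scipy().minimize(
    model.training_loss, model.trainable_variables, options=dict(maxiter=500)
)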

I can never remember what form the labels should be passed in, and it's very possible that there is a difference between RobustMax and Softmax. You found the correct method, and I would refer to the TensorFlow documentation of tf.nn.sparse_softmax_cross_entropy_with_logits to determine the right format.

Hope this helps.

markvdw · Dec 03 '20

Thanks for the hints. The method tf.nn.sparse_softmax_cross_entropy_with_logits expects categorical labels, i.e., integer values in the range [0, num_classes). If I interpret the line D_out = Y.shape[1] correctly, the quadrature method expects one-hot encoded labels, though.

The code example above seems to work as expected when I replace labels=Y[:, 0] with labels=tf.argmax(Y, axis=1) in the softmax's _log_prob method. To apply this workaround, you can subclass the Softmax likelihood and override its _log_prob method:

class MySoftmax(Softmax):
    def _log_prob(self, F, Y):
        return -tf.nn.sparse_softmax_cross_entropy_with_logits(logits=F, labels=tf.argmax(Y, axis=1))

...

likelihood = MySoftmax(num_classes)
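
As a quick standalone sanity check (my own test code, not anything from GPflow), the sparse cross-entropy on tf.argmax(Y, axis=1) agrees with the dense softmax cross-entropy on the one-hot labels themselves, which is the quantity I would expect the likelihood to compute:

import numpy as np
import tensorflow as tf

num_classes = 3
Y_hot = np.eye(num_classes)[[0, 1, 2, 1]].astype(np.float32)  # one-hot labels, shape [4, 3]
F = tf.random.normal([4, num_classes])                        # made-up logits

sparse = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=F, labels=tf.argmax(Y_hot, axis=1))
dense = tf.nn.softmax_cross_entropy_with_logits(logits=F, labels=Y_hot)
np.testing.assert_allclose(sparse.numpy(), dense.numpy(), rtol=1e-5)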

With this workaround, the loss decreases smoothly and the accuracy on the training data approaches 1.0.

Unfortunately, I'm not very familiar with the GPflow code base, so I don't know what this change might break.

btlorch · Dec 03 '20

I just tried the solution provided by @btlorch and it seems to work well. I think it would be useful to document this somewhere in the docs. @st-- I saw the "good first issue" tag; happy to contribute if you think this is a sensible solution. Thank you!

giovp · Jun 04 '21