Keras 3 gives incorrect output from evaluate/fit in distributed context

drasmuss opened this issue on Jun 20 '24 · 17 comments

In Keras 3, changing the number of replicas during distributed training/evaluation changes the output of the model:

import tensorflow as tf

import keras
# import tf_keras as keras

keras.utils.set_random_seed(0)

n_replicas = 4

gpus = tf.config.list_physical_devices("GPU")
tf.config.set_logical_device_configuration(
    gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=1000)] * n_replicas
)

batch_size = 12
x = tf.random.uniform((batch_size, 1), -1, 1, seed=0)
y = tf.random.uniform((batch_size, 10), -1, 1, seed=1)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    inp = keras.Input(shape=(1,))
    layer = keras.layers.Dense(10)
    model = keras.Model(inp, layer(inp))
    model.compile(loss="mse", optimizer="sgd")

    gt = keras.losses.mean_squared_error(y, model.predict(x, batch_size=batch_size))
    eval = model.evaluate(x, y, batch_size=batch_size)
    model.fit(x, y, batch_size=batch_size, epochs=1)
    post_gt = keras.losses.mean_squared_error(
        y, model.predict(x, batch_size=batch_size)
    )
    print(f"ground truth: {tf.reduce_mean(gt)}")
    print(f"evaluate: {eval}")
    print(f"post-fit output: {tf.reduce_mean(post_gt)}")

This gives output:

  • n_replicas=1:
    ground truth: 0.43009480834007263
    evaluate: 0.43009480834007263
    post-fit output: 0.4297996461391449
  • n_replicas=2:
    ground truth: 0.43009480834007263
    evaluate: 0.5054659843444824
    post-fit output: 0.4298612177371979
  • n_replicas=4:
    ground truth: 0.43009480834007263
    evaluate: 0.5540136098861694
    post-fit output: 0.4299061596393585

We can see that the ground truth is invariant to the number of replicas, as expected. But the loss value calculated by evaluate is incorrect for all n_replicas > 1. And this doesn't just affect evaluation: fit also results in a different change to the model output as we change the number of replicas.

If we switch to tf-keras, then we get the expected output regardless of the number of replicas:

  • n_replicas=1:
    ground truth: 0.43009480834007263
    evaluate: 0.43009480834007263
    post-fit output: 0.4297996461391449
  • n_replicas=2:
    ground truth: 0.43009480834007263
    evaluate: 0.43009480834007263
    post-fit output: 0.4297996461391449
  • n_replicas=4:
    ground truth: 0.43009480834007263
    evaluate: 0.43009480834007263
    post-fit output: 0.4297996461391449

drasmuss · Jun 20 '24

With a bit more investigation, I figured out what's going on: evaluate is only reporting the loss from the first replica and ignoring the rest. Here's an updated example demonstrating this:

import tensorflow as tf
import keras
# import tf_keras as keras

keras.utils.set_random_seed(0)

n_replicas = 2

gpus = tf.config.list_physical_devices("GPU")
tf.config.set_logical_device_configuration(
    gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=1000)] * n_replicas
)

batch_size = 12
x = tf.random.uniform((batch_size, 1), -1, 1, seed=0)
y = tf.random.uniform((batch_size, 10), -1, 1, seed=1)

with tf.distribute.MirroredStrategy().scope():
    inp = keras.Input(shape=(1,))
    layer = keras.layers.Dense(10)
    model = keras.Model(inp, layer(inp))
    model.compile(loss="mse", optimizer="sgd")

    gt = keras.losses.mean_squared_error(y, model.predict(x, batch_size=batch_size))
    eval = model.evaluate(x, y, batch_size=batch_size)
    model.fit(x, y, batch_size=batch_size, epochs=1)
    print(f"ground truth: {tf.reduce_mean(gt)}")
    print(f"loss from first replica: {tf.reduce_mean(gt[:batch_size//n_replicas])}")
    print(f"evaluate: {eval}")

This gives the following output:

  • n_replicas=1:
    ground truth: 0.43009480834007263
    loss from first replica: 0.43009480834007263
    evaluate: 0.43009480834007263
  • n_replicas=2:
    ground truth: 0.43009480834007263
    loss from first replica: 0.5054659843444824
    evaluate: 0.5054659843444824
  • n_replicas=4:
    ground truth: 0.43009480834007263
    loss from first replica: 0.5540136098861694
    evaluate: 0.5540136098861694

drasmuss · Jun 21 '24

One more piece of investigation: I believe the above issue with evaluate is mainly a display issue. The model is computing the loss value correctly in each replica, but only the value from the first replica is being returned from evaluate.

I think that the reason the fit output changes as we change the number of replicas is that the weight updates are not being synced between replicas. For example:

import tensorflow as tf
import keras
# import tf_keras as keras

keras.utils.set_random_seed(0)

n_replicas = 2

gpus = tf.config.list_physical_devices("GPU")
tf.config.set_logical_device_configuration(
    gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=1000)] * n_replicas
)

batch_size = 12
local_batch_size = batch_size // n_replicas
x = tf.random.uniform((batch_size, 1), -1, 1, seed=0)
y = tf.random.uniform((batch_size, 1), -1, 1, seed=1)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    inp = keras.Input(shape=(1,))
    layer = keras.layers.Dense(
        1, use_bias=False, kernel_initializer=keras.initializers.constant(1)
    )
    model = keras.Model(inp, layer(inp))
    model.compile(loss="mse", optimizer=keras.optimizers.SGD(learning_rate=1.0))

    model.fit(x, y, batch_size=batch_size, epochs=1)
    weights = strategy.run(lambda: layer.kernel.value).values
    print(f"per-replica weights: {[w.numpy() for w in weights]}")

We can see that each replica is maintaining independent weights:

  • n_replicas=1:
    per-replica weights: [array([[0.6677284]], dtype=float32)]
  • n_replicas=2:
    per-replica weights: [array([[0.82471704]], dtype=float32), array([[0.5107398]], dtype=float32)]
  • n_replicas=4:
    per-replica weights: [array([[0.4283927]], dtype=float32), array([[1.2210413]], dtype=float32), array([[0.92960465]], dtype=float32), array([[0.09187502]], dtype=float32)]

If we switch to tf-keras, then the weights are synced across replicas, as expected:

  • n_replicas=1:
    per-replica weights: [array([0.6677284], dtype=float32)]
  • n_replicas=2:
    per-replica weights: [array([[0.6677284]], dtype=float32), array([[0.6677284]], dtype=float32)]
  • n_replicas=4:
    per-replica weights: [array([[0.6677284]], dtype=float32), array([[0.6677284]], dtype=float32), array([[0.6677284]], dtype=float32), array([[0.6677284]], dtype=float32)]

drasmuss · Jun 21 '24

Thank you for providing a detailed investigation of the issue. We will look into it.

sachinprasadhs · Jun 21 '24

I think I was able to get past this issue, but then I ran into this bug https://github.com/keras-team/keras/issues/19246, so I can't really tell whether things are working correctly or not.

drasmuss · Jun 24 '24

Hi @drasmuss!

Based on this comment, it seems like you have been able to resolve the problem that was raised in this issue, and there is already an open issue for the outstanding bug, so I'm closing this one.

If possible, it would be great if you could add more information about how you were able to resolve the problem discussed in this issue. Thanks!

SamanehSaadat · Jun 28 '24

No, the issue is not resolved. I had been working on a fix locally, but was unable to verify it due to that other bug. This issue itself is still present: Keras gives incorrect output from fit and evaluate in a distributed context.

drasmuss · Jun 28 '24

I see! Thanks for clarifying! We're looking into the other bug that you linked here! After the resolution of #19246, please let us know if the issue persists!

SamanehSaadat · Jun 28 '24

This issue will still require a pull request (or two) of its own to fix; it definitely won't be resolved on its own after #19246 is fixed.

drasmuss · Jun 28 '24

I see! I re-opened the issue! Is it the display issue you mentioned in https://github.com/keras-team/keras/issues/19891#issuecomment-2183215799?

SamanehSaadat · Jun 28 '24

I believe it's actually two separate issues (both requiring fixes). One is the wrong value being returned from evaluate. The other is that the gradient aggregation is not happening, so the distributed replicas are not sharing information at all during training (which essentially means that you're getting no benefit from the distributed training).
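
For reference, here is a minimal standalone sketch (plain tf.distribute, not Keras internals; the data, variable names, and eval_step function are all made up for illustration) of how per-replica loss values would normally be aggregated so the reported value reflects the whole batch rather than just the first replica's shard:

import tensorflow as tf

batch_size = 12
x = tf.random.uniform((batch_size, 4), -1, 1, seed=0)
y = tf.random.uniform((batch_size, 4), -1, 1, seed=1)

strategy = tf.distribute.MirroredStrategy()  # uses whatever devices are visible
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(batch_size)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

@tf.function
def eval_step(inputs):
    xb, yb = inputs
    # per-replica MSE over the local shard only
    return tf.reduce_mean(tf.square(yb - xb))

for batch in dist_dataset:
    per_replica_loss = strategy.run(eval_step, args=(batch,))
    # Average across replicas; returning only replica 0's value here is
    # effectively what evaluate is doing in the examples above.
    loss = strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, axis=None)
    print(float(loss))

With equally sized shards, the cross-replica mean equals the full-batch mean, which is why the single-replica and tf-keras numbers above agree with the ground truth.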

drasmuss · Jun 29 '24

Thanks for clarifying! I'll look into this!

SamanehSaadat · Jun 29 '24

@drasmuss Have you looked into this any more? I've been seeing some weird behavior in my own work moving from a single-GPU to a multi-GPU setup on a different machine.

dryglicki · Sep 06 '24

I haven't had a chance to dig into it more. I believe there was an attempt to fix this here https://github.com/keras-team/keras/pull/19969, but that was then reverted, so I'm not sure what the current status is.

drasmuss · Sep 06 '24

Looks like that change was un-reverted in https://github.com/keras-team/keras/pull/20051. However, after some quick testing in Keras 3.5, the behaviour is different but still incorrect. Running the same script from https://github.com/keras-team/keras/issues/19891#issuecomment-2183215799 gives the following output:

  • n_replicas=1:
    per-replica weights: [array([[0.6677284]], dtype=float32)]
  • n_replicas=2:
    per-replica weights: [array([[0.33545685]], dtype=float32), array([[0.33545685]], dtype=float32)]
  • n_replicas=4:
    per-replica weights: [array([[-0.3290863]], dtype=float32), array([[-0.3290863]], dtype=float32), array([[-0.3290863]], dtype=float32), array([[-0.3290863]], dtype=float32)]

So the weights are at least being synced across replicas now. But changing the number of replicas changes the outcome of the training, which definitely shouldn't happen (and doesn't happen in tf-keras). So there's still something quite broken about distributed training in Keras 3.
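
For comparison, here is a minimal hand-rolled sketch of the same single SGD step using only tf.distribute primitives (an illustration under my own assumptions, not what Keras 3 does internally; the replica_grads function and variable names are made up): per-replica gradients are scaled by the global batch size and summed across replicas, so the resulting weight does not depend on the number of replicas.

import tensorflow as tf

global_batch_size = 12
x = tf.random.uniform((global_batch_size, 1), -1, 1, seed=0)
y = tf.random.uniform((global_batch_size, 1), -1, 1, seed=1)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    kernel = tf.Variable([[1.0]])  # same constant init as the repro above

dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(global_batch_size)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

@tf.function
def replica_grads(inputs):
    xb, yb = inputs
    with tf.GradientTape() as tape:
        pred = tf.matmul(xb, kernel)
        # Scale by the global batch size so that summing the per-replica
        # gradients reproduces the full-batch MSE gradient.
        loss = tf.nn.compute_average_loss(
            tf.reduce_mean(tf.square(yb - pred), axis=-1),
            global_batch_size=global_batch_size,
        )
    return tape.gradient(loss, kernel)

lr = 1.0  # same learning rate as the repro above
for batch in dist_dataset:
    per_replica = strategy.run(replica_grads, args=(batch,))
    # Sum gradients across replicas, then apply one SGD step to the
    # mirrored variable (the assignment updates every replica's copy).
    grad = strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica, axis=None)
    kernel.assign_sub(lr * grad)

print(kernel.numpy())  # same value regardless of how many replicas are used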

drasmuss · Sep 06 '24

Distributed training is broken in Keras 3. Please take a look and increase the priority. I am pretty sure variable aggregation and synchronization are not applied correctly.

On further digging, I switched to using tf.Variable directly, which seemed to fix the issue:

import tensorflow as tf
from keras.metrics import Metric

class Mean(Metric):
    def __init__(self, name="mean", dtype=None):
        super().__init__(name=name, dtype=dtype)
        # plain tf.Variable with aggregation=SUM, so per-replica updates
        # are summed across replicas under tf.distribute
        self.total = tf.Variable(
            0.0,
            dtype=self.dtype,
            name="total",
            aggregation=tf.VariableAggregation.SUM,
        )
        self.count = tf.Variable(
            0.0,
            dtype=self.dtype,
            name="count",
            aggregation=tf.VariableAggregation.SUM,
        )
        ...

The Metric variable is ignoring the self.aggregation property.
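
For reference, here is a minimal standalone illustration (plain tf.distribute, independent of Keras) of the variable behaviour the workaround relies on: with synchronization=ON_READ and aggregation=SUM (which, as far as I understand, is the combination tf-keras uses for metric state), each replica updates its own copy and a read in cross-replica context returns the sum across replicas:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Each replica keeps its own local copy; reads in cross-replica
    # context return the SUM of the per-replica values.
    total = tf.Variable(
        0.0,
        synchronization=tf.VariableSynchronization.ON_READ,
        aggregation=tf.VariableAggregation.SUM,
    )

@tf.function
def update():
    total.assign_add(1.0)

strategy.run(update)
print(total.numpy())  # equals the number of replicas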

rivershah · Oct 17 '24

@fchollet This is a simple bug, but Keras 3 distributed metrics reporting is definitely broken because of it. Consequently, learning rate scheduling, early stopping, and other training steps are also broken. The problem becomes severe when the number of accelerators becomes large.

Requesting that we please review the unit testing setup for distributed parallel and mesh parallel so that such errors get caught in CI runs before releases are pushed.

rivershah · Oct 23 '24

Thanks for looking into this @james77777778 / @fchollet, but I don't think this was fixed in #20541. When I run the test script from https://github.com/keras-team/keras/issues/19891#issuecomment-2183215799 on the latest master commit, I still see the same incorrect behaviour as in https://github.com/keras-team/keras/issues/19891#issuecomment-2334007365, where model.fit gives different results depending on the number of replicas, which shouldn't be the case.

drasmuss · Nov 25 '24

@james77777778 will take a look soon! Sorry for the slow action this week; things are quiet with the US holiday.

mattdangerw · Nov 27 '24

Hi @drasmuss

Sorry for the delay! I copied your code from https://github.com/keras-team/keras/issues/19891#issue-2365146253 into this colab, and evaluate returns 0.43009480834007263 as expected with Keras 3.7.0.

If you're getting a different result, could you share a repro colab?

SamanehSaadat · Dec 06 '24

The updated test script you need to run is the one in https://github.com/keras-team/keras/issues/19891#issuecomment-2183215799; it doesn't involve evaluate.

drasmuss · Dec 07 '24

Can confirm that this was fixed in #20609, thanks very much @james77777778!

drasmuss · Dec 09 '24

Good to hear that!

james77777778 · Dec 09 '24