Intermittent test case failure on TensorFlow GPU env
In keras/src/trainers/data_adapters/generator_data_adapter_test.py, I found an intermittent test failure in a TensorFlow GPU environment. It is related to the test_basic_flow method of this test case, so I put together the following repro script on my local machine.
import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import math

import jax
import numpy as np
import tensorflow as tf
import torch
from absl.testing import parameterized
from jax import numpy as jnp

from keras.src import backend
from keras.src import testing
from keras.src.trainers.data_adapters import generator_data_adapter


def example_generator(x, y, sample_weight=None, batch_size=32):
    def make():
        for i in range(math.ceil(len(x) / batch_size)):
            low = i * batch_size
            high = min(low + batch_size, len(x))
            batch_x = x[low:high]
            batch_y = y[low:high]
            if sample_weight is not None:
                yield batch_x, batch_y, sample_weight[low:high]
            else:
                yield batch_x, batch_y

    return make


class TestCase(testing.TestCase, parameterized.TestCase):
    def test_basic_flow(self, use_sample_weight, generator_type):
        x = np.random.random((34, 4)).astype("float32")
        y = np.array([[i, i] for i in range(34)], dtype="float32")
        sw = np.random.random((34,)).astype("float32")
        if generator_type == "tf":
            x, y, sw = tf.constant(x), tf.constant(y), tf.constant(sw)
        elif generator_type == "jax":
            x, y, sw = jnp.array(x), jnp.array(y), jnp.array(sw)
        elif generator_type == "torch":
            x, y, sw = (
                torch.as_tensor(x),
                torch.as_tensor(y),
                torch.as_tensor(sw),
            )
        if not use_sample_weight:
            sw = None
        make_generator = example_generator(
            x,
            y,
            sample_weight=sw,
            batch_size=16,
        )
        adapter = generator_data_adapter.GeneratorDataAdapter(make_generator())
        if backend.backend() == "numpy":
            it = adapter.get_numpy_iterator()
            expected_class = np.ndarray
        elif backend.backend() == "tensorflow":
            it = adapter.get_tf_dataset()
            expected_class = tf.Tensor
        elif backend.backend() == "jax":
            it = adapter.get_jax_iterator()
            expected_class = (
                jax.Array if generator_type == "jax" else np.ndarray
            )
        elif backend.backend() == "torch":
            it = adapter.get_torch_dataloader()
            expected_class = torch.Tensor
        sample_order = []
        for i, batch in enumerate(it):
            if use_sample_weight:
                self.assertEqual(len(batch), 3)
                bx, by, bsw = batch
            else:
                self.assertEqual(len(batch), 2)
                bx, by = batch
            self.assertIsInstance(bx, expected_class)
            self.assertIsInstance(by, expected_class)
            self.assertEqual(bx.dtype, by.dtype)
            self.assertContainsExactSubsequence(str(bx.dtype), "float32")
            if i < 2:
                self.assertEqual(bx.shape, (16, 4))
                self.assertEqual(by.shape, (16, 2))
            else:
                self.assertEqual(bx.shape, (2, 4))
                self.assertEqual(by.shape, (2, 2))
            if use_sample_weight:
                self.assertIsInstance(bsw, expected_class)
            for i in range(by.shape[0]):
                sample_order.append(by[i, 0])
        self.assertAllClose(sample_order, list(range(34)))
        print(f"*" * 50)


for _ in range(1000):
    TestCase().test_basic_flow(True, 'tf')
print("All passed!")
And I got the error below. Most runs succeeded, but some failed.
InvalidArgumentError Traceback (most recent call last)
Cell In[2], line 85
81 print(f"*" * 50)
84 for _ in range(1000):
---> 85 TestCase().test_basic_flow(True, 'tf')
86 print("All passed!")
Cell In[2], line 78, in TestCase.test_basic_flow(self, use_sample_weight, generator_type)
76 self.assertIsInstance(bsw, expected_class)
77 for i in range(by.shape[0]):
---> 78 sample_order.append(by[i, 0])
79 self.assertAllClose(sample_order, list(range(34)))
81 print(f"*" * 50)
File ~/miniconda3/envs/keras/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py:153, in filter_traceback.<locals>.error_handler(*args, **kwargs)
151 except Exception as e:
152 filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153 raise e.with_traceback(filtered_tb) from None
154 finally:
155 del filtered_tb
File ~/miniconda3/envs/keras/lib/python3.10/site-packages/tensorflow/python/framework/ops.py:5983, in raise_from_not_ok_status(e, name)
5981 def raise_from_not_ok_status(e, name) -> NoReturn:
5982 e.message += (" name: " + str(name if name is not None else ""))
-> 5983 raise core._status_to_exception(e) from None
InvalidArgumentError: {{function_node __wrapped__StridedSlice_device_/job:localhost/replica:0/task:0/device:GPU:0}} Expected begin, end, and strides to be 1D equal size tensors, but got shapes [2], [1], and [1] instead. [Op:StridedSlice] name: strided_slice/
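For what it's worth, the failing call in the loop above is sample_order.append(by[i, 0]): indexing a tf.Tensor with Python subscripts is implemented with tf.strided_slice, which is the Op:StridedSlice named in the error. Below is a minimal sketch of that equivalence, plus a hypothetical host-side workaround for the assertion loop (it only sidesteps the TF op and is not a root-cause fix):

import numpy as np
import tensorflow as tf

by = tf.constant([[0.0, 0.0], [1.0, 1.0]], dtype="float32")

# Python indexing of a tf.Tensor runs tf.strided_slice on the active device;
# this is the Op:StridedSlice that appears in the error message above.
value_on_device = by[0, 0]

# Hypothetical workaround for the test loop only: pull the batch to the host
# as a NumPy array first, so the indexing never touches TensorFlow ops.
value_on_host = np.asarray(by)[0, 0]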
So, can anyone confirm whether this is a bug or not?
I tried it in a Colab GPU runtime with the code you provided; all 1000 runs completed and I got the "All passed!" message.
Attaching the Gist here for reference.
@sachinprasadhs When I install Keras from source (the GitHub master branch), the issue reproduces. Can you check the Colab notebook below?
https://colab.sandbox.google.com/gist/sachinprasadhs/e73e2c7428f44ccc0d2ef486bed047c6/20027.ipynb
Hi @shashaka, could we try to get a pared-down Colab of this issue? Please remove anything not relevant to TensorFlow and to this reproduction. Please also add keras.config.disable_traceback_filtering() so we can get a full error trace.
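For reference, a minimal way to apply that setting at the top of the repro script (assuming the Keras 3 keras.config API mentioned above):

import os

# Select the TensorFlow backend before importing keras, as in the repro script.
os.environ["KERAS_BACKEND"] = "tensorflow"

import keras

# Disable Keras traceback filtering so the failing frame inside Keras is shown.
keras.config.disable_traceback_filtering()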
Here is a simplified gist (it shows the error).
(With traceback filtering disabled.)
It happens with both GPU and CPU, but only some of the time.
PS: This might be obvious, but without the test environment there seems to be no error (gist).
I also updated my gist based on @ghsanti's one. It seems that this error occurs when slicing the data inside the data generator.
https://colab.research.google.com/gist/shashaka/71e1e97d1459498c0bcca1fb4fc084d8/20027.ipynb
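For context, the slicing referred to here is the x[low:high] indexing of the tf.constant inputs inside example_generator, which also goes through StridedSlice. Below is a stripped-down sketch of that pattern next to a hypothetical NumPy-only variant (an illustration under that assumption, not a confirmed fix):

import numpy as np
import tensorflow as tf

x_tf = tf.constant(np.random.random((34, 4)).astype("float32"))
x_np = np.random.random((34, 4)).astype("float32")


def tf_slicing_generator(batch_size=16):
    # Mirrors example_generator above: each batch slice of a tf.Tensor issues
    # a StridedSlice op on the active device.
    for low in range(0, 34, batch_size):
        yield x_tf[low:low + batch_size]


def numpy_slicing_generator(batch_size=16):
    # Hypothetical variant for comparison: slice plain NumPy arrays so the
    # generator itself performs no TensorFlow ops at all.
    for low in range(0, 34, batch_size):
        yield x_np[low:low + batch_size]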
Thank you @shashaka and @ghsanti. Unless this shows up in our own testing environment (internally or in GitHub CI), we are unlikely to have the bandwidth to dive deeper into what is happening, since it might be environment specific. If you take a closer look and find the code pointer responsible, we'd be happy to support any PRs. Leaving open for now!