Intermittent test case failure on TensorFlow GPU env
In keras/src/trainers/data_adapters/generator_data_adapter_test.py, I found an intermittent test failure in a TensorFlow GPU environment. It is related to the test_basic_flow method of this test case, so I put together the following repro script on my local machine.
import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import math

import jax
import numpy as np
import tensorflow as tf
import torch
from absl.testing import parameterized
from jax import numpy as jnp

from keras.src import backend
from keras.src import testing
from keras.src.trainers.data_adapters import generator_data_adapter


def example_generator(x, y, sample_weight=None, batch_size=32):
    def make():
        for i in range(math.ceil(len(x) / batch_size)):
            low = i * batch_size
            high = min(low + batch_size, len(x))
            batch_x = x[low:high]
            batch_y = y[low:high]
            if sample_weight is not None:
                yield batch_x, batch_y, sample_weight[low:high]
            else:
                yield batch_x, batch_y

    return make


class TestCase(testing.TestCase, parameterized.TestCase):
    def test_basic_flow(self, use_sample_weight, generator_type):
        x = np.random.random((34, 4)).astype("float32")
        y = np.array([[i, i] for i in range(34)], dtype="float32")
        sw = np.random.random((34,)).astype("float32")
        if generator_type == "tf":
            x, y, sw = tf.constant(x), tf.constant(y), tf.constant(sw)
        elif generator_type == "jax":
            x, y, sw = jnp.array(x), jnp.array(y), jnp.array(sw)
        elif generator_type == "torch":
            x, y, sw = (
                torch.as_tensor(x),
                torch.as_tensor(y),
                torch.as_tensor(sw),
            )
        if not use_sample_weight:
            sw = None
        make_generator = example_generator(
            x,
            y,
            sample_weight=sw,
            batch_size=16,
        )
        adapter = generator_data_adapter.GeneratorDataAdapter(make_generator())
        if backend.backend() == "numpy":
            it = adapter.get_numpy_iterator()
            expected_class = np.ndarray
        elif backend.backend() == "tensorflow":
            it = adapter.get_tf_dataset()
            expected_class = tf.Tensor
        elif backend.backend() == "jax":
            it = adapter.get_jax_iterator()
            expected_class = (
                jax.Array if generator_type == "jax" else np.ndarray
            )
        elif backend.backend() == "torch":
            it = adapter.get_torch_dataloader()
            expected_class = torch.Tensor
        sample_order = []
        for i, batch in enumerate(it):
            if use_sample_weight:
                self.assertEqual(len(batch), 3)
                bx, by, bsw = batch
            else:
                self.assertEqual(len(batch), 2)
                bx, by = batch
            self.assertIsInstance(bx, expected_class)
            self.assertIsInstance(by, expected_class)
            self.assertEqual(bx.dtype, by.dtype)
            self.assertContainsExactSubsequence(str(bx.dtype), "float32")
            if i < 2:
                self.assertEqual(bx.shape, (16, 4))
                self.assertEqual(by.shape, (16, 2))
            else:
                self.assertEqual(bx.shape, (2, 4))
                self.assertEqual(by.shape, (2, 2))
            if use_sample_weight:
                self.assertIsInstance(bsw, expected_class)
            for i in range(by.shape[0]):
                sample_order.append(by[i, 0])
        self.assertAllClose(sample_order, list(range(34)))
        print(f"*" * 50)


for _ in range(1000):
    TestCase().test_basic_flow(True, 'tf')
print("All passed!")
And I got the error below. Most runs succeeded, but some failed.
InvalidArgumentError Traceback (most recent call last)
Cell In[2], line 85
81 print(f"*" * 50)
84 for _ in range(1000):
---> 85 TestCase().test_basic_flow(True, 'tf')
86 print("All passed!")
Cell In[2], line 78, in TestCase.test_basic_flow(self, use_sample_weight, generator_type)
76 self.assertIsInstance(bsw, expected_class)
77 for i in range(by.shape[0]):
---> 78 sample_order.append(by[i, 0])
79 self.assertAllClose(sample_order, list(range(34)))
81 print(f"*" * 50)
File ~/miniconda3/envs/keras/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py:153, in filter_traceback.<locals>.error_handler(*args, **kwargs)
151 except Exception as e:
152 filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153 raise e.with_traceback(filtered_tb) from None
154 finally:
155 del filtered_tb
File ~/miniconda3/envs/keras/lib/python3.10/site-packages/tensorflow/python/framework/ops.py:5983, in raise_from_not_ok_status(e, name)
5981 def raise_from_not_ok_status(e, name) -> NoReturn:
5982 e.message += (" name: " + str(name if name is not None else ""))
-> 5983 raise core._status_to_exception(e) from None
InvalidArgumentError: {{function_node __wrapped__StridedSlice_device_/job:localhost/replica:0/task:0/device:GPU:0}} Expected begin, end, and strides to be 1D equal size tensors, but got shapes [2], [1], and [1] instead. [Op:StridedSlice] name: strided_slice/
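For what it's worth, the failing call in the loop above is sample_order.append(by[i, 0]): indexing a tf.Tensor with Python subscripts is implemented with tf.strided_slice, which is the Op:StridedSlice named in the error. Below is a minimal sketch of that equivalence, plus a hypothetical host-side workaround for the assertion loop (it only sidesteps the TF op and is not a root-cause fix):

import numpy as np
import tensorflow as tf

by = tf.constant([[0.0, 0.0], [1.0, 1.0]], dtype="float32")

# Python indexing of a tf.Tensor runs tf.strided_slice on the active device;
# this is the Op:StridedSlice that appears in the error message above.
value_on_device = by[0, 0]

# Hypothetical workaround for the test loop only: pull the batch to the host
# as a NumPy array first, so the indexing never touches TensorFlow ops.
value_on_host = np.asarray(by)[0, 0]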
So, can anyone confirm whether this is a bug or not?
I tried it in a Colab GPU runtime with the code you provided; all 1000 runs completed and I got the "All passed!" message.
Attaching the Gist here for reference.
@sachinprasadhs When I install Keras from source (the GitHub master branch), the issue reproduces. Can you check the Colab notebook below?
https://colab.sandbox.google.com/gist/sachinprasadhs/e73e2c7428f44ccc0d2ef486bed047c6/20027.ipynb
Hi @shashaka, could we try to get a pared-down Colab of this issue? Please remove anything not relevant to TensorFlow and to this reproduction. Please also add keras.config.disable_traceback_filtering() so we can get a full error trace.
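For reference, a minimal way to apply that setting at the top of the repro script (assuming the Keras 3 keras.config API mentioned above):

import os

# Select the TensorFlow backend before importing keras, as in the repro script.
os.environ["KERAS_BACKEND"] = "tensorflow"

import keras

# Disable Keras traceback filtering so the failing frame inside Keras is shown.
keras.config.disable_traceback_filtering()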
Here is a simplified gist (it shows the error).
(With traceback filtering disabled.)
It happens with both GPU and CPU, but only some of the time.
PS: This might be obvious, but without the test environment there seems to be no error (gist).
I also updated my gist based on @ghsanti's one. It seems that this error occurs when slicing the data inside the data generator.
https://colab.research.google.com/gist/shashaka/71e1e97d1459498c0bcca1fb4fc084d8/20027.ipynb
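For context, the slicing referred to here is the x[low:high] indexing of the tf.constant inputs inside example_generator, which also goes through StridedSlice. Below is a stripped-down sketch of that pattern next to a hypothetical NumPy-only variant (an illustration under that assumption, not a confirmed fix):

import numpy as np
import tensorflow as tf

x_tf = tf.constant(np.random.random((34, 4)).astype("float32"))
x_np = np.random.random((34, 4)).astype("float32")


def tf_slicing_generator(batch_size=16):
    # Mirrors example_generator above: each batch slice of a tf.Tensor issues
    # a StridedSlice op on the active device.
    for low in range(0, 34, batch_size):
        yield x_tf[low:low + batch_size]


def numpy_slicing_generator(batch_size=16):
    # Hypothetical variant for comparison: slice plain NumPy arrays so the
    # generator itself performs no TensorFlow ops at all.
    for low in range(0, 34, batch_size):
        yield x_np[low:low + batch_size]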
Thank you @shashaka and @ghsanti. Unless this shows up in our own testing environment (internally or in GitHub CI), we are unlikely to have the bandwidth to dive deeper into what is happening, since it might be environment specific. If you take a closer look and find the code pointer responsible, we'd be happy to support any PRs. Leaving open for now!