
ut train fails on GPU with JIT compilation failed

Open Shubhamcl opened this issue 1 year ago • 0 comments

On Ubuntu 22, training works fine on CPU, but with --num_gpus=1 I get the error stack below.

The error appears when following the instructions for the demo.

I first thought it was a TensorFlow issue, so I ran GPU training using an example from the TensorFlow tutorials, but that worked fine.

Detected at node 'SelectV2' defined at (most recent call last):
    File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/threading.py", line 908, in _bootstrap
      self._bootstrap_inner()
    File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/threading.py", line 950, in _bootstrap_inner
      self.run()
    File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/site-packages/keras/engine/training.py", line 1000, in run_step
      outputs = model.train_step(data)
    File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/site-packages/keras/engine/training.py", line 864, in train_step
      return self.compute_metrics(x, y, y_pred, sample_weight)
    File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/site-packages/keras/engine/training.py", line 957, in compute_metrics
      self.compiled_metrics.update_state(y, y_pred, sample_weight)
    File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/site-packages/keras/engine/compile_utils.py", line 459, in update_state
      metric_obj.update_state(y_t, y_p, sample_weight=mask)
    File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/site-packages/utime/evaluation/utils.py", line 22, in wrapper
      mask = tf.where(tf.logical_and(
Node: 'SelectV2'
Detected at node 'SelectV2' defined at (most recent call last):
    File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/threading.py", line 908, in _bootstrap
      self._bootstrap_inner()
    File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/threading.py", line 950, in _bootstrap_inner
      self.run()
    File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/site-packages/keras/engine/training.py", line 1000, in run_step
      outputs = model.train_step(data)
    File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/site-packages/keras/engine/training.py", line 864, in train_step
      return self.compute_metrics(x, y, y_pred, sample_weight)
    File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/site-packages/keras/engine/training.py", line 957, in compute_metrics
      self.compiled_metrics.update_state(y, y_pred, sample_weight)
    File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/site-packages/keras/engine/compile_utils.py", line 459, in update_state
      metric_obj.update_state(y_t, y_p, sample_weight=mask)
    File "/home/shubham/anaconda3/envs/u-sleep/lib/python3.9/site-packages/utime/evaluation/utils.py", line 22, in wrapper
      mask = tf.where(tf.logical_and(
Node: 'SelectV2'
2 root error(s) found.
  (0) UNKNOWN: JIT compilation failed.
     [[{{node SelectV2}}]]
     [[div_no_nan_1/ReadVariableOp/_12]]
  (1) UNKNOWN: JIT compilation failed.
     [[{{node SelectV2}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_13068]
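Since the trace points at a `tf.where(tf.logical_and(...))` call, one way to narrow this down might be to run that op pattern on its own, outside of `ut train`. The snippet below is only a rough sketch: the function name `run_select_check`, the input values, and the range bounds (0 and 5) are made up for illustration and are not taken from `utime/evaluation/utils.py`, whose full expression is cut off in the trace. If this small check also fails on the GPU, the problem is likely in the TensorFlow/CUDA setup of the environment rather than in U-Time itself.

```python
def run_select_check(prefer_gpu=True):
    """Run a tiny tf.where(tf.logical_and(...)) op and return the result as a list.

    Returns None if TensorFlow is not installed. All values and bounds here
    are hypothetical and chosen only to exercise the SelectV2 op.
    """
    try:
        import tensorflow as tf
    except ImportError:
        print("TensorFlow is not installed in this environment")
        return None

    gpus = tf.config.list_physical_devices("GPU")
    device = "/GPU:0" if (prefer_gpu and gpus) else "/CPU:0"
    print("Visible GPUs:", gpus, "-> running on", device)

    with tf.device(device):
        # Hypothetical labels; float32 so the op is eligible for a GPU kernel.
        y_true = tf.constant([0.0, 1.0, 2.0, 5.0, 3.0])
        # Assumed range check, standing in for the truncated expression in the trace.
        in_range = tf.logical_and(tf.greater_equal(y_true, 0.0),
                                  tf.less(y_true, 5.0))
        # tf.where with three arguments is the call that produces a SelectV2 node.
        mask = tf.where(in_range, tf.ones_like(y_true), tf.zeros_like(y_true))

    return mask.numpy().tolist()


if __name__ == "__main__":
    print(run_select_check())
```

Running it once with `prefer_gpu=True` and once with `prefer_gpu=False` inside the same `u-sleep` conda environment would show whether the failure is specific to GPU execution of this op.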

Shubhamcl · Apr 21 '23 09:04