python/examples/alpha_zero.py crashes with `CUDA_ERROR_NOT_INITIALIZED`
I'm running Ubuntu 22.04 on WSL2, and I've tried running this with both `tensorflow==2.14.0` and `tf-nightly==2.15.0.dev20231010`. I am using Python 3.11.5, which is supported by the latest version of TensorFlow.
You can correctly install TensorFlow with GPU support via `pip install --extra-index-url https://pypi.nvidia.com tensorflow[and-cuda]`, or install the nightly version with `pip install --extra-index-url https://pypi.nvidia.com tf-nightly[and-cuda]`. Note that, without the `--extra-index-url` flag, the installation will fail, as TensorFlow 2.14.0 depends on specific versions of `tensorrt` and `tensorrt-lib` which are not in the public PyPI repository.
I verified that my graphics card is visible to the WSL2 container:
~/open_spiel$ nvidia-smi
Tue Oct 10 22:58:49 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.120 Driver Version: 537.58 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1080 On | 00000000:01:00.0 On | N/A |
| 35% 52C P0 35W / 180W | 962MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 20 G /Xwayland N/A |
| 0 N/A N/A 20 G /Xwayland N/A |
| 0 N/A N/A 23 G /Xwayland N/A |
+---------------------------------------------------------------------------------------+
I also verified that TensorFlow itself runs correctly on my GPU: the following script prints results, and my GPU's utilization spikes while it runs:
```python
import tensorflow as tf
tf.debugging.set_log_device_placement(True)
# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
print(c)
```
But even though TensorFlow is working with my graphics card, `alpha_zero.py` fails:
~/open_spiel$ python open_spiel/python/examples/alpha_zero.py
2023-10-10 22:51:42.689219: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-10-10 22:51:42.689281: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-10-10 22:51:42.690266: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-10-10 22:51:42.695684: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-10 22:51:43.360880: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-10-10 22:51:44.101936: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:44.127175: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:44.127253: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
Starting game connect_four
Writing logs and checkpoints to: /tmp/az-2023-10-10-22-51-connect_four-87c21nuk
Model type: resnet(128, 10)
actor-0 started
actor-1 started
learner started
[2023-10-10 22:51:44.141] Initializing model
evaluator-0 started
Exception caught in evaluator-0: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
evaluator-0 exiting
Process Process-3:
Exception caught in actor-0: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
actor-0 exiting
Process Process-1:
Traceback (most recent call last):
Traceback (most recent call last):
File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 171, in _watcher
return fn(config=config, logger=logger, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 287, in evaluator
model = _init_model_from_config(config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 148, in _init_model_from_config
return model_lib.Model.build_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 173, in build_model
cls._define_graph(model_type, input_shape, output_size, nn_width,
File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 241, in _define_graph
torso = cascade(observations, [
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 28, in cascade
x = fn(x)
^^^^^
File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 50, in batch_norm_layer
applied = bn(x, training)
^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/engine/base_layer_v1.py", line 838, in __call__
outputs = call_fn(cast_inputs, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 690, in wrapper
return converted_call(f, args, kwargs, options=options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 377, in converted_call
return _call_unconverted(f, args, kwargs, options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 459, in _call_unconverted
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 597, in call
outputs = self._fused_batch_norm(
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 990, in _fused_batch_norm
output, mean, variance = control_flow_util.smart_cond(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/utils/control_flow_util.py", line 108, in smart_cond
return tf.__internal__.smart_cond.smart_cond(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/framework/smart_cond.py", line 57, in smart_cond
return cond.cond(pred, true_fn=true_fn, false_fn=false_fn,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 171, in _watcher
return fn(config=config, logger=logger, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 268, in actor
model = _init_model_from_config(config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/eager/context.py", line 605, in ensure_initialized
pywrap_tfe.TFE_DeleteContextOptions(opts)
File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 148, in _init_model_from_config
return model_lib.Model.build_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 173, in build_model
cls._define_graph(model_type, input_shape, output_size, nn_width,
tensorflow.python.framework.errors_impl.InternalError: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 241, in _define_graph
torso = cascade(observations, [
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 28, in cascade
x = fn(x)
^^^^^
File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 50, in batch_norm_layer
applied = bn(x, training)
^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/engine/base_layer_v1.py", line 838, in __call__
outputs = call_fn(cast_inputs, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 690, in wrapper
return converted_call(f, args, kwargs, options=options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 377, in converted_call
return _call_unconverted(f, args, kwargs, options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 459, in _call_unconverted
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 597, in call
outputs = self._fused_batch_norm(
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 990, in _fused_batch_norm
output, mean, variance = control_flow_util.smart_cond(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/utils/control_flow_util.py", line 108, in smart_cond
return tf.__internal__.smart_cond.smart_cond(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/framework/smart_cond.py", line 57, in smart_cond
return cond.cond(pred, true_fn=true_fn, false_fn=false_fn,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/eager/context.py", line 605, in ensure_initialized
pywrap_tfe.TFE_DeleteContextOptions(opts)
tensorflow.python.framework.errors_impl.InternalError: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
2023-10-10 22:51:44.231365: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:44.231499: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:44.231562: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
Exception caught in actor-1: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
actor-1 exiting
Process Process-2:
Traceback (most recent call last):
File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 171, in _watcher
return fn(config=config, logger=logger, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 268, in actor
model = _init_model_from_config(config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 148, in _init_model_from_config
return model_lib.Model.build_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 173, in build_model
cls._define_graph(model_type, input_shape, output_size, nn_width,
File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 241, in _define_graph
torso = cascade(observations, [
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 28, in cascade
x = fn(x)
^^^^^
File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 50, in batch_norm_layer
applied = bn(x, training)
^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/engine/base_layer_v1.py", line 838, in __call__
outputs = call_fn(cast_inputs, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 690, in wrapper
return converted_call(f, args, kwargs, options=options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 377, in converted_call
return _call_unconverted(f, args, kwargs, options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 459, in _call_unconverted
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 597, in call
outputs = self._fused_batch_norm(
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 990, in _fused_batch_norm
output, mean, variance = control_flow_util.smart_cond(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/utils/control_flow_util.py", line 108, in smart_cond
return tf.__internal__.smart_cond.smart_cond(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/framework/smart_cond.py", line 57, in smart_cond
return cond.cond(pred, true_fn=true_fn, false_fn=false_fn,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/eager/context.py", line 605, in ensure_initialized
pywrap_tfe.TFE_DeleteContextOptions(opts)
tensorflow.python.framework.errors_impl.InternalError: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
^C2023-10-10 22:51:45.582786: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:45.582883: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:45.582914: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2017] Could not identify NUMA node of platform GPU id 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-10-10 22:51:45.582959: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:45.583002: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1926] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6562 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1
[2023-10-10 22:51:45.587] learner exiting
learner exiting
<hangs at 0% GPU usage>
Caught a KeyboardInterrupt, stopping early.
AlphaZero forks actor, evaluator, and learner processes, and the error is raised inside these subprocesses (the actors and the evaluator in the log above), so I believe this is related to https://github.com/tensorflow/tensorflow/issues/57877.
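If the fork hypothesis is right, one thing that might be worth trying is starting the workers with the `spawn` start method, so that each child begins from a fresh interpreter instead of inheriting the parent's state. Below is a minimal standard-library sketch of that idea. The names `fresh_context`, `_worker`, and `run_in_fresh_process` are hypothetical and not part of OpenSpiel, and I have not verified that this change fixes `alpha_zero.py`.

```python
import multiprocessing as mp
import os


def fresh_context(preferred="spawn"):
    """Return a multiprocessing context that starts children without fork."""
    if preferred not in mp.get_all_start_methods():
        raise RuntimeError(f"start method {preferred!r} is not available")
    return mp.get_context(preferred)


def _worker(queue):
    # Hypothetical worker body. Under the assumption above, any TensorFlow
    # import and model construction would happen here, inside the child.
    queue.put(os.getpid())


def run_in_fresh_process(target=_worker, timeout=60):
    """Run `target` in a spawned child and return the value it reports."""
    ctx = fresh_context()
    queue = ctx.Queue()
    proc = ctx.Process(target=target, args=(queue,))
    proc.start()
    # Read the result before joining so the child is never blocked on the queue.
    result = queue.get(timeout=timeout)
    proc.join()
    return result
```

Calling `run_in_fresh_process()` from a script should return the child's PID, which differs from the parent's. The call needs to sit under an `if __name__ == "__main__":` guard, because `spawn` re-imports the main module in the child.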