python/examples/alpha_zero.py crashes with `CUDA_ERROR_NOT_INITIALIZED`

Open jthemphill opened this issue 8 months ago • 3 comments

I'm running Ubuntu 22.04 under WSL2, and I've tried running this with both tensorflow==2.14.0 and tf-nightly==2.15.0.dev20231010. I am using Python 3.11.5, which is supported by the latest version of TensorFlow.

You can install TensorFlow with GPU support via `pip install --extra-index-url https://pypi.nvidia.com tensorflow[and-cuda]`, or install the nightly version with `pip install --extra-index-url https://pypi.nvidia.com tf-nightly[and-cuda]`. Note that without the `--extra-index-url` flag the installation fails, because TensorFlow 2.14.0 depends on specific versions of tensorrt and tensorrt-lib that are not in the public PyPI repository.

I verified that my graphics card is visible to the WSL2 container:

~/open_spiel$ nvidia-smi
Tue Oct 10 22:58:49 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.120                Driver Version: 537.58       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080        On  | 00000000:01:00.0  On |                  N/A |
| 35%   52C    P0              35W / 180W |    962MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        20      G   /Xwayland                                 N/A      |
|    0   N/A  N/A        20      G   /Xwayland                                 N/A      |
|    0   N/A  N/A        23      G   /Xwayland                                 N/A      |
+---------------------------------------------------------------------------------------+

I also verified that TensorFlow itself runs correctly on my GPU: the following script prints its results, and my GPU's utilization spikes while it runs:

import tensorflow as tf

tf.debugging.set_log_device_placement(True)

# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)

print(c)

But even though TensorFlow works with my graphics card, alpha_zero.py fails:

~/open_spiel$ python open_spiel/python/examples/alpha_zero.py 
2023-10-10 22:51:42.689219: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-10-10 22:51:42.689281: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-10-10 22:51:42.690266: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-10-10 22:51:42.695684: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-10 22:51:43.360880: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-10-10 22:51:44.101936: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:44.127175: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:44.127253: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
Starting game connect_four
Writing logs and checkpoints to: /tmp/az-2023-10-10-22-51-connect_four-87c21nuk
Model type: resnet(128, 10)
actor-0 started
actor-1 started
learner started
[2023-10-10 22:51:44.141] Initializing model
evaluator-0 started
Exception caught in evaluator-0: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
evaluator-0 exiting
Process Process-3:
Exception caught in actor-0: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
actor-0 exiting
Process Process-1:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 171, in _watcher
    return fn(config=config, logger=logger, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 287, in evaluator
    model = _init_model_from_config(config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 148, in _init_model_from_config
    return model_lib.Model.build_model(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 173, in build_model
    cls._define_graph(model_type, input_shape, output_size, nn_width,
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 241, in _define_graph
    torso = cascade(observations, [
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 28, in cascade
    x = fn(x)
        ^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 50, in batch_norm_layer
    applied = bn(x, training)
              ^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/engine/base_layer_v1.py", line 838, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 690, in wrapper
    return converted_call(f, args, kwargs, options=options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 377, in converted_call
    return _call_unconverted(f, args, kwargs, options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 459, in _call_unconverted
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 597, in call
    outputs = self._fused_batch_norm(
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 990, in _fused_batch_norm
    output, mean, variance = control_flow_util.smart_cond(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/utils/control_flow_util.py", line 108, in smart_cond
    return tf.__internal__.smart_cond.smart_cond(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/framework/smart_cond.py", line 57, in smart_cond
    return cond.cond(pred, true_fn=true_fn, false_fn=false_fn,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 171, in _watcher
    return fn(config=config, logger=logger, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 268, in actor
    model = _init_model_from_config(config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/eager/context.py", line 605, in ensure_initialized
    pywrap_tfe.TFE_DeleteContextOptions(opts)
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 148, in _init_model_from_config
    return model_lib.Model.build_model(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 173, in build_model
    cls._define_graph(model_type, input_shape, output_size, nn_width,
tensorflow.python.framework.errors_impl.InternalError: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 241, in _define_graph
    torso = cascade(observations, [
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 28, in cascade
    x = fn(x)
        ^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 50, in batch_norm_layer
    applied = bn(x, training)
              ^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/engine/base_layer_v1.py", line 838, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 690, in wrapper
    return converted_call(f, args, kwargs, options=options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 377, in converted_call
    return _call_unconverted(f, args, kwargs, options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 459, in _call_unconverted
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 597, in call
    outputs = self._fused_batch_norm(
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 990, in _fused_batch_norm
    output, mean, variance = control_flow_util.smart_cond(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/utils/control_flow_util.py", line 108, in smart_cond
    return tf.__internal__.smart_cond.smart_cond(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/framework/smart_cond.py", line 57, in smart_cond
    return cond.cond(pred, true_fn=true_fn, false_fn=false_fn,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/eager/context.py", line 605, in ensure_initialized
    pywrap_tfe.TFE_DeleteContextOptions(opts)
tensorflow.python.framework.errors_impl.InternalError: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
2023-10-10 22:51:44.231365: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:44.231499: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:44.231562: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
Exception caught in actor-1: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
actor-1 exiting
Process Process-2:
Traceback (most recent call last):
  File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 171, in _watcher
    return fn(config=config, logger=logger, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 268, in actor
    model = _init_model_from_config(config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 148, in _init_model_from_config
    return model_lib.Model.build_model(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 173, in build_model
    cls._define_graph(model_type, input_shape, output_size, nn_width,
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 241, in _define_graph
    torso = cascade(observations, [
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 28, in cascade
    x = fn(x)
        ^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 50, in batch_norm_layer
    applied = bn(x, training)
              ^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/engine/base_layer_v1.py", line 838, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 690, in wrapper
    return converted_call(f, args, kwargs, options=options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 377, in converted_call
    return _call_unconverted(f, args, kwargs, options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 459, in _call_unconverted
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 597, in call
    outputs = self._fused_batch_norm(
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 990, in _fused_batch_norm
    output, mean, variance = control_flow_util.smart_cond(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/utils/control_flow_util.py", line 108, in smart_cond
    return tf.__internal__.smart_cond.smart_cond(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/framework/smart_cond.py", line 57, in smart_cond
    return cond.cond(pred, true_fn=true_fn, false_fn=false_fn,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/eager/context.py", line 605, in ensure_initialized
    pywrap_tfe.TFE_DeleteContextOptions(opts)
tensorflow.python.framework.errors_impl.InternalError: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
^C2023-10-10 22:51:45.582786: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:45.582883: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:45.582914: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2017] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2023-10-10 22:51:45.582959: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:45.583002: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1926] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6562 MB memory:  -> device: 0, name: NVIDIA GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1
[2023-10-10 22:51:45.587] learner exiting
learner exiting

<hangs at 0% GPU usage>
Caught a KeyboardInterrupt, stopping early.

AlphaZero forks actor, evaluator, and learner processes, and it is the forked subprocesses that fail with `CUDA_ERROR_NOT_INITIALIZED` (actor-0, actor-1, and evaluator-0 in the log above), so I believe this is related to https://github.com/tensorflow/tensorflow/issues/57877.

jthemphill, Oct 11 '23 06:10