tf-yarn Running on GPU: CUDA_ERROR_NO_DEVICE

Hello

I have a properly configured GPU node with NVIDIA/CUDA drivers and the CUDA toolkit installed. nvidia-smi and CUDA samples such as deviceQuery and bandwidthTest run successfully.

TensorFlow, executed locally, detects the GPU device with python -c "import tensorflow as tf;tf.config.list_physical_devices('GPU')"

As described here, the YARN node label "gpu" exists and is associated with the node above.

For test purposes I modified keras_example.py as follows:

task_specs={
    "chief": TaskSpec(memory="2 GiB", vcores=4),
    "worker": TaskSpec(memory="2 GiB", vcores=4, instances=1, label=NodeLabel.GPU),
    "ps": TaskSpec(memory="2 GiB", vcores=4, instances=2),
    "evaluator": TaskSpec(memory="2 GiB", vcores=1)
},
queue="ml-gpu"

The worker.log shows that no GPU has been detected:

+ ./venv.pex -m tf_yarn.tasks._independent_workers_task
INFO:tf_yarn._task_commons: Python 3.6.8 (default, Sep 26 2019, 11:57:09)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]
INFO:tf_yarn._task_commons: Skein 0.8.0
INFO:tf_yarn._task_commons: TensorFlow v2.2.0-rc4-8-g2b96f3662b 2.2.0
I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: <hostname>
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: <hostname-removed>
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 440.56.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 440.56.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 440.56.0
I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2095074999 Hz

Neither nvidia-smi nor the YARN RM UI shows any process on the GPU, so the training runs on the CPU. Any ideas or hints on how to debug this further and solve the issue?

Many thanks in advance!

Oct 20 '20 09:10 cguegi

Does it use the GPU when you run your training code locally?

You can also try to set CUDA_VISIBLE_DEVICES to see if that changes anything.

run_on_yarn(
    # other arguments unchanged; note there is no trailing space in the variable name
    env={"CUDA_VISIBLE_DEVICES": "0"}
)

I would also execute list_physical_devices somewhere in your experiment function (or run it directly with the interactive mode: https://github.com/criteo/cluster-pack/tree/master/examples/interactive-mode)

print(tf.config.list_physical_devices('GPU'))
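Alongside list_physical_devices, it may help to see what the container process itself can reach. The sketch below uses only the standard library and assumes the GPU is exposed through the usual /dev/nvidia* device nodes of the NVIDIA Linux driver. The names gpu_container_report and try_open are made up for illustration and are not part of tf-yarn or cluster-pack. It only reports facts; it does not tell you why a device is missing or blocked.

```python
import glob
import os


def try_open(path):
    """Try to open a device node read/write and report the outcome as text."""
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError as exc:
        # Covers missing nodes, permission errors and device isolation alike.
        return f"failed: {exc.strerror}"
    os.close(fd)
    return "ok"


def gpu_container_report():
    """Collect GPU-related facts visible to the current process."""
    return {
        # Environment variables that influence CUDA library and device lookup.
        "env": {
            name: os.environ.get(name)
            for name in ("CUDA_VISIBLE_DEVICES", "LD_LIBRARY_PATH", "PATH")
        },
        # NVIDIA device nodes (directories skipped) and whether they can be opened.
        "devices": {
            path: try_open(path)
            for path in sorted(glob.glob("/dev/nvidia*"))
            if not os.path.isdir(path)
        },
    }


if __name__ == "__main__":
    report = gpu_container_report()
    for name, value in report["env"].items():
        print(f"{name}={value}")
    if not report["devices"]:
        print("no /dev/nvidia* device nodes visible to this process")
    for path, status in report["devices"].items():
        print(f"{path}: {status}")
```

Calling gpu_container_report() at the top of the experiment function and printing the result puts the information into the YARN container log, where it can be compared with the same call made from a shell on the node.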

Oct 23 '20 17:10 fhoering

Hello,

Could it be a mismatch between the CUDA and TensorFlow versions? For example, CUDA 9.0 is installed on the node while the TensorFlow build requires CUDA 10.0.

# some CUDA configured
Successfully opened dynamic library libcuda.so.1
# however, no compatible device found
E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

Nov 02 '20 09:11 akimboyko

@fhoering as suggested, I've used the interactive mode, unfortunately without success. Below is the code executed on a Hadoop cluster with TensorFlow 2.3.1.

Software installed on GPU Datanode:

Cuda version: 10.2.89
Nvidia driver version: 440.56
cuDNN version: 8.0.3.33
Python: 3.6.8
GCC version: 4.8.5

import tensorflow as tf
import os

def compute_intersection():
  print("TF version: " + tf.__version__)
  lib_dir = os.environ['LD_LIBRARY_PATH']
  print(f'lib directory: {lib_dir}')
  print('Cuda device: ', os.environ['CUDA_VISIBLE_DEVICES'])
  print("GPU: ", tf.config.list_physical_devices('GPU'))

import cluster_pack
package_path, _ = cluster_pack.upload_env()

from cluster_pack.skein import skein_config_builder
skein_config = skein_config_builder.build_with_func(
    func=compute_intersection,
    package_path=package_path
)

import skein
with skein.Client(log_level="DEBUG") as client:
    service = skein.Service(
        resources=skein.Resources("1 GiB", 1),
        files=skein_config.files,
        script=skein_config.script,
        env={
            "LD_LIBRARY_PATH": "/usr/local/cuda-10.2/lib64/:/usr/local/cuda-10.2/extras/CUPTI/lib64:/usr/local/cuda-10.2/targets/x86_64-linux/lib:/usr/lib64",
            "TF_CPP_MIN_LOG_LEVEL": "0",
            "CUDA_VISIBLE_DEVICES": "0",
            "PATH": "/usr/local/cuda-10.2/bin:$PATH"
        }
    )
    master = skein.Master(log_level="DEBUG")
    spec = skein.ApplicationSpec(
        services={"service": service},
        queue="ml-gpu",
        node_label="gpu",
        name="cuda-detection",
        master=master
    )
    app_id = client.submit(spec)

Yarn container log:

Container: container_e54_1603984700251_0019_01_000002 on <hostname>

LogAggregationType: AGGREGATED

============================================================================

LogType:service.log

LogLastModifiedTime:Wed Nov 04 15:26:19 +0100 2020

LogLength:1392

LogContents:

running ./venv.pex -m cluster_pack.skein._execute_fun function_7790ceef-e14d-4f33-a0ce-9fc46b3a5f08.dat INFO ..
2020-11-04 15:26:15.567171: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-11-04 15:26:19.601619: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-11-04 15:26:19.603672: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-11-04 15:26:19.603722: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: <hostname>
2020-11-04 15:26:19.603733: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: <hostname>
2020-11-04 15:26:19.603814: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 440.56.0
2020-11-04 15:26:19.603855: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 440.56.0
2020-11-04 15:26:19.603867: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 440.56.0

TF version: 2.3.1
lib directory: /usr/local/cuda-10.2/lib64/:/usr/local/cuda-10.2/extras/CUPTI/lib64:/usr/local/cuda-10.2/targets/x86_64-linux/lib:/usr/lib64
Cuda device:  0
GPU:  []

End of LogType:service.log
***************************************************************************

Nov 04 '20 14:11 cguegi

Hi @akimboyko, I don't know if this is a compatibility issue between Cuda and Tensorflow. Cuda 10.2 is not mentioned in https://www.tensorflow.org/install/source#gpu, however, I've read that Cuda 10.2 is compatible with 10.1.
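One way to see which CUDA libraries the TensorFlow binary actually loaded is to pull them out of the container log. The sketch below is only an illustration: the helper names are made up, and it assumes the "Successfully opened dynamic library" wording seen in the logs in this thread. The suffix after ".so." is the library's soname version, which is not necessarily the full toolkit version.

```python
import re

# Matches the dso_loader lines that TensorFlow writes for each library it opens.
LIB_PATTERN = re.compile(r"Successfully opened dynamic library (\S+)")

# Two lines taken from the YARN container log in this thread.
SAMPLE_LOG = """\
2020-11-04 15:26:15.567171: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-11-04 15:26:19.601619: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
"""


def loaded_libraries(log_text):
    """Return the library file names TensorFlow reported opening, in log order."""
    return LIB_PATTERN.findall(log_text)


def library_versions(log_text):
    """Map each library base name to the version suffix after '.so.'."""
    versions = {}
    for lib in loaded_libraries(log_text):
        name, _, version = lib.partition(".so.")
        versions[name] = version
    return versions


if __name__ == "__main__":
    print(library_versions(SAMPLE_LOG))
```

For the two sample lines this yields libcudart at 10.1 and libcuda at 1, i.e. the CUDA runtime library was found and opened in the container before cuInit failed.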

Nov 04 '20 15:11 cguegi

I downloaded the pex file from HDFS and executed it on the Datanode.

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64/
/tmp/venv.pex -c "import tensorflow as tf;tf.config.list_physical_devices('GPU')"

It works; the output is below. Why doesn't it work with skein or with tf-yarn?

2020-11-04 16:47:19.215890: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-11-04 16:47:19.220831: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-04 16:47:19.221319: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:02:01.0 name: GRID T4-8C computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 298.08GiB/s
2020-11-04 16:47:19.221559: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-11-04 16:47:19.223706: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-11-04 16:47:19.225975: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-11-04 16:47:19.226300: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-11-04 16:47:19.228811: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-11-04 16:47:19.230039: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-11-04 16:47:19.230248: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-11-04 16:47:19.230348: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-04 16:47:19.230853: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-04 16:47:19.231250: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
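Since the same pex detects the GPU from a shell on the Datanode but not inside the YARN container, one possible next step is to compare the two environments. The sketch below is a minimal illustration with made-up helper names (snapshot_env, diff_env). It assumes the relevant difference would show up in a handful of environment variables, which may not be the case.

```python
import json
import os

# Variables assumed to matter for CUDA discovery; extend as needed.
KEYS = ("LD_LIBRARY_PATH", "CUDA_VISIBLE_DEVICES", "PATH")


def snapshot_env(keys=KEYS):
    """Return the selected environment variables as a JSON string."""
    return json.dumps({key: os.environ.get(key) for key in keys}, sort_keys=True)


def diff_env(local_json, container_json):
    """Return {name: (local_value, container_value)} for variables that differ."""
    local = json.loads(local_json)
    container = json.loads(container_json)
    return {
        key: (local.get(key), container.get(key))
        for key in sorted(set(local) | set(container))
        if local.get(key) != container.get(key)
    }


if __name__ == "__main__":
    # Print this once from the shell on the Datanode and once from the
    # function submitted through skein, then pass both strings to diff_env().
    print(snapshot_env())
```

If the two snapshots are identical, the environment variables are probably not the cause, and the difference lies elsewhere in how the container is launched.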

Nov 04 '20 15:11 cguegi

@cguegi Was this the issue you had, or is the issue here still a different one? https://github.com/jcrist/skein/pull/224 (It should only apply to a Hadoop 3 cluster.)

Nov 23 '20 08:11 fhoering
