
Training doesn't utilize GPU

Open monocongo opened this issue 4 years ago • 6 comments

I am performing training of the model using a custom dataset on an AWS EC2 instance (p2.xlarge) with an NVIDIA Tesla K80 GPU. After launching the training script I see full CPU utilization but no utilization of the GPU, as measured by the output of $ watch -n0.1 nvidia-smi.

Sun Aug 11 23:04:01 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40       Driver Version: 430.40       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   55C    P0    58W / 149W |     67MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4546      C   python3                                       56MiB |
+-----------------------------------------------------------------------------+

The EC2 instance is Ubuntu 18.04 with nvidia-driver-430 installed.
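
One quick sanity check is to ask TensorFlow 1.x directly which devices it can see; if the K80 is missing from that list, the problem is in the TensorFlow/CUDA setup rather than in the training code. A minimal check in Python:

# TF 1.x check: list the devices TensorFlow can actually use
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.test.is_gpu_available())                         # True only if a usable CUDA GPU is found
print([d.name for d in device_lib.list_local_devices()])  # should include '/device:GPU:0'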

The config.json file:

{
    "model" : {
        "min_input_size":       288,
        "max_input_size":       448,
        "anchors":              [0,0, 58,58, 114,193, 116,73, 193,123, 210,270, 303,187, 341,282, 373,367],
        "labels":               ["handgun"]
    },

    "train": {
        "train_image_folder":   "/home/ubuntu/data/yolo3/handgun/images/",
        "train_annot_folder":   "/home/ubuntu/data/yolo3/handgun/annotations/",
        "cache_name":           "handgun_train.pkl",

        "train_times":          8,
        "batch_size":           16,
        "learning_rate":        1e-4,
        "nb_epochs":            100,
        "warmup_epochs":        3,
        "ignore_thresh":        0.5,
        "gpus":                 "0",

        "grid_scales":          [1,1,1],
        "obj_scale":            5,
        "noobj_scale":          1,
        "xywh_scale":           1,
        "class_scale":          1,

        "tensorboard_dir":      "logs",
        "saved_weights_name":   "handgun.h5",
        "debug":                true
    },

    "valid": {
        "valid_image_folder":   "",
        "valid_annot_folder":   "",
        "cache_name":           "",

        "valid_times":          1
    }
}
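
If the training script maps the "gpus" entry onto CUDA_VISIBLE_DEVICES (a common pattern in Keras training scripts; an assumption here, not something verified against this repo), it is also worth confirming that the variable actually reaches the Python process:

import os

# Assumed mapping: "gpus": "0" -> CUDA_VISIBLE_DEVICES=0.
# An empty value or "-1" here would hide the GPU from TensorFlow entirely.
print(os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))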

The output from the training script looks reasonable and the TensorBoard graphs look good (i.e. continuous drops in the loss curves). My concern is that I haven't configured something correctly to utilize the GPU, so training will likely take much longer than it should.
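
One generic way to confirm where the ops actually run (a TF 1.x sketch, not code from this repo) is to enable device-placement logging before the training session is created:

import tensorflow as tf
from keras import backend as K

# Log the device each op is placed on; with a working GPU build the console
# should show ".../device:GPU:0" for the convolution ops.
K.set_session(tf.Session(config=tf.ConfigProto(log_device_placement=True)))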

Can anyone comment as to what I may have done wrong? Thanks in advance for any comments or suggestions.

monocongo avatar Aug 11 '19 23:08 monocongo

I've tried to work around this by modifying the model creation section of the train.py script (here) from this

    if multi_gpu > 1:
        with tf.device('/cpu:0'):

to this

    if multi_gpu >= 1:
        with tf.device('/device:GPU:0'):

but it resulted in this error:

   if not _has_nchw_support() and list(reduction_axes) == [0, 2, 3]:
  File "/home/ubuntu/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 291, in _has_nchw_support
    explicitly_on_cpu = _is_current_explicit_device('CPU')
  File "/home/ubuntu/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 266, in _is_current_explicit_device
    device = _get_current_tf_device()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 247, in _get_current_tf_device
    g._apply_device_functions(op)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4581, in _apply_device_functions
    op._set_device_from_string(device_string)
AttributeError: '_TfDeviceCaptureOp' object has no attribute '_set_device_from_string'

It appears that I've run into a known bug in TensorFlow 1.14, and maybe that's why this code places the model on the CPU even when a list of GPUs is given in the configuration JSON file (i.e. as a workaround for that bug)? I may be misreading things...
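
For what it's worth, building the template model under /cpu:0 when multi_gpu > 1 is the standard Keras multi-GPU pattern: the weights live on the CPU copy and keras.utils.multi_gpu_model replicates the graph onto the GPUs. A minimal sketch of that pattern (not the repo's exact code; the layers below are placeholders):

import tensorflow as tf
from keras.models import Model
from keras.layers import Input, Conv2D
from keras.utils import multi_gpu_model

num_gpus = 2  # placeholder; would come from the "gpus" entry in config.json

# The template model is built on the CPU so its weights are not pinned to one GPU.
with tf.device('/cpu:0'):
    inputs = Input(shape=(None, None, 3))
    outputs = Conv2D(255, (1, 1))(inputs)   # stand-in for the real YOLOv3 graph
    template_model = Model(inputs, outputs)

if num_gpus > 1:
    # Replicas run on the GPUs; gradients are merged back into the CPU copy.
    train_model = multi_gpu_model(template_model, gpus=num_gpus)
else:
    # With a single GPU there is no need for the explicit CPU placement at all.
    train_model = template_model

Changing the condition to >= 1 wraps single-GPU model creation in an explicit device scope, which appears to be what trips the Keras/TF 1.14 incompatibility shown in the traceback above.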

monocongo avatar Aug 12 '19 03:08 monocongo

Hi, @monocongo! I've faced the same problem. As I can see your current driver version is 430. Try 410. It helped me. Good luck!

ivankunyankin avatar Aug 13 '19 19:08 ivankunyankin

Maybe uninstalling tensorflow and installing tensorflow-gpu helps?
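
A quick way to tell which build is installed (a generic TF 1.x check):

import tensorflow as tf

# The CPU-only "tensorflow" wheel reports False here; "tensorflow-gpu" reports True.
print(tf.test.is_built_with_cuda())
print(tf.test.is_gpu_available())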

andreasmarxer avatar Sep 26 '19 15:09 andreasmarxer

@monocongo Have you solved the problem? I met it, too

ydteng avatar Oct 16 '19 14:10 ydteng

TensorFlow 1.14 is not tested with CUDA 10.1, so I installed CUDA 10.0; maybe it solves your problem too. Here is what I did in case you need it:

First, download the CUDA 10.0 installer from: https://developer.nvidia.com/cuda-10.0-download-archive

Then, install it. Choose the configuration so that it does not override CUDA 10.1 on your system. You can install to the default path (usually /usr/local/cuda-10.0), but do not create the symlink to cuda, so CUDA 10.1 will still be the default CUDA version on your system. In bash: sudo sh ./cuda_10.0.130_410.48_linux.run

Make sure to have tensorflow 1.14. In python:

import tensorflow as tf
tf.__version__

Before running the script for training, you must change LD_LIBRARY_PATH with the following bash command:

export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64

This only changes LD_LIBRARY_PATH in the current shell session!

You are ready to go. If you want the script to automatically change LD_LIBRARY_PATH when running, maybe you can check the answers in this StackOverflow thread.
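
As one possible approach (a sketch; the wrapper itself is hypothetical and assumes the CUDA 10.0 libraries live in /usr/local/cuda-10.0/lib64), a small launcher can re-exec itself with the updated environment before TensorFlow is imported, since the dynamic loader reads LD_LIBRARY_PATH at process start:

# launch_train.py (hypothetical wrapper)
import os
import sys

CUDA_LIB = "/usr/local/cuda-10.0/lib64"

if CUDA_LIB not in os.environ.get("LD_LIBRARY_PATH", ""):
    env = dict(os.environ)
    env["LD_LIBRARY_PATH"] = CUDA_LIB + ":" + env.get("LD_LIBRARY_PATH", "")
    # Restart this script with the updated environment, then continue below.
    os.execvpe(sys.executable, [sys.executable] + sys.argv, env)

# From here on TensorFlow picks up the CUDA 10.0 libraries.
import tensorflow as tf
print(tf.__version__)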

hamddan4 avatar Apr 29 '20 09:04 hamddan4

Yeah, hell of a situation. I went through the same thing recently and just gave up; there doesn't seem to be a reliable way of fixing this.

PedroMundel avatar Aug 12 '23 23:08 PedroMundel