keras-yolo3
Training doesn't utilize GPU
I am training the model on a custom dataset on an AWS EC2 instance (p2.xlarge) with an NVIDIA Tesla K80 GPU. After launching the training script I see full CPU utilization but no GPU utilization, as measured by the output of $ watch -n0.1 nvidia-smi:
Sun Aug 11 23:04:01 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40       Driver Version: 430.40       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   55C    P0    58W / 149W |     67MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4546      C   python3                                       56MiB |
+-----------------------------------------------------------------------------+
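To rule out TensorFlow itself, here is a quick check of which devices it can see (a minimal sketch, assuming TensorFlow 1.x):

from tensorflow.python.client import device_lib

# List every device TensorFlow can use; a working setup should report a
# device with device_type 'GPU' alongside the CPU.
print(device_lib.list_local_devices())

If only a CPU device shows up here, the problem is in the TensorFlow/CUDA installation rather than in this repo's training code.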
The EC2 instance is Ubuntu 18.04 with nvidia-driver-430 installed.
The config.json file:
{
    "model": {
        "min_input_size": 288,
        "max_input_size": 448,
        "anchors": [0,0, 58,58, 114,193, 116,73, 193,123, 210,270, 303,187, 341,282, 373,367],
        "labels": ["handgun"]
    },
    "train": {
        "train_image_folder": "/home/ubuntu/data/yolo3/handgun/images/",
        "train_annot_folder": "/home/ubuntu/data/yolo3/handgun/annotations/",
        "cache_name": "handgun_train.pkl",
        "train_times": 8,
        "batch_size": 16,
        "learning_rate": 1e-4,
        "nb_epochs": 100,
        "warmup_epochs": 3,
        "ignore_thresh": 0.5,
        "gpus": "0",
        "grid_scales": [1,1,1],
        "obj_scale": 5,
        "noobj_scale": 1,
        "xywh_scale": 1,
        "class_scale": 1,
        "tensorboard_dir": "logs",
        "saved_weights_name": "handgun.h5",
        "debug": true
    },
    "valid": {
        "valid_image_folder": "",
        "valid_annot_folder": "",
        "cache_name": "",
        "valid_times": 1
    }
}
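For reference, if I'm reading this repo's train.py correctly (a hedged sketch; verify against your checkout), the "gpus" string is consumed roughly like this:

import os

# train.py appears to export the config value as CUDA_VISIBLE_DEVICES and to
# derive the GPU count from the comma-separated list, so "0" means one GPU.
gpus = "0"  # config['train']['gpus']
os.environ['CUDA_VISIBLE_DEVICES'] = gpus
multi_gpu = len(gpus.split(','))  # -> 1

So with "gpus": "0" the script should take the single-GPU code path, and the device selection itself looks correct.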
The output from the training script looks reasonable and the TensorBoard graphs look good (i.e. the loss curves drop steadily). My only concern is that I've not set something up correctly to utilize the GPU, so training will likely take much longer than it should.
Can anyone comment as to what I may have done wrong? Thanks in advance for any comments or suggestions.
I've tried to work around this by modifying the model creation section of the train.py script (here) from this:
if multi_gpu > 1:
    with tf.device('/cpu:0'):
to this:
if multi_gpu >= 1:
    with tf.device('/device:GPU:0'):
but it resulted in this error:
    if not _has_nchw_support() and list(reduction_axes) == [0, 2, 3]:
  File "/home/ubuntu/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 291, in _has_nchw_support
    explicitly_on_cpu = _is_current_explicit_device('CPU')
  File "/home/ubuntu/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 266, in _is_current_explicit_device
    device = _get_current_tf_device()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 247, in _get_current_tf_device
    g._apply_device_functions(op)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4581, in _apply_device_functions
    op._set_device_from_string(device_string)
AttributeError: '_TfDeviceCaptureOp' object has no attribute '_set_device_from_string'
It appears that I've come across a known bug in TensorFlow 1.14. Maybe that's why this code pins the model to the CPU even when GPUs are listed in the configuration JSON file (i.e. as a workaround for that bug)? I may be misreading things...
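For what it's worth, this AttributeError looks like the known incompatibility between older Keras releases (2.2.4 and earlier) and TensorFlow 1.14, which I believe was fixed in Keras 2.2.5. A possible workaround, assuming a pip-managed install (the version pin is my assumption):

# Upgrade Keras so its tensorflow_backend matches the TF 1.14 device API.
pip3 install --upgrade 'keras>=2.2.5'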
Hi, @monocongo! I've faced the same problem. As I can see, your current driver version is 430. Try 410. It helped me. Good luck!
Maybe uninstalling tensorflow and installing tensorflow-gpu would help?
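Something like the following, assuming a pip-managed install (the 1.14.0 pin is my assumption, chosen to match the TF version discussed below):

# In TF 1.x the CPU-only and GPU builds are separate packages, so the
# CPU package must be removed before installing the GPU one.
pip3 uninstall tensorflow
pip3 install 'tensorflow-gpu==1.14.0'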
@monocongo Have you solved the problem? I met it, too
TensorFlow 1.14 is not tested against CUDA 10.1, so I installed CUDA 10.0; maybe that solves your problem too. Here is what I did in case you need it:
First, download the CUDA 10.0 installer from: https://developer.nvidia.com/cuda-10.0-download-archive
Then, install it. Choose the configuration so it does not override CUDA 10.1 on your system. You can install to the default path (usually /usr/local/cuda-10.0), but do not create the symlink to cuda, so CUDA 10.1 will remain the default CUDA version on your system. In bash:
sudo sh ./cuda_10.0.130_410.48_linux.run
Make sure you have TensorFlow 1.14. In Python:
import tensorflow as tf
tf.__version__
Before running the training script, you must change LD_LIBRARY_PATH with the following bash command:
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64
Note that this only changes LD_LIBRARY_PATH in the current shell session!
You are ready to go. If you want LD_LIBRARY_PATH to be set automatically when the script runs, maybe you can check the answers in this StackOverflow thread.
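As a one-off alternative, the variable can also be set just for the training command itself (the train.py -c config.json invocation is my assumption, based on how this repo is usually run):

# Set LD_LIBRARY_PATH for this process only, leaving the shell untouched.
LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64 python3 train.py -c config.json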
Yeah, hell of a situation. I went through the same thing recently and just gave up; there doesn't seem to be a reliable way of fixing this.