keras-yolo2
Multi GPU training
I tried this code on CPU and on a single GPU and it works fine. I tried a previous version on 4 GPUs and it worked fine too. I will be able to try this version on multiple GPUs on Monday. Let me know what you think.
This is great! Will let you know if I am able to run the code.
@alessandro-montanari I find that the multi-GPU version produces worse results than the single-GPU version: it makes a lot of wrong detections. Is there anything I need to watch out for when running the multi-GPU version?
That's weird. What's your batch size? Maybe you need to train for longer, because with a bigger batch size there are fewer updates per epoch. I am trying it on the raccoon dataset.
Unfortunately I am having some weird issues with the images: the code fails in preprocessing.py line 238 (h, w, c = image.shape) with ValueError: not enough values to unpack (expected 3, got 2). This is not due to the code; it happens because I am running it on a cluster where I can test multiple GPUs but have always had strange problems with JPEG files. I also tried the master branch and got the same error.
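One possible reading of that ValueError, offered as an assumption rather than a diagnosis of the cluster problem: the image was decoded as a 2-D grayscale array, so its shape has only two entries. A minimal sketch of a guard follows; the helper name ensure_three_channels is hypothetical and is not part of preprocessing.py.

```python
import numpy as np

def ensure_three_channels(image):
    """Return an (h, w, 3) array for a 2-D grayscale or (h, w, 1) input."""
    if image is None:
        # Image readers such as cv2.imread return None when decoding fails.
        raise ValueError("image could not be decoded")
    if image.ndim == 2:
        # Grayscale: copy the single plane into three identical channels.
        image = np.stack([image] * 3, axis=-1)
    elif image.ndim == 3 and image.shape[2] == 1:
        # Single-channel image with an explicit channel axis.
        image = np.repeat(image, 3, axis=2)
    return image

# Stand-in for a grayscale JPEG that was decoded to a 2-D array.
gray = np.zeros((4, 6), dtype=np.uint8)
h, w, c = ensure_three_channels(gray).shape
print(h, w, c)
```

Calling such a helper just before the unpacking line would at least separate "grayscale file" from "file that failed to decode" when debugging.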
Anyway, with the code we are using for our application (basically this one plus some other changes to your implementation) we didn't see any loss in accuracy going from 1 GPU (batch size = 40) to 4 GPUs (batch size = 160). Is your validation loss very different from the single-GPU version of the code? Do you evaluate the model immediately after training, or do you reload the weights first?
I'll try to come back to this, but please let me know if you have any news.
Has anyone been able to train with more than one GPU? Over here, Keras crashes at the end of the first epoch when trying to save the model.
@msis what error do you get?
@alessandro-montanari Here's the trace with python2:
Traceback (most recent call last):
File "train.py", line 144, in <module>
_main_(args)
File "train.py", line 140, in _main_
debug = config['train']['debug'])
File "/home/ubuntu/dl/basic-yolo-keras/frontend.py", line 478, in train
max_queue_size = 8)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/keras/engine/training.py", line 2213, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/keras/callbacks.py", line 76, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/keras/callbacks.py", line 418, in on_epoch_end
self.model.save(filepath, overwrite=True)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/keras/engine/topology.py", line 2573, in save
save_model(self, filepath, overwrite, include_optimizer)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/keras/models.py", line 111, in save_model
'config': model.get_config()
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/keras/engine/topology.py", line 2414, in get_config
return copy.deepcopy(config)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 257, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 230, in _deepcopy_list
y.append(deepcopy(a, memo))
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 257, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 257, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 237, in _deepcopy_tuple
y.append(deepcopy(a, memo))
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 237, in _deepcopy_tuple
y.append(deepcopy(a, memo))
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 190, in deepcopy
y = _reconstruct(x, rv, 1, memo)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 334, in _reconstruct
state = deepcopy(state, memo)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 257, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 190, in deepcopy
y = _reconstruct(x, rv, 1, memo)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 334, in _reconstruct
state = deepcopy(state, memo)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 257, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 257, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/home/ubuntu/anaconda3/envs/tensorflow_p27/lib/python2.7/copy.py", line 182, in deepcopy
rv = reductor(2)
TypeError: can't pickle NotImplementedType objects
and in python3:
Traceback (most recent call last):
File "train.py", line 144, in <module>
_main_(args)
File "train.py", line 140, in _main_
debug = config['train']['debug'])
File "/home/smr/tmp/basic-yolo-keras/frontend.py", line 478, in train
max_queue_size = 8)
File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
return func(*args, **kwargs)
File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/engine/training.py", line 2117, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/callbacks.py", line 73, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/callbacks.py", line 414, in on_epoch_end
self.model.save(filepath, overwrite=True)
File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/engine/topology.py", line 2556, in save
save_model(self, filepath, overwrite, include_optimizer)
File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/models.py", line 107, in save_model
'config': model.get_config()
File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/engine/topology.py", line 2397, in get_config
return copy.deepcopy(config)
File "/anaconda/envs/py35/lib/python3.5/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/anaconda/envs/py35/lib/python3.5/copy.py", line 243, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/anaconda/envs/py35/lib/python3.5/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/anaconda/envs/py35/lib/python3.5/copy.py", line 218, in _deepcopy_list
y.append(deepcopy(a, memo))
File "/anaconda/envs/py35/lib/python3.5/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/anaconda/envs/py35/lib/python3.5/copy.py", line 243, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/anaconda/envs/py35/lib/python3.5/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/anaconda/envs/py35/lib/python3.5/copy.py", line 243, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/anaconda/envs/py35/lib/python3.5/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/anaconda/envs/py35/lib/python3.5/copy.py", line 223, in _deepcopy_tuple
y = [deepcopy(a, memo) for a in x]
File "/anaconda/envs/py35/lib/python3.5/copy.py", line 223, in <listcomp>
y = [deepcopy(a, memo) for a in x]
File "/anaconda/envs/py35/lib/python3.5/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/anaconda/envs/py35/lib/python3.5/copy.py", line 223, in _deepcopy_tuple
y = [deepcopy(a, memo) for a in x]
File "/anaconda/envs/py35/lib/python3.5/copy.py", line 223, in <listcomp>
y = [deepcopy(a, memo) for a in x]
File "/anaconda/envs/py35/lib/python3.5/copy.py", line 182, in deepcopy
y = _reconstruct(x, rv, 1, memo)
File "/anaconda/envs/py35/lib/python3.5/copy.py", line 297, in _reconstruct
state = deepcopy(state, memo)
File "/anaconda/envs/py35/lib/python3.5/copy.py", line 155, in deepcopy
y = copier(x, memo)
File "/anaconda/envs/py35/lib/python3.5/copy.py", line 243, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/anaconda/envs/py35/lib/python3.5/copy.py", line 182, in deepcopy
y = _reconstruct(x, rv, 1, memo)
File "/anaconda/envs/py35/lib/python3.5/copy.py", line 306, in _reconstruct
y.__dict__.update(state)
AttributeError: 'NoneType' object has no attribute 'update'
Exception ignored in: <bound method BaseSession.__del__ of <tensorflow.python.client.session.Session object at 0x7f40ec384e48>>
Traceback (most recent call last):
File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 696, in __del__
File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/c_api_util.py", line 30, in __init__
TypeError: 'NoneType' object is not callable
N.B. I used 2to3 to run the project with python3.
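Both traces end inside model.save(), called from the checkpoint callback, while deep-copying the model config. A workaround often used in this situation, assuming the parallel model is a wrapper around a single-GPU "template" model that shares its weights, is to checkpoint only the template's weights so that the wrapper's config is never serialized. The sketch below is framework-free so it runs without Keras; the class name TemplateCheckpoint and the stand-in model are hypothetical and not part of this repository.

```python
class TemplateCheckpoint(object):
    """Save the template model's weights whenever the monitored loss improves.

    `template_model` only needs a save_weights(path) method, which Keras
    models provide. Saving weights avoids the get_config() call that
    fails in the traces above.
    """

    def __init__(self, template_model, filepath, monitor="val_loss"):
        self.template_model = template_model
        self.filepath = filepath
        self.monitor = monitor
        self.best = float("inf")

    def on_epoch_end(self, epoch, logs=None):
        current = (logs or {}).get(self.monitor)
        if current is None or current >= self.best:
            return False  # nothing to monitor, or no improvement
        self.best = current
        self.template_model.save_weights(self.filepath)
        return True


class FakeModel(object):
    """Stand-in for a Keras model, used only to exercise the sketch."""

    def __init__(self):
        self.saved = []

    def save_weights(self, path):
        self.saved.append(path)


fake = FakeModel()
ckpt = TemplateCheckpoint(fake, "best_weights.h5")
for epoch, loss in enumerate([0.9, 0.7, 0.8]):
    ckpt.on_epoch_end(epoch, {"val_loss": loss})
print(len(fake.saved))  # two improvements (0.9, then 0.7), so two saves
```

With Keras, one way to wire this in would be keras.callbacks.LambdaCallback(on_epoch_end=ckpt.on_epoch_end), passed in place of ModelCheckpoint when fitting the parallel model. Whether this removes the crash seen here is untested.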
I have an issue with multi-GPU training. I trained on my own dataset (6,000 images) using your multi-GPU code, but I got a mAP of 0.00 at evaluation (single-GPU = 0.3 / multi-GPU = 0.00).
The only difference in configuration is the batch size (multi-GPU: 64, single-GPU: 16). What's the problem?
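To put numbers on the earlier remark that a bigger batch size means fewer updates, here is a small sketch using the figures from this report. It assumes all 6,000 images are used for training and that a partial final batch still counts as one update; the function name updates_per_epoch is hypothetical.

```python
import math

def updates_per_epoch(num_images, batch_size):
    """Number of weight updates in one pass over the data."""
    return int(math.ceil(num_images / float(batch_size)))

num_images = 6000
single_gpu = updates_per_epoch(num_images, 16)
multi_gpu = updates_per_epoch(num_images, 64)
print(single_gpu, multi_gpu)  # 375 94
```

So for the same number of epochs the multi-GPU run makes roughly a quarter as many updates. That may explain slower convergence, but on its own it would not necessarily account for a mAP of exactly 0.00.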