Conv implementation does not support grouped convolutions for now. GPU / CPU

Open moritzbe opened this issue 5 years ago • 10 comments

Hi, I am trying to train locally on my Mac without a GPU. Thank you Experiencor, I can train successfully on the raccoon dataset.

This is my config.json file:

`{
"model": {
    "min_input_size":       1000,
    "max_input_size":       1920,
    "anchors":              [21,68, 27,83, 63,64, 77,77, 96,87, 111,101, 161,98, 178,131, 209,110],
    "labels":               ["1", "2", "3", "4"]
},

"train": {
    "train_image_folder":   "/Users/moritz.b/Desktop/keras-yolo3-master/yolo/",
    "train_annot_folder":   "/Users/moritz.b/Desktop/keras-yolo3-master/yolo/outputs/",
    "cache_name":           "localizer.pkl",
    "train_times":          6,
    "batch_size":           2,
    "learning_rate":        1e-4,
    "nb_epochs":            10,
    "warmup_epochs":        2,
    "ignore_thresh":        0.5,
    "gpus":                 "0",

    "grid_scales":          [1,1,1],
    "obj_scale":            5,
    "noobj_scale":          1,
    "xywh_scale":           1,
    "class_scale":          1,

    "tensorboard_dir":      "logs",
    "saved_weights_name":   "localisations.h5",
    "debug":                true
},

"valid": {
    "valid_image_folder":   "",
    "valid_annot_folder":   "",
    "cache_name":           "",
    "valid_times":          1
}

}`

When I start training, the error message says:

Traceback (most recent call last):
  File "train.py", line 290, in <module>
    _main_(args)
  File "train.py", line 267, in _main_
    max_queue_size = 8
  File "/Users/moritz.b/opt/anaconda3/envs/yolo3/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/Users/moritz.b/opt/anaconda3/envs/yolo3/lib/python3.5/site-packages/keras/engine/training.py", line 1732, in fit_generator
    initial_epoch=initial_epoch)
  File "/Users/moritz.b/opt/anaconda3/envs/yolo3/lib/python3.5/site-packages/keras/engine/training_generator.py", line 220, in fit_generator
    reset_metrics=False)
  File "/Users/moritz.b/opt/anaconda3/envs/yolo3/lib/python3.5/site-packages/keras/engine/training.py", line 1514, in train_on_batch
    outputs = self.train_function(ins)
  File "/Users/moritz.b/opt/anaconda3/envs/yolo3/lib/python3.5/site-packages/tensorflow_core/python/keras/backend.py", line 3476, in __call__
    run_metadata=self.run_metadata)
  File "/Users/moritz.b/opt/anaconda3/envs/yolo3/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.UnimplementedError: Fused conv implementation does not support grouped convolutions for now.
  [[{{node conv_81/BiasAdd}}]]

Does anybody have an idea?

moritzbe avatar Feb 18 '20 20:02 moritzbe
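
For anyone comparing their own setup against the config.json above, here is a minimal sketch of a sanity check. It assumes the standard YOLOv3 design: nine anchors (18 numbers) and output layers with 3 * (4 + 1 + number of classes) filters, which is why the output layer shape depends on the `labels` list. The helpers `check_yolo3_config` and `head_filters` are hypothetical and not part of keras-yolo3.

```python
def check_yolo3_config(config):
    """Return a list of problems found in a keras-yolo3 style config dict.

    Hypothetical helper; the rules below assume the standard YOLOv3 layout.
    """
    problems = []
    model = config["model"]

    # YOLOv3 uses 9 anchors, each a (width, height) pair, so 18 numbers.
    anchors = model["anchors"]
    if len(anchors) != 18:
        problems.append("expected 18 anchor values, got %d" % len(anchors))

    # Every class name should appear exactly once.
    labels = model["labels"]
    if len(set(labels)) != len(labels):
        problems.append("labels contain duplicates")

    if model["min_input_size"] > model["max_input_size"]:
        problems.append("min_input_size is larger than max_input_size")

    return problems


def head_filters(num_classes, anchors_per_scale=3):
    """Filters in each YOLOv3 output layer.

    Each anchor predicts 4 box coordinates, 1 objectness score,
    and one score per class.
    """
    return anchors_per_scale * (4 + 1 + num_classes)


config = {
    "model": {
        "min_input_size": 1000,
        "max_input_size": 1920,
        "anchors": [21, 68, 27, 83, 63, 64, 77, 77, 96, 87,
                    111, 101, 161, 98, 178, 131, 209, 110],
        "labels": ["1", "2", "3", "4"],
    }
}
print(check_yolo3_config(config))
print(head_filters(len(config["model"]["labels"])))
```

With four labels the output layers need 27 filters, while a single-class dataset such as raccoon needs 18, so anything cached or saved for one class list will not match a model built for another.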

I'm getting the same error when training with zoo/config_rbc.json.

moritzbe avatar Feb 18 '20 21:02 moritzbe

On Google Colab it works fine with the same environment, so it seems to be a hardware-related issue.

moritzbe avatar Mar 19 '20 07:03 moritzbe

I have reproduced your errors on Ubuntu 18.04 with a GTX 1660. According to this thread, grouped convolutions may not be supported on CPU. I don't know if that is still the case for TensorFlow 1.15, but I get your errors when I switch to CPU. You might try the suggestion in this issue comment: https://github.com/tensorflow/tensorflow/issues/29005#issuecomment-569554164

However, when I use GPU I'm still getting errors at that same layer:

tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
  (0) Not found: No algorithm worked!
    [[{{node conv_81/convolution}}]]
    [[loss/Identity_1/_3877]]
  (1) Not found: No algorithm worked!
    [[{{node conv_81/convolution}}]]

Would you mind sharing the google colab that worked?

aaronrmm avatar Mar 19 '20 20:03 aaronrmm

Having the same issues on Colab and on CPU. It errors on multiple classes. Output on Colab: [screenshot, 2020-03-29]

pushkarjain avatar Mar 29 '20 20:03 pushkarjain

> On Google Colab it works fine with the same environment, so it seems to be a hardware-related issue.

I'm having the same issue. I tried to train this model with train.py, but it errors on multiple classes. How did you solve this problem?

YoonSungLee avatar May 22 '20 04:05 YoonSungLee

> Having the same issues on Colab and on CPU. It errors on multiple classes. Output on Colab: [screenshot, 2020-03-29]

I'm having the same issue, too. Did you solve this problem?

YoonSungLee avatar May 22 '20 04:05 YoonSungLee

@YoonSungLee: Yes, I was able to solve the problem. I had to use a different cache_name for the data in config.json. This error occurs when you have updated the output layer to accommodate new classes but the previously created pickle file still uses the old class list.

Hope that helps.

pushkarjain avatar May 22 '20 04:05 pushkarjain
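
An alternative to renaming cache_name is to delete the old cache file so that it gets rebuilt on the next run. The following is a minimal sketch, assuming the cache_name entries in config.json are file paths relative to the directory you run train.py from; `remove_stale_caches` is a hypothetical helper, not something the repository provides.

```python
import json
import os


def remove_stale_caches(config_path):
    """Delete the cache files named in a keras-yolo3 style config.json.

    Hypothetical helper: assumes each section's "cache_name" is a path
    to a pickle file that the training script recreates when missing.
    Returns the list of files that were actually removed.
    """
    with open(config_path) as f:
        config = json.load(f)

    removed = []
    for section in ("train", "valid"):
        cache_name = config.get(section, {}).get("cache_name", "")
        # An empty cache_name (as in the "valid" section above) is skipped.
        if cache_name and os.path.isfile(cache_name):
            os.remove(cache_name)
            removed.append(cache_name)
    return removed


if __name__ == "__main__" and os.path.isfile("config.json"):
    print(remove_stale_caches("config.json"))
```

Run it whenever the labels list in config.json changes, before starting train.py again.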

> @YoonSungLee: Yes, I was able to solve the problem. I had to use a different cache_name for the data in config.json. This error occurs when you have updated the output layer to accommodate new classes but the previously created pickle file still uses the old class list.
>
> Hope that helps.

Wow, thank you very much! With that change I got past that error, but now I have another one. Do you happen to know how to solve it?

The error is as follows:

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.

WARNING:tensorflow:From /content/gdrive/My Drive/Project/YOLO v3/keras-yolo3/yolo.py:26: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.

WARNING:tensorflow:From /content/gdrive/My Drive/Project/YOLO v3/keras-yolo3/yolo.py:151: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

Loading pretrained weights.

Traceback (most recent call last):
  File "train.py", line 295, in <module>
    _main_(args)
  File "train.py", line 257, in _main_
    class_scale = config['train']['class_scale'],
  File "train.py", line 167, in create_model
    template_model.load_weights(saved_weights_name)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/saving.py", line 492, in load_wrapper
    return load_function(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/network.py", line 1230, in load_weights
    f, self.layers, reshape=reshape)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/saving.py", line 1200, in load_weights_from_hdf5_group
    g = f[name]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/group.py", line 264, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: 'Unable to open object (bad symbol table node signature)'

YoonSungLee avatar May 22 '20 06:05 YoonSungLee

@YoonSungLee Do you have backend.h5 in the same location as config.json? It contains the pretrained weights for the model.

pushkarjain avatar May 22 '20 06:05 pushkarjain
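
A KeyError like "bad symbol table node signature" comes from the HDF5 library while reading the weights file, which typically suggests the .h5 file is damaged or incomplete. As a first quick check, this sketch verifies that a weights file exists and starts with the HDF5 format signature. It is only a rough filter: a file can pass it and still be truncated further in. The function name `looks_like_hdf5` is hypothetical.

```python
import os

# The 8-byte signature defined by the HDF5 file format.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path):
    """Return True if `path` exists and begins with the HDF5 signature.

    Only offset 0 is checked. HDF5 also allows the signature at later
    offsets (512, 1024, ...), which this sketch deliberately ignores.
    """
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as f:
        return f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


for name in ("backend.h5", "localisations.h5"):
    print(name, looks_like_hdf5(name))
```

If a file fails the check, or passes it but still fails to load, downloading or regenerating it is probably the simplest next step.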

> @YoonSungLee Do you have backend.h5 in the same location as config.json? It contains the pretrained weights for the model.

Wow, by changing the h5 file I got past that error, but now I have yet another one. I'm very exhausted. Could you help me? The error is as follows:

Loading pretrained weights.

/usr/local/lib/python3.6/dist-packages/keras/callbacks/callbacks.py:998: UserWarning: epsilon argument is deprecated and will be removed, use min_delta instead.
  warnings.warn('epsilon argument is deprecated and '

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:422: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:431: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:438: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/callbacks/tensorboard_v1.py:200: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/callbacks/tensorboard_v1.py:203: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

Epoch 1/100
src/tcmalloc.cc:283] Attempt to free invalid pointer 0x696e69617274160a

YoonSungLee avatar May 22 '20 06:05 YoonSungLee