keras-extras

Incompatible shapes

Open smhoang opened this issue 8 years ago • 40 comments

I am running make_parallel with 2 GPUs and get the following error from gradients/sub_grad/BroadcastGradientArgs: "InvalidArgumentError (see above for traceback): Incompatible shapes: [483,1] vs. [482,1] [[Node: gradients/sub_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _class=["loc:@sub"], _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/sub_grad/Shape, gradients/sub_grad/Shape_1/_79)]]"

smhoang avatar Feb 08 '17 17:02 smhoang

I get the exact same error. Would appreciate some help on this.

asaluja avatar Feb 21 '17 21:02 asaluja

I get a similar error. My guess is that the last minibatch has an odd number of samples, while the parallelized model only produces an even number of predictions.
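
A rough illustration of that guess. It assumes (this is not taken from the make_parallel source) that each GPU receives `batch_size // n_gpus` samples and any remainder is dropped; `split_sizes` is a hypothetical helper used only for the arithmetic.

```python
def split_sizes(batch_size, n_gpus):
    """Per-GPU slice sizes, assuming each replica gets batch_size // n_gpus samples."""
    per_gpu = batch_size // n_gpus  # integer division drops any remainder
    return [per_gpu] * n_gpus


# A final minibatch of 483 samples split across 2 GPUs.
slices = split_sizes(483, 2)
print(slices)       # [241, 241]
print(sum(slices))  # 482 predictions would come back for 483 targets
```

Under that assumption the concatenated output has 482 rows while the targets still have 483, which matches the `[483,1] vs. [482,1]` shapes in the original report.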

xulabs avatar Feb 28 '17 01:02 xulabs

Did you hardcode the batch size in your first layer's input (batch_input_shape), or did you only give input_dim?

Caduceus96 avatar Mar 08 '17 14:03 Caduceus96

@Caduceus96 just gave input_dim. Batch size is hardcoded when I call fit

asaluja avatar Mar 09 '17 00:03 asaluja

The same error here! Running Keras 2.0.2 with Tensorflow 0.12.1

InvalidArgumentError (see above for traceback): Incompatible shapes: [6376,256] vs. [6379,256]
         [[Node: gradients/sub_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _class=["loc:@sub"], _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/sub_grad/Shape/_459, gradients/sub_grad/Shape_1)]]
         [[Node: gradients/concatenate_1/concat_grad/Slice_7/_491 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:7", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_3913_gradients/concatenate_1/concat_grad/Slice_7", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:7"]()]]

ktamiola avatar Mar 24 '17 01:03 ktamiola

It might be related to the function get_slice. I found that if the number of input samples is a multiple of the batch size, the error does not occur.

Eric2333 avatar Apr 03 '17 00:04 Eric2333

OK, I'm probably wrong. The error seems to come from my callback function. If I don't do callbacks, everything is fine no matter how many rows of input data.

Eric2333 avatar Apr 03 '17 20:04 Eric2333

I actually see this error when I try to run the example on the website.

miguelroboso avatar Apr 04 '17 23:04 miguelroboso

Same as @Eric2333: don't use callbacks, or change them to lambda functions, and it works fine.

sumethy avatar Apr 21 '17 12:04 sumethy

I also ran into this error with Keras 2.0.3 and TensorFlow 1.1.0. It happens at the end of the first epoch of training, possibly while calculating validation. (I do use callbacks for checkpointing and early stopping; I will try without.)

73997312/73997516 [============================>.] - ETA: 0s - loss: 12.1832
/home/ubuntu/devhome/tensorwords2/multi_gpu.py:45: UserWarning: The merge function is deprecated and will be removed after 08/2017. Use instead layers from keras.layers.merge, e.g. add, concatenate, etc.
  merged.append(merge(outputs, mode='concat', concat_axis=0))
/home/ubuntu/.pyenv/versions/tensor/lib/python3.6/site-packages/keras/legacy/layers.py:460: UserWarning: The Merge layer is deprecated and will be removed after 08/2017. Use instead layers from keras.layers.merge, e.g. add, concatenate, etc.
  name=name)
/home/ubuntu/devhome/tensorwords2/multi_gpu.py:47: UserWarning: Update your Model call to the Keras 2 API: Model(inputs=[<tf.Tenso..., outputs=[<tf.Tenso...)
  return Model(input=model.inputs, output=merged)
Traceback (most recent call last):
  File "/home/ubuntu/.pyenv/versions/tensor/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1039, in _do_call
    return fn(*args)
  File "/home/ubuntu/.pyenv/versions/tensor/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1021, in _run_fn
    status, run_metadata)
  File "/home/ubuntu/.pyenv/versions/3.6.1/lib/python3.6/contextlib.py", line 89, in __exit__
    next(self.gen)
  File "/home/ubuntu/.pyenv/versions/tensor/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [204,34] vs. [200,34]
         [[Node: mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](_recv_merge_1_target_0/_9, Log)]]
         [[Node: gradients/merge_1/concat_grad/Slice_3/_529 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:3", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_12702_gradients/merge_1/concat_grad/Slice_3", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:3"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./TextGenLearn3.py", line 293, in <module>
    main()
  File "./TextGenLearn3.py", line 290, in main
    prep.gofit(model,(inputTrain,responseTrain),(inputValid,responseValid), args.output, args.epoch, args.patience, batchSize)
  File "./TextGenLearn3.py", line 174, in gofit
    initial_epoch=nextEpoch)
  File "/home/ubuntu/.pyenv/versions/tensor/lib/python3.6/site-packages/keras/engine/training.py", line 1486, in fit
    initial_epoch=initial_epoch)
  File "/home/ubuntu/.pyenv/versions/tensor/lib/python3.6/site-packages/keras/engine/training.py", line 1141, in _fit_loop
    outs = f(ins_batch)
  File "/home/ubuntu/.pyenv/versions/tensor/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2103, in __call__
    feed_dict=feed_dict)
  File "/home/ubuntu/.pyenv/versions/tensor/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 778, in run
    run_metadata_ptr)
  File "/home/ubuntu/.pyenv/versions/tensor/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 982, in _run
    feed_dict_string, options, run_metadata)
  File "/home/ubuntu/.pyenv/versions/tensor/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1032, in _do_run
    target_list, options, run_metadata)
  File "/home/ubuntu/.pyenv/versions/tensor/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1052, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [204,34] vs. [200,34]
         [[Node: mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](_recv_merge_1_target_0/_9, Log)]]
         [[Node: gradients/merge_1/concat_grad/Slice_3/_529 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:3", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_12702_gradients/merge_1/concat_grad/Slice_3", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:3"]()]]

Caused by op 'mul', defined at:
  File "./TextGenLearn3.py", line 293, in <module>
    main()
  File "./TextGenLearn3.py", line 280, in main
    model = prep.createModel(args.seqlen,numChars,args.lstmsize,args.numlayers,args.dropout,args.learnrate, args.parallel)
  File "./TextGenLearn3.py", line 151, in createModel
    optimizer=optimizer) # Categorical since we are 1-hot categorical.
  File "/home/ubuntu/.pyenv/versions/tensor/lib/python3.6/site-packages/keras/engine/training.py", line 899, in compile
    sample_weight, mask)
  File "/home/ubuntu/.pyenv/versions/tensor/lib/python3.6/site-packages/keras/engine/training.py", line 430, in weighted
    score_array = fn(y_true, y_pred)
  File "/home/ubuntu/.pyenv/versions/tensor/lib/python3.6/site-packages/keras/losses.py", line 37, in categorical_crossentropy
    return K.categorical_crossentropy(y_pred, y_true)
  File "/home/ubuntu/.pyenv/versions/tensor/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2582, in categorical_crossentropy
    return - tf.reduce_sum(target * tf.log(output),
  File "/home/ubuntu/.pyenv/versions/tensor/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 821, in binary_op_wrapper
    return func(x, y, name=name)
  File "/home/ubuntu/.pyenv/versions/tensor/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 1044, in _mul_dispatch
    return gen_math_ops._mul(x, y, name=name)
  File "/home/ubuntu/.pyenv/versions/tensor/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1434, in _mul
    result = _op_def_lib.apply_op("Mul", x=x, y=y, name=name)
  File "/home/ubuntu/.pyenv/versions/tensor/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
    op_def=op_def)
  File "/home/ubuntu/.pyenv/versions/tensor/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/ubuntu/.pyenv/versions/tensor/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Incompatible shapes: [204,34] vs. [200,34]
         [[Node: mul = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](_recv_merge_1_target_0/_9, Log)]]
         [[Node: gradients/merge_1/concat_grad/Slice_3/_529 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:3", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_12702_gradients/merge_1/concat_grad/Slice_3", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:3"]()]]

jgustave avatar May 14 '17 15:05 jgustave

The number of samples just needs to be a multiple of the total number of GPUs. For example, I had 68531 samples in my input, and once I shaved that down to 68528 with 8 GPUs, it worked fine.
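
A minimal sketch of that trimming step, using a plain Python list as a stand-in for the dataset; `trim_to_multiple` is a hypothetical helper name, not part of Keras or keras-extras.

```python
def trim_to_multiple(samples, n_gpus):
    """Drop trailing samples so that len(samples) is a multiple of n_gpus."""
    usable = len(samples) - (len(samples) % n_gpus)
    return samples[:usable]


samples = list(range(68531))  # stand-in for 68531 input samples
trimmed = trim_to_multiple(samples, 8)
print(len(trimmed))           # 68528
```

Computing the end index explicitly also behaves correctly when the length is already a multiple of the GPU count, where a negative slice such as `samples[:-0]` would return an empty list.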

jwilt1 avatar May 30 '17 15:05 jwilt1

@jwilt1 Thanks!! Your example is a nice piece of work. I modified my code so that the input sample size is a multiple of n_gpu.

vense avatar Jun 05 '17 01:06 vense

If you have a large training set it's not an issue, and you can always cut it like this:

    train_cut = len(train_index) % GPUs
    train_index = train_index[:len(train_index) - train_cut]  # also correct when train_cut == 0

And it works fine. But after training I have an issue with predictions: the number of samples has to be a multiple of the number of GPUs there as well. Any ideas?

szhitansky avatar Jun 25 '17 07:06 szhitansky

You can use the same kind of trick as for training, but instead of removing the last remainder elements you pad the end of your dataset to make it divisible by # of gpus, then select the unpadded indices as your actual prediction.
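
A sketch of the padding trick, under the assumption that the parallel model only accepts inputs whose length is a multiple of the GPU count. `predict_padded` and `fake_predict` are hypothetical names; `fake_predict` merely stands in for the parallel model's predict call.

```python
def predict_padded(predict_fn, samples, n_gpus):
    """Pad samples to a multiple of n_gpus, predict, then drop the padded outputs."""
    n = len(samples)
    pad = (-n) % n_gpus                             # number of filler samples needed
    padded = list(samples) + list(samples[-1:]) * pad  # repeat the last sample as filler
    return predict_fn(padded)[:n]                   # keep only the real predictions


def fake_predict(batch):
    """Stand-in for the parallel model: insists on a multiple of 4 inputs."""
    assert len(batch) % 4 == 0
    return [x * 2 for x in batch]


print(predict_padded(fake_predict, [1, 2, 3, 4, 5], 4))  # [2, 4, 6, 8, 10]
```

Repeating the last sample is just one choice of filler; any valid input works, because the padded predictions are discarded.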

Caduceus96 avatar Jun 25 '17 17:06 Caduceus96

@Caduceus96 I sliced my training data so that its size is a multiple of the number of GPUs. The first epoch runs well, but when it comes to the second epoch, an error is raised:

3792/3800 [============================>.] - ETA: 0s - loss: 11.5726 - mean_squared_error: 1.9049
Traceback (most recent call last):

......

InvalidArgumentError (see above for traceback): Incompatible shapes: [12,3] vs. [14,3] [[Node: sub = Sub[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](concatenate_2/concat/_851, _recv_concatenate_2_target_0/_853)]] [[Node: add_3/_857 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_3571_add_3", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

train_shape=[(3800, none, 1)] * 10, valid_shape=[(254, none, 1)] * 10, corresponding to train_shape, num_gpu = 4, train_batch=16

jianglinghan avatar Jul 17 '17 13:07 jianglinghan

Is your training set size evenly divisible by gpu #?

Caduceus96 avatar Jul 17 '17 14:07 Caduceus96

@Caduceus96 I guess so, 3800/4 =950.

jianglinghan avatar Jul 17 '17 14:07 jianglinghan

@JiangLing-han it is evident that you are using small batch sizes during training, as the progress bar output from your Keras model.fit routine stops at 3792/3800.

You need to make sure your batches are of equal size and divisible by 4.

ktamiola avatar Jul 17 '17 15:07 ktamiola

@ktamiola @Caduceus96 I solved this problem by setting the size of the validation set to a multiple of 4. The model was copied, so the validation data is sliced across GPUs just like the training data. Many thanks to you both. :)
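
A small check that reproduces the arithmetic behind this fix. It assumes batches are fed in order with a shorter final batch, and that the same batch size (16) is used for validation; `batch_sizes` and `uneven_batches` are hypothetical helpers.

```python
def batch_sizes(n_samples, batch_size):
    """Sizes of the consecutive batches in one pass over n_samples."""
    full, rest = divmod(n_samples, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


def uneven_batches(n_samples, batch_size, n_gpus):
    """Batch sizes that cannot be split evenly across n_gpus."""
    return [b for b in batch_sizes(n_samples, batch_size) if b % n_gpus]


print(uneven_batches(3800, 16, 4))  # []   (last training batch has 8 samples)
print(uneven_batches(254, 16, 4))   # [14] (last validation batch has 14 samples)
```

With 4 GPUs, a batch of 14 would yield 4 slices of 3, that is 12 outputs for 14 targets, which is consistent with the `[12,3] vs. [14,3]` shapes reported above.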

jianglinghan avatar Jul 17 '17 17:07 jianglinghan

If you want to predict just one at a time, instead of a multiple of the GPUs used during training, you can create a 2nd model that is identical and load the weights of your parallelized model.

  1. Create a model named model1
  2. Create model2 by applying the make_parallel function to model1
  3. Train model2 with 8 GPUs
  4. Set model1 weights to the weights of model2: model1.set_weights(model2.get_weights())
  5. Predict however many you want at a time using model1

model1.predict(val[0:10,:,:]) -> success
model2.predict(val[0:10,:,:]) -> ValueError: could not broadcast input array from shape (8,2) into shape (10,2)

jwilt1 avatar Aug 02 '17 19:08 jwilt1

Many thanks for your code! I would suggest adding a note at the beginning of the make_parallel function stating that the size of the training/validation data should be divisible by the number of GPUs. Otherwise it is hard for a user to see why training runs fine but an incompatible shapes exception is thrown after an epoch.

DarkForte avatar Aug 22 '17 03:08 DarkForte

Has anyone else faced an error using regularizers? I am using layers like this:

def conv2d_bn(x, nb_filter, nb_row, nb_col, padding='same', strides=(1, 1), bias=False):
    """
    Utility function to apply conv + BN.
    (Slightly modified from https://github.com/fchollet/keras/blob/master/keras/applications/inception_v3.py)
    """
    if K.image_data_format() == "channels_first":
        channel_axis = 1
    else:
        channel_axis = -1
    x = Convolution2D(nb_filter, (nb_row, nb_col),
                      strides=strides,
                      padding=padding,
                      use_bias=bias,
                      kernel_regularizer=regularizers.l2(0.00004), ##<---- causes error because no _loss 
                      kernel_initializer=initializers.VarianceScaling(scale=2.0, mode='fan_in', distribution='normal',
                                                                      seed=None))(x)
    x = BatchNormalization(axis=channel_axis, momentum=0.9997, scale=False)(x)
    x = Activation('relu')(x)
    return x

I get the error "AttributeError: 'Model' object has no attribute '_losses'", caused by outputs = model(inputs), which merges the outputs of the different splits into one model.

CeadeS avatar Aug 22 '17 04:08 CeadeS

batch size : 64
number of batches : 20
number of GPUs: 2
The error I got:
InvalidArgumentError: Incompatible shapes: [64,2] vs. [128,2]
How can I deal with this?

DNXie avatar Feb 25 '18 08:02 DNXie

@DNXie, I am having the same error: shape[0] gets halved. Did you find a solution?

A related issue: https://github.com/keras-team/keras/issues/9449

zyxue avatar Apr 05 '18 21:04 zyxue

Same issue here with the latest Keras version.

ghost avatar May 08 '18 05:05 ghost

Hi, was a fix issued for this error? I am facing the same issue. model.fit works for batch size 64 when not using multiple GPUs. But when I put the same model through multi_gpu_model and call fit on it, it raises an error saying that 16 and 64 are incompatible shapes.

umashgh avatar Oct 14 '18 08:10 umashgh

I am getting the error tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [7600] vs. [400,19]. Some observations:

  1. I get this error only when I run my code on a GPU node (Tesla K80)
  2. I do not get the error for batch_size = 1
  3. I do not get the error when I do not use metrics=['accuracy'] in compile.
  4. I get the error only for some particular architectures
  5. All the errors reported above involve arrays of the same dimensionality, [n1,n2] vs [m1,m2], but my case is [n] vs [n/r, r]
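
One reading that is consistent with points 2 and 5, offered as a guess rather than a diagnosis: the two operands hold the same number of elements (400 × 19 = 7600) but one has been flattened, so they no longer broadcast against each other, whereas a batch of 1 gives `[19]` vs. `[1,19]`, which does broadcast. The sketch below only checks NumPy-style broadcasting rules in pure Python; `broadcastable` is a hypothetical helper, and it does not identify which op does the flattening.

```python
def broadcastable(shape_a, shape_b):
    """NumPy-style check: aligned from the right, dims must be equal or 1."""
    for a, b in zip(reversed(shape_a), reversed(shape_b)):
        if a != b and a != 1 and b != 1:
            return False
    return True


print(broadcastable((7600,), (400, 19)))  # False: 7600 vs 19 in the last axis
print(broadcastable((19,), (1, 19)))      # True: the batch_size = 1 case
```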

The full error is as follows:

MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
Epoch 1/10
Traceback (most recent call last):
  File "driver_training.py", line 66, in <module>
    history = ED.fit_model()
  File "/home/ubuntu/2018-December/models/commom/v1/seq2seq_trainig.py", line 114, in fit_model
    callbacks=callback(self.cfg))
  File "/home/ubuntu/software/tf/lib/python3.6/site-packages/keras/engine/training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "/home/ubuntu/software/tf/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 199, in fit_loop
    outs = f(ins_batch)
  File "/home/ubuntu/software/tf/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/home/ubuntu/software/tf/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/home/ubuntu/software/tf/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1382, in __call__
    run_metadata_ptr)
  File "/home/ubuntu/software/tf/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [7600] vs. [400,19]
         [[Node: metrics/acc/Equal = Equal[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](metrics/acc/Reshape, metrics/acc/Cast)]]
         [[Node: loss/mul/_253 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_4325_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

jayanti-prasad avatar Dec 14 '18 07:12 jayanti-prasad

Here is the full code:

import numpy as np
from keras.models import Model
from keras import optimizers
from keras.layers import Input, Dense, Embedding
import keras

num_decoder_tokens = 40
len_label_vector = 20
latent_dim = 300

train_labels_vecs = np.random.randint(num_decoder_tokens, size=(100, len_label_vector))

decoder_input_data = train_labels_vecs[:, :-1]
decoder_target_data = train_labels_vecs[:, 1:]

decoder_inputs = Input(shape=(None,), name='Decoder-Input')  # for teacher forcing
x = Embedding(num_decoder_tokens, latent_dim, name='Decoder-Word-Embedding', mask_zero=False)(decoder_inputs)
decoder_outputs = Dense(num_decoder_tokens, activation='softmax', name='Final-Output-Dense')(x)

seq2seq_Model = Model([decoder_inputs], decoder_outputs)

print(seq2seq_Model.summary())

seq2seq_Model.compile(optimizer=optimizers.Nadam(lr=0.001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

history = seq2seq_Model.fit([decoder_input_data], np.expand_dims(decoder_target_data, -1),validation_split=0.12,epochs=10,batch_size=2)

jayanti-prasad avatar Dec 14 '18 08:12 jayanti-prasad

@jayanti-prasad

Same error here. The following are both true when I run a seq2seq architecture on a local PC:

  • I do not get the error for batch_size = 1
  • I do not get the error when I do not use metrics=['accuracy'] in compile.

BUT there is no error when I run the code in a Kaggle kernel with the same TF version (1.12.0) and Keras version (2.2.4).

davidkorea avatar Feb 12 '19 08:02 davidkorea

I also have a very similar error, and changing the batch size and sample size to multiples of the number of GPUs doesn't solve the problem. My error is as follows:

InvalidArgumentError: Incompatible shapes: [128,32,32,3] vs. [256,32,32,3]
	 [[{{node replica_1/sequential_1/conv_lst_m2d_1/while/mul_3}} = Mul[T=DT_FLOAT, _class=["loc:@train...rayWriteV3"], _device="/job:localhost/replica:0/task:0/device:GPU:1"](replica_1/sequential_1/conv_lst_m2d_1/while/TensorArrayReadV3, replica_1/sequential_1/conv_lst_m2d_1/while/mul_3/Enter)]]
	 [[{{node loss/mul/_305}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_5049_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

This problem only happens when the model has a ConvLSTM2D layer, without it the code runs just fine. As for other properties:

  • I am using 2 GPUs
  • Sample size 2048
  • batch size 256
  • Each of my input samples has shape [21, 32, 32, 1], where 21 is the temporal size, 32 x 32 is the image size, and 1 is the number of channels

TianrenWang avatar Mar 25 '19 08:03 TianrenWang