BERT-keras
Error in sparse_categorical_crossentropy when using the Theano backend
There is no problem at all with the TensorFlow backend; now I am testing Theano. When running train_model from tutorial.ipynb, T.nnet.softmax() inside K.sparse_categorical_crossentropy raises a "must be 1-d or 2-d tensor, got TensorType(float32, 3D)" error:
<ipython-input-22-27837df85ad1> in classification_loss(y_true, y_pred)
2 import keras.backend as K
3 def classification_loss(y_true, y_pred):
----> 4 return K.sparse_categorical_crossentropy(y_true, y_pred, from_logits=True)
5 train.classification_loss = classification_loss
/usr/local/lib/python3.6/dist-packages/keras/backend/theano_backend.py in sparse_categorical_crossentropy(target, output, from_logits, axis)
1788 target = T.extra_ops.to_one_hot(target, nb_class=output.shape[-1])
1789 target = reshape(target, shape(output))
-> 1790 return categorical_crossentropy(target, output, from_logits, axis=-1)
1791
1792
/usr/local/lib/python3.6/dist-packages/keras/backend/theano_backend.py in categorical_crossentropy(target, output, from_logits, axis)
1762 target = permute_dimensions(target, permutation)
1763 if from_logits:
-> 1764 output = T.nnet.softmax(output)
1765 else:
1766 # scale preds so that the class probas of each sample sum to 1
/usr/local/lib/python3.6/dist-packages/theano/tensor/nnet/nnet.py in softmax(c)
813 if c.broadcastable[-1]:
814 warnings.warn("The softmax is applied on a dimension of shape 1, which does not have a semantic meaning.")
--> 815 return softmax_op(c)
816
817
/usr/local/lib/python3.6/dist-packages/theano/gof/op.py in __call__(self, *inputs, **kwargs)
613 """
614 return_list = kwargs.pop('return_list', False)
--> 615 node = self.make_node(*inputs, **kwargs)
616
617 if config.compute_test_value != 'off':
/usr/local/lib/python3.6/dist-packages/theano/tensor/nnet/nnet.py in make_node(self, x)
428 or x.type.dtype not in tensor.float_dtypes:
429 raise ValueError('x must be 1-d or 2-d tensor of floats. Got %s' %
--> 430 x.type)
431 if x.ndim == 1:
432 warnings.warn("DEPRECATION: If x is a vector, Softmax will not automatically pad x "
ValueError: x must be 1-d or 2-d tensor of floats. Got TensorType(float32, 3D)
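The root cause is that Theano's softmax op only accepts 1-d or 2-d float tensors, while the logits flowing through K.sparse_categorical_crossentropy here are 3-d (batch, sequence, classes). A minimal sketch (variable names are illustrative) that reproduces the failure:
import theano.tensor as T

# logits with shape (batch, seq_len, n_classes): a 3-d float32 tensor
logits = T.tensor3('logits')

# Theano rejects anything that is not 1-d or 2-d at graph-construction time:
# ValueError: x must be 1-d or 2-d tensor of floats. Got TensorType(float32, 3D)
probs = T.nnet.softmax(logits)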
Then I used the following monkey-patch to work around it:
import keras.backend as K

# keep a reference to the original Theano softmax op
_softmax = K.T.nnet.softmax

def softmax(x):
    # Theano's softmax only handles 1-d/2-d input, so for a 3-d tensor
    # collapse the first two dimensions, apply softmax, and restore the shape
    if x.ndim == 3:
        d1, d2, d3 = x.shape
        return _softmax(x.reshape((d1 * d2, d3))).reshape((d1, d2, d3))
    return _softmax(x)

# monkey-patch the op so Keras' backend picks up the wrapped version
K.T.nnet.softmax = softmax
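A possible alternative to monkey-patching (an untested sketch, assuming y_true has shape (batch, seq_len)) would be to flatten the tensors inside the loss itself, so the 2-d-only softmax is never hit:
import keras.backend as K

def classification_loss(y_true, y_pred):
    # collapse (batch, seq_len, n_classes) -> (batch * seq_len, n_classes)
    y_pred_2d = K.reshape(y_pred, (-1, K.shape(y_pred)[-1]))
    # collapse (batch, seq_len) -> (batch * seq_len,)
    y_true_1d = K.flatten(y_true)
    loss = K.sparse_categorical_crossentropy(y_true_1d, y_pred_2d, from_logits=True)
    # restore the per-token loss shape (batch, seq_len)
    return K.reshape(loss, K.shape(y_true))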
But when I run
m = train_model(base_model=sequence_encoder, is_causal=False, tasks_meta_data=tasks, pretrain_generator=generator,
finetune_generator=generator, pretrain_epochs=100, pretrain_steps=number_of_pretrain_steps // 100,
finetune_epochs=100, finetune_steps=number_of_finetune_steps // 100, verbose=2, TPUStrategy=strategy)
again, I get an error:
/usr/local/lib/python3.6/dist-packages/keras/layers/core.py:665: UserWarning: `output_shape` argument not specified for layer lm_logits and cannot be automatically inferred with the Theano backend. Defaulting to output shape `(None, 6)` (same as input shape). If the expected output shape is different, specify it via the `output_shape` argument.
.format(self.name, input_shape))
/usr/local/lib/python3.6/dist-packages/keras/layers/core.py:665: UserWarning: `output_shape` argument not specified for layer lm_loss and cannot be automatically inferred with the Theano backend. Defaulting to output shape `[(None, 1), (None, 8), (None, 8, 6), (None, 8)]` (same as input shape). If the expected output shape is different, specify it via the `output_shape` argument.
.format(self.name, input_shape))
/usr/local/lib/python3.6/dist-packages/keras/layers/core.py:665: UserWarning: `output_shape` argument not specified for layer odd_flatten and cannot be automatically inferred with the Theano backend. Defaulting to output shape `(None, 8, 6)` (same as input shape). If the expected output shape is different, specify it via the `output_shape` argument.
.format(self.name, input_shape))
/usr/local/lib/python3.6/dist-packages/keras/layers/core.py:665: UserWarning: `output_shape` argument not specified for layer odd_gather and cannot be automatically inferred with the Theano backend. Defaulting to output shape `[(None, 8, 6), (None, 1)]` (same as input shape). If the expected output shape is different, specify it via the `output_shape` argument.
.format(self.name, input_shape))
/usr/local/lib/python3.6/dist-packages/keras/layers/core.py:665: UserWarning: `output_shape` argument not specified for layer odd_loss and cannot be automatically inferred with the Theano backend. Defaulting to output shape `[(None, 1), (None, 1), (None, 8, 2)]` (same as input shape). If the expected output shape is different, specify it via the `output_shape` argument.
.format(self.name, input_shape))
/usr/local/lib/python3.6/dist-packages/keras/layers/core.py:665: UserWarning: `output_shape` argument not specified for layer lm_random_loss and cannot be automatically inferred with the Theano backend. Defaulting to output shape `[(None, 1), (None, 8), (None, 8, 25), (None, 8)]` (same as input shape). If the expected output shape is different, specify it via the `output_shape` argument.
.format(self.name, input_shape))
Epoch 1/100
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/theano/compile/function_module.py in __call__(self, *args, **kwargs)
902 outputs =\
--> 903 self.fn() if output_subset is None else\
904 self.fn(output_subset=output_subset)
/usr/local/lib/python3.6/dist-packages/theano/gof/op.py in rval(p, i, o, n)
891 def rval(p=p, i=node_input_storage, o=node_output_storage, n=node):
--> 892 r = p(n, [x[0] for x in i], o)
893 for o in node.outputs:
/usr/local/lib/python3.6/dist-packages/theano/tensor/subtensor.py in perform(self, node, inputs, out_)
2338 if self.set_instead_of_inc:
-> 2339 out[0][inputs[2:]] = inputs[1]
2340 else:
IndexError: index 8 is out of bounds for axis 1 with size 6
During handling of the above exception, another exception occurred:
IndexError Traceback (most recent call last)
<ipython-input-39-7b7276d2ce06> in <module>()
1 m = train_model(base_model=sequence_encoder, is_causal=False, tasks_meta_data=tasks, pretrain_generator=generator,
2 finetune_generator=generator, pretrain_epochs=100, pretrain_steps=number_of_pretrain_steps // 100,
----> 3 finetune_epochs=100, finetune_steps=number_of_finetune_steps // 100, verbose=2, TPUStrategy=strategy)
4 # now m is ready to be used!
5 print(m.inputs)
/content/bert_keras_repo/transformer/train.py in train_model(base_model, is_causal, tasks_meta_data, pretrain_generator, finetune_generator, pretrain_epochs, pretrain_optimizer, pretrain_steps, pretrain_callbacks, finetune_epochs, finetune_optimizer, finetune_steps, finetune_callbacks, verbose, TPUStrategy)
145
146 if pretrain_generator is not None:
--> 147 train_step(True)
148 if finetune_generator is not None:
149 train_step(False)
/content/bert_keras_repo/transformer/train.py in train_step(is_pretrain)
142 _model.fit_generator(_generator, steps_per_epoch=pretrain_steps if is_pretrain else finetune_steps,
143 verbose=verbose, callbacks=pretrain_callbacks if is_pretrain else finetune_callbacks,
--> 144 shuffle=False, epochs=pretrain_epochs if is_pretrain else finetune_epochs)
145
146 if pretrain_generator is not None:
/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py in wrapper(*args, **kwargs)
89 warnings.warn('Update your `' + object_name + '` call to the ' +
90 'Keras 2 API: ' + signature, stacklevel=2)
---> 91 return func(*args, **kwargs)
92 wrapper._original_function = func
93 return wrapper
/usr/local/lib/python3.6/dist-packages/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
1416 use_multiprocessing=use_multiprocessing,
1417 shuffle=shuffle,
-> 1418 initial_epoch=initial_epoch)
1419
1420 @interfaces.legacy_generator_methods_support
/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py in fit_generator(model, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
215 outs = model.train_on_batch(x, y,
216 sample_weight=sample_weight,
--> 217 class_weight=class_weight)
218
219 outs = to_list(outs)
/usr/local/lib/python3.6/dist-packages/keras/engine/training.py in train_on_batch(self, x, y, sample_weight, class_weight)
1215 ins = x + y + sample_weights
1216 self._make_train_function()
-> 1217 outputs = self.train_function(ins)
1218 return unpack_singleton(outputs)
1219
/usr/local/lib/python3.6/dist-packages/keras/backend/theano_backend.py in __call__(self, inputs)
1386 def __call__(self, inputs):
1387 assert isinstance(inputs, (list, tuple))
-> 1388 return self.function(*inputs)
1389
1390
/usr/local/lib/python3.6/dist-packages/theano/compile/function_module.py in __call__(self, *args, **kwargs)
915 node=self.fn.nodes[self.fn.position_of_error],
916 thunk=thunk,
--> 917 storage_map=getattr(self.fn, 'storage_map', None))
918 else:
919 # old-style linkers raise their own exceptions
/usr/local/lib/python3.6/dist-packages/theano/gof/link.py in raise_with_op(node, thunk, exc_info, storage_map)
323 # extra long error message in that case.
324 pass
--> 325 reraise(exc_type, exc_value, exc_trace)
326
327
/usr/local/lib/python3.6/dist-packages/six.py in reraise(tp, value, tb)
690 value = tp()
691 if value.__traceback__ is not tb:
--> 692 raise value.with_traceback(tb)
693 raise value
694 finally:
/usr/local/lib/python3.6/dist-packages/theano/compile/function_module.py in __call__(self, *args, **kwargs)
901 try:
902 outputs =\
--> 903 self.fn() if output_subset is None else\
904 self.fn(output_subset=output_subset)
905 except Exception:
/usr/local/lib/python3.6/dist-packages/theano/gof/op.py in rval(p, i, o, n)
890 # default arguments are stored in the closure of `rval`
891 def rval(p=p, i=node_input_storage, o=node_output_storage, n=node):
--> 892 r = p(n, [x[0] for x in i], o)
893 for o in node.outputs:
894 compute_map[o][0] = True
/usr/local/lib/python3.6/dist-packages/theano/tensor/subtensor.py in perform(self, node, inputs, out_)
2337
2338 if self.set_instead_of_inc:
-> 2339 out[0][inputs[2:]] = inputs[1]
2340 else:
2341 np.add.at(out[0], tuple(inputs[2:]), inputs[1])
IndexError: index 8 is out of bounds for axis 1 with size 6
Apply node that caused the error: AdvancedIncSubtensor{inplace=False, set_instead_of_inc=True}(Alloc.0, TensorConstant{1}, ARange{dtype='int64'}.0, Reshape{1}.0)
Toposort index: 315
Inputs types: [TensorType(float32, matrix), TensorType(int8, scalar), TensorType(int64, vector), TensorType(int32, vector)]
Inputs shapes: [(64, 6), (), (64,), (64,)]
Inputs strides: [(24, 4), (), (8,), (4,)]
Inputs values: ['not shown', array(1, dtype=int8), 'not shown', 'not shown']
Outputs clients: [[Reshape{3}(AdvancedIncSubtensor{inplace=False, set_instead_of_inc=True}.0, MakeVector{dtype='int64'}.0)]]
Backtrace when the node is created(use Theano flag traceback.limit=N to make it longer):
File "bert_keras_repo/transformer/train.py", line 68, in train_model
[task_loss_weight, task_target, logits, task_mask])
File "/usr/local/lib/python3.6/dist-packages/keras/engine/base_layer.py", line 457, in __call__
output = self.call(inputs, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/keras/layers/core.py", line 687, in call
return self.function(inputs, **arguments)
File "bert_keras_repo/transformer/train.py", line 67, in <lambda>
task_loss = Lambda(lambda x: x[0] * masked_classification_loss(x[1], x[2], x[3]), name=task.name + '_loss')(
File "bert_keras_repo/transformer/train.py", line 20, in masked_classification_loss
return _mask_loss(y_true, y_pred, y_mask, classification_loss)
File "bert_keras_repo/transformer/train.py", line 11, in _mask_loss
l = K.switch(y_mask, element_wise_loss(y_true, y_pred), K.zeros_like(y_mask, dtype=K.floatx()))
File "<ipython-input-22-27837df85ad1>", line 4, in classification_loss
return K.sparse_categorical_crossentropy(y_true, y_pred, from_logits=True)
File "/usr/local/lib/python3.6/dist-packages/keras/backend/theano_backend.py", line 1788, in sparse_categorical_crossentropy
target = T.extra_ops.to_one_hot(target, nb_class=output.shape[-1])
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
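For reference, the IndexError can be reproduced in plain numpy (with hypothetical values): the one-hot scatter performed by AdvancedIncSubtensor cannot place a class id of 8 into a row of width 6, i.e. the sparse targets contain ids larger than the last dimension of the logits they are paired with:
import numpy as np

targets = np.array([0, 3, 8])            # sparse class ids; 8 is too large
one_hot = np.zeros((len(targets), 6))    # the logits only have 6 classes
# the same scatter that to_one_hot / AdvancedIncSubtensor performs above
one_hot[np.arange(len(targets)), targets] = 1
# IndexError: index 8 is out of bounds for axis 1 with size 6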
It doesn't seem to be a bug in my code, because I checked out the branch from before the TPU support and got the same error.
I found something related to this: https://groups.google.com/forum/#!topic/keras-users/EhWwuq6R0lQ. I'm not familiar with Theano, so I don't know why it works on TensorFlow but not on Theano.
Yeah, I know; as I said in the README, I was unable to train the model with the Theano backend (I also checked CNTK, and I couldn't even run the model!).
Oh, I see. Maybe Theano support is not really necessary; hardly anyone uses Theano these days. I should have noticed that earlier. It seems I have done some useless work and should spend my time on something else. Will you spend your time on this?
TBH, I spent a day on this and by the end I just hated Keras (for allowing such issues) and myself! So no, I'm not going to waste any more time on this. Right now I'm changing the attention mechanism of BERT and trying to make it faster.
If you want to play with BERT and learn something (and help others), a good direction is to train a distilled version of BERT: maybe you can train a model that is only 8 layers deep with 16 heads per layer but has similar accuracy. Another idea you could try is to use an encoder other than the transformer; maybe a multilayer bidirectional QRNN could be used instead?
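For example, here is a minimal sketch of the soft-target part of a standard knowledge-distillation loss (not code from this repo; the names and the temperature value are illustrative):
import keras.backend as K

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # soften both distributions with the same temperature
    teacher_probs = K.softmax(teacher_logits / temperature)
    student_log_probs = K.log(K.softmax(student_logits / temperature))
    # cross-entropy between the softened teacher and the student
    return -K.sum(teacher_probs * student_log_probs, axis=-1)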
Oh, and thanks for making sure that the TPU version is correct and for checking backward compatibility :+1:
Thanks for your advice. BERT is really a large model for me, but I will try your suggestions. I wish you success with your new work.