
ValueError: Dimension 0 in both shapes must be equal, but are 1 and 0. Shapes are [1,12] and [0,12]. for 'model/transformer/layer_0/rel_attn/einsum_5/MatMul' (op: 'BatchMatMul') with input shapes: [1,12,512,64], [0,12,64,1408].

vanh17 opened this issue on Jul 17, 2019 · 4 comments

Hi, I ran train_gpu.py and it gives me this error:

I0717 05:53:04.857903 139941541631808 train_gpu.py:319] n_token 32000
I0717 05:53:04.858911 139941541631808 data_utils.py:795] Use the following tfrecord dirs: ['/net/vaosl01/opt/NFS/mimic_hv4/data/xlnetMimic/tfrecords']
I0717 05:53:04.859261 139941541631808 data_utils.py:799] [0] Record glob: /net/vaosl01/opt/NFS/mimic_hv4/data/xlnetMimic/tfrecords/record_info-train-*.bsz-8.seqlen-512.reuse-256.uncased.bi.alpha-6.beta-1.fnp-100.json
I0717 05:53:04.859811 139941541631808 data_utils.py:803] [0] Num of record info path: 1
I0717 05:53:04.860336 139941541631808 data_utils.py:836] [Dir 0] Number of chosen batches: 692053
I0717 05:53:04.860396 139941541631808 data_utils.py:838] [Dir 0] Number of chosen files: 1
I0717 05:53:04.860434 139941541631808 data_utils.py:839] ['/net/vaosl01/opt/NFS/mimic_hv4/data/xlnetMimic/tfrecords/train-0-0.bsz-8.seqlen-512.reuse-256.uncased.bi.alpha-6.beta-1.fnp-100.tfrecords']
I0717 05:53:04.860471 139941541631808 data_utils.py:846] Total number of batches: 692053
I0717 05:53:04.860699 139941541631808 data_utils.py:848] Total number of files: 1
I0717 05:53:04.860738 139941541631808 data_utils.py:849] ['/net/vaosl01/opt/NFS/mimic_hv4/data/xlnetMimic/tfrecords/train-0-0.bsz-8.seqlen-512.reuse-256.uncased.bi.alpha-6.beta-1.fnp-100.tfrecords']
I0717 05:53:04.860776 139941541631808 train_gpu.py:204] num of batches 692053
I0717 05:53:04.860822 139941541631808 data_utils.py:555] Host 0 handles 1 files
I0717 05:53:05.006074 139941541631808 data_utils.py:744] label: Tensor("Cast_6:0", shape=(1,), dtype=int32)
I0717 05:53:05.006270 139941541631808 data_utils.py:744] seg_id: Tensor("Cast_7:0", shape=(512,), dtype=int32)
I0717 05:53:05.006346 139941541631808 data_utils.py:744] target_mapping: Tensor("Reshape_4:0", shape=(100, 512), dtype=float32)
I0717 05:53:05.006417 139941541631808 data_utils.py:744] target: Tensor("Cast_8:0", shape=(100,), dtype=int32)
I0717 05:53:05.006478 139941541631808 data_utils.py:744] target_mask: Tensor("Reshape_6:0", shape=(100,), dtype=float32)
I0717 05:53:05.006537 139941541631808 data_utils.py:744] perm_mask: Tensor("Reshape_7:0", shape=(512, 512), dtype=float32)
I0717 05:53:05.006598 139941541631808 data_utils.py:744] input_k: Tensor("Cast_9:0", shape=(512,), dtype=int32)
I0717 05:53:05.006661 139941541631808 data_utils.py:744] input_q: Tensor("Reshape_9:0", shape=(512,), dtype=float32)
I0717 05:53:05.063619 139941541631808 modeling.py:454] memory input [<tf.Tensor 'Placeholder:0' shape=(384, 1, 768) dtype=float32>, <tf.Tensor 'Placeholder_1:0' shape=(384, 1, 768) dtype=float32>, <tf.Tensor 'Placeholder_2:0' shape=(384, 1, 768) dtype=float32>, <tf.Tensor 'Placeholder_3:0' shape=(384, 1, 768) dtype=float32>, <tf.Tensor 'Placeholder_4:0' shape=(384, 1, 768) dtype=float32>, <tf.Tensor 'Placeholder_5:0' shape=(384, 1, 768) dtype=float32>, <tf.Tensor 'Placeholder_6:0' shape=(384, 1, 768) dtype=float32>, <tf.Tensor 'Placeholder_7:0' shape=(384, 1, 768) dtype=float32>, <tf.Tensor 'Placeholder_8:0' shape=(384, 1, 768) dtype=float32>, <tf.Tensor 'Placeholder_9:0' shape=(384, 1, 768) dtype=float32>, <tf.Tensor 'Placeholder_10:0' shape=(384, 1, 768) dtype=float32>, <tf.Tensor 'Placeholder_11:0' shape=(384, 1, 768) dtype=float32>]
I0717 05:53:05.063748 139941541631808 modeling.py:456] Use float type <dtype: 'float32'>
W0717 05:53:05.069464 139941541631808 deprecation.py:323] From /net/vaosl01/opt/NFS/mimic_hv4/anaconda3/envs/henry/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
W0717 05:53:05.121577 139941541631808 deprecation.py:323] From /net/vaosl01/opt/NFS/hv4/xlnet-pretrain-mimic/xlnet/modeling.py:536: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
W0717 05:53:05.122740 139941541631808 deprecation.py:506] From /net/vaosl01/opt/NFS/mimic_hv4/anaconda3/envs/henry/lib/python3.6/site-packages/tensorflow/python/keras/layers/core.py:143: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
I0717 05:53:05.153150 139941541631808 modeling.py:236] *******************************************batch size before assertion fail Tensor("model/transformer/strided_slice:0", shape=(), dtype=int32, device=/gpu:0)**************************************************
Traceback (most recent call last):
  File "/net/vaosl01/opt/NFS/mimic_hv4/anaconda3/envs/henry/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1659, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension 0 in both shapes must be equal, but are 1 and 0. Shapes are [1,12] and [0,12]. for 'model/transformer/layer_0/rel_attn/einsum_5/MatMul' (op: 'BatchMatMul') with input shapes: [1,12,512,64], [0,12,64,1408].
ValueError: Dimension 0 in both shapes must be equal, but are 1 and 0. Shapes are [1,12] and [0,12]. for 'model/transformer/layer_0/rel_attn/einsum_5/MatMul' (op: 'BatchMatMul') with input shapes: [1,12,512,64], [0,12,64,1408].

— vanh17, Jul 17, 2019
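For context on what the error itself says: the failing op is a plain batched matmul whose two inputs disagree in the leading batch dimension (1 vs 0), i.e. one operand is an empty tensor. Below is a minimal sketch, assuming TF 1.x graph mode as in the traceback above, that reproduces the same shape-inference failure; note that newer TF versions lower tf.matmul to BatchMatMulV2, which broadcasts batch dimensions and may silently produce an empty result instead of raising.

import tensorflow as tf

# Shapes copied from the error message:
#   [1, 12, 512, 64]  -> (batch, n_head, q_len, d_head)
#   [0, 12, 64, 1408] -> (batch, n_head, d_head, pos_len), empty batch dim
q = tf.zeros([1, 12, 512, 64])
k = tf.zeros([0, 12, 64, 1408])

# On 4-D inputs tf.matmul lowers to BatchMatMul, whose shape inference in
# TF 1.13 requires the leading (batch) dimensions to match exactly, so this
# fails at graph-construction time with "Dimension 0 in both shapes must be
# equal, but are 1 and 0".
attn = tf.matmul(q, k)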

Hi,

I still have this error when I run the pretraining. Is it possible that I missed something? Thank you!

— vanh17, Jul 22, 2019

@vanh17 I got this error when pretraining xlnet-base from scratch on another language, after changing a parameter in the config file (I forget which one).

— lsq357, Jul 31, 2019

@lsq357 @vanh17 Hi, I have the same issue on another language and I'm having a very hard time finding a solution.
@lsq357 please let me know which file you changed. If you forgot, just list the files you remember touching.

Also, if I find the cause, I will post it in this thread.

Hope you leave a message here.

— sayduke, Aug 08, 2019

@vanh17 Hi,

I had the same issue. The commands I used to create the tfrecords and to train were as follows:

python data_utils.py \
  --bsz_per_host=8 \
  --num_core_per_host=4 \
  --seq_len=512 \
  --reuse_len=256 \
  --input_glob=../preprocessed_thaiwikitext/*.txt \
  --save_dir=../tf_record_out \
  --num_passes=20 \
  --bi_data=True \
  --sp_path=../thwiki.model \
  --mask_alpha=6 \
  --mask_beta=1 \
  --num_predict=85

for making tfrecord and

python train_gpu.py \
  --record_info_dir=../tf_record_out/tfrecords \
  --model_dir=../model_dir \
  --uncased=True \
  --train_batch_size=8 \
  --seq_len=512 \
  --reuse_len=256 \
  --mem_len=384 \
  --perm_size=256 \
  --n_layer=6 \
  --d_model=768 \
  --d_embed=768 \
  --n_head=16 \
  --d_head=64 \
  --d_inner=2048 \
  --mask_alpha=6 \
  --mask_beta=1 \
  --num_predict=85

for training. I think it may have something to do with the batch size being too small, since the failing op is a 'BatchMatMul' and one of its inputs has a batch dimension of 0. So I tried making the batch size bigger and compensated by lowering seq_len so everything still fits on the Colab GPU. So now I have:

python data_utils.py \
  --bsz_per_host=32 \
  --num_core_per_host=8 \
  --seq_len=128 \
  --reuse_len=64 \
  --input_glob=../preprocessed_thaiwikitext/*.txt \
  --save_dir=../tf_record_out \
  --num_passes=20 \
  --bi_data=True \
  --sp_path=../thaiwiki.model \
  --mask_alpha=6 \
  --mask_beta=1 \
  --num_predict=85

for making tfrecord and

python train_gpu.py \
  --record_info_dir=../tf_record_out/tfrecords \
  --model_dir=../model_dir \
  --uncased=True \
  --train_batch_size=32 \
  --seq_len=128 \
  --reuse_len=64 \
  --mem_len=384 \
  --perm_size=64 \
  --n_layer=6 \
  --d_model=768 \
  --d_embed=768 \
  --n_head=8 \
  --d_head=64 \
  --d_inner=2048 \
  --mask_alpha=6 \
  --mask_beta=1 \
  --num_predict=85 \
  --save_steps=5000

for training. Now it's been training for 30 mins without any errors.

The TensorFlow version is 1.13.1.

— sumethy, Sep 16, 2019
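A plausible mechanism behind the zero batch dimension, based on reading the stock modeling.py rather than anything confirmed in this thread: with bi_data=True, relative_positional_encoding() builds the positional embedding with batch dimension bsz // 2, where bsz is the per-core batch size. train_batch_size=8 spread over the default num_core_per_host (8, if I read train_gpu.py correctly) gives a per-core batch of 1, and 1 // 2 == 0 yields exactly the [0, 12, 64, 1408] operand in the error. That would also explain why raising train_batch_size to 32 fixed it. A hedged sanity check along those lines (check_batch_config is a hypothetical helper, not part of the repo):

# Hypothetical helper: verifies the per-core batch size is compatible with
# bi_data=True, under the assumption about modeling.py described above.
def check_batch_config(train_batch_size, num_core_per_host, bi_data=True):
    assert train_batch_size % num_core_per_host == 0, (
        "train_batch_size must be divisible by num_core_per_host")
    per_core = train_batch_size // num_core_per_host
    if bi_data:
        # The positional embedding is built with batch dim per_core // 2,
        # so the per-core batch must be even (and hence >= 2) to avoid a
        # zero dimension.
        assert per_core % 2 == 0, (
            "with bi_data=True the per-core batch size must be even, "
            "got %d" % per_core)
    return per_core

check_batch_config(32, 8)  # OK: per-core batch of 4
check_batch_config(8, 8)   # AssertionError: per-core batch of 1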