ValueError: Dimension 0 in both shapes must be equal, but are 1 and 0. Shapes are [1,12] and [0,12]. for 'model/transformer/layer_0/rel_attn/einsum_5/MatMul' (op: 'BatchMatMul') with input shapes: [1,12,512,64], [0,12,64,1408].
Hi, I ran the code in train_gpu.py and it gives me this error:
I0717 05:53:04.857903 139941541631808 train_gpu.py:319] n_token 32000
I0717 05:53:04.858911 139941541631808 data_utils.py:795] Use the following tfrecord dirs: ['/net/vaosl01/opt/NFS/mimic_hv4/data/xlnetMimic/tfrecords']
I0717 05:53:04.859261 139941541631808 data_utils.py:799] [0] Record glob: /net/vaosl01/opt/NFS/mimic_hv4/data/xlnetMimic/tfrecords/record_info-train-*.bsz-8.seqlen-512.reuse-256.uncased.bi.alpha-6.beta-1.fnp-100.json
I0717 05:53:04.859811 139941541631808 data_utils.py:803] [0] Num of record info path: 1
I0717 05:53:04.860336 139941541631808 data_utils.py:836] [Dir 0] Number of chosen batches: 692053
I0717 05:53:04.860396 139941541631808 data_utils.py:838] [Dir 0] Number of chosen files: 1
I0717 05:53:04.860434 139941541631808 data_utils.py:839] ['/net/vaosl01/opt/NFS/mimic_hv4/data/xlnetMimic/tfrecords/train-0-0.bsz-8.seqlen-512.reuse-256.uncased.bi.alpha-6.beta-1.fnp-100.tfrecords']
I0717 05:53:04.860471 139941541631808 data_utils.py:846] Total number of batches: 692053
I0717 05:53:04.860699 139941541631808 data_utils.py:848] Total number of files: 1
I0717 05:53:04.860738 139941541631808 data_utils.py:849] ['/net/vaosl01/opt/NFS/mimic_hv4/data/xlnetMimic/tfrecords/train-0-0.bsz-8.seqlen-512.reuse-256.uncased.bi.alpha-6.beta-1.fnp-100.tfrecords']
I0717 05:53:04.860776 139941541631808 train_gpu.py:204] num of batches 692053
I0717 05:53:04.860822 139941541631808 data_utils.py:555] Host 0 handles 1 files
I0717 05:53:05.006074 139941541631808 data_utils.py:744] label: Tensor("Cast_6:0", shape=(1,), dtype=int32)
I0717 05:53:05.006270 139941541631808 data_utils.py:744] seg_id: Tensor("Cast_7:0", shape=(512,), dtype=int32)
I0717 05:53:05.006346 139941541631808 data_utils.py:744] target_mapping: Tensor("Reshape_4:0", shape=(100, 512), dtype=float32)
I0717 05:53:05.006417 139941541631808 data_utils.py:744] target: Tensor("Cast_8:0", shape=(100,), dtype=int32)
I0717 05:53:05.006478 139941541631808 data_utils.py:744] target_mask: Tensor("Reshape_6:0", shape=(100,), dtype=float32)
I0717 05:53:05.006537 139941541631808 data_utils.py:744] perm_mask: Tensor("Reshape_7:0", shape=(512, 512), dtype=float32)
I0717 05:53:05.006598 139941541631808 data_utils.py:744] input_k: Tensor("Cast_9:0", shape=(512,), dtype=int32)
I0717 05:53:05.006661 139941541631808 data_utils.py:744] input_q: Tensor("Reshape_9:0", shape=(512,), dtype=float32)
I0717 05:53:05.063619 139941541631808 modeling.py:454] memory input [<tf.Tensor 'Placeholder:0' shape=(384, 1, 768) dtype=float32>, <tf.Tensor 'Placeholder_1:0' shape=(384, 1, 768) dtype=float32>, <tf.Tensor 'Placeholder_2:0' shape=(384, 1, 768) dtype=float32>, <tf.Tensor 'Placeholder_3:0' shape=(384, 1, 768) dtype=float32>, <tf.Tensor 'Placeholder_4:0' shape=(384, 1, 768) dtype=float32>, <tf.Tensor 'Placeholder_5:0' shape=(384, 1, 768) dtype=float32>, <tf.Tensor 'Placeholder_6:0' shape=(384, 1, 768) dtype=float32>, <tf.Tensor 'Placeholder_7:0' shape=(384, 1, 768) dtype=float32>, <tf.Tensor 'Placeholder_8:0' shape=(384, 1, 768) dtype=float32>, <tf.Tensor 'Placeholder_9:0' shape=(384, 1, 768) dtype=float32>, <tf.Tensor 'Placeholder_10:0' shape=(384, 1, 768) dtype=float32>, <tf.Tensor 'Placeholder_11:0' shape=(384, 1, 768) dtype=float32>]
I0717 05:53:05.063748 139941541631808 modeling.py:456] Use float type <dtype: 'float32'>
W0717 05:53:05.069464 139941541631808 deprecation.py:323] From /net/vaosl01/opt/NFS/mimic_hv4/anaconda3/envs/henry/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
W0717 05:53:05.121577 139941541631808 deprecation.py:323] From /net/vaosl01/opt/NFS/hv4/xlnet-pretrain-mimic/xlnet/modeling.py:536: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
W0717 05:53:05.122740 139941541631808 deprecation.py:506] From /net/vaosl01/opt/NFS/mimic_hv4/anaconda3/envs/henry/lib/python3.6/site-packages/tensorflow/python/keras/layers/core.py:143: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
I0717 05:53:05.153150 139941541631808 modeling.py:236] *******************************************batch size before assertion fail Tensor("model/transformer/strided_slice:0", shape=(), dtype=int32, device=/gpu:0)**************************************************
Traceback (most recent call last):
File "/net/vaosl01/opt/NFS/mimic_hv4/anaconda3/envs/henry/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1659, in _create_c_op
c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension 0 in both shapes must be equal, but are 1 and 0. Shapes are [1,12] and [0,12]. for 'model/transformer/layer_0/rel_attn/einsum_5/MatMul' (op: 'BatchMatMul') with input shapes: [1,12,512,64], [0,12,64,1408].
ValueError: Dimension 0 in both shapes must be equal, but are 1 and 0. Shapes are [1,12] and [0,12]. for 'model/transformer/layer_0/rel_attn/einsum_5/MatMul' (op: 'BatchMatMul') with input shapes: [1,12,512,64], [0,12,64,1408].
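For what it's worth, BatchMatMul in TF 1.x does not broadcast the leading batch dimensions, so a zero-sized batch on one operand fails shape inference at graph-build time. A minimal plain-Python sketch of the check that is failing (my own mock of the shape-inference rule, not the actual TF code):

```python
def check_batch_matmul_shapes(a_shape, b_shape):
    """Mimic TF 1.x BatchMatMul shape inference: the leading (batch)
    dims must match exactly, and the contraction dims must agree."""
    *a_batch, a_rows, a_inner = a_shape
    *b_batch, b_inner, b_cols = b_shape
    if a_batch != b_batch:
        raise ValueError(
            f"Dimension 0 in both shapes must be equal, but are "
            f"{a_batch[0]} and {b_batch[0]}. "
            f"Shapes are {a_batch} and {b_batch}.")
    if a_inner != b_inner:
        raise ValueError("inner contraction dimensions must match")
    return a_batch + [a_rows, b_cols]

# The shapes from the traceback: the second operand has batch dim 0,
# i.e. one of the attention streams received an empty batch.
try:
    check_batch_matmul_shapes([1, 12, 512, 64], [0, 12, 64, 1408])
except ValueError as e:
    print(e)  # the same complaint as in the traceback above
```

So the error is not about the attention math itself; something upstream produced a tensor whose batch dimension is 0.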
Hi,
I still get this error when I run the pretraining. Is it possible that I missed something? Thank you!
@vanh17 I had this error when I ran xlnet-base pretraining
from scratch on another language and had changed a parameter in the config file (I forget which one).
@lsq357 @vanh17
Hi,
I have the same issue on another language and I'm having a very hard time finding a solution.
@lsq357 please let me know which file you changed. If you forgot, just list the files you remember touching.
Also, if I find the cause, I will post it in this thread.
Hope you leave a message here.
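One possible cause worth checking (this is a guess on my part from reading train_gpu.py; treat the flag names and the splitting rule as assumptions): the global batch is split across num_core_per_host by integer division, so if train_batch_size is smaller than, or not divisible by, num_core_per_host, a core can end up with a zero-sized batch, which would produce exactly a [0, ...] shape inside the attention matmul.

```python
def bsz_per_core(train_batch_size, num_core_per_host):
    # Integer split of the global batch across cores; any remainder is
    # silently dropped, and a batch smaller than the core count yields 0.
    return train_batch_size // num_core_per_host

print(bsz_per_core(8, 4))   # 2 examples per core: fine
print(bsz_per_core(8, 16))  # 0 examples per core: empty tensors downstream
```

If this is the cause, either raise train_batch_size or lower num_core_per_host so the division comes out to at least 1 with no remainder.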
@vanh17 Hi,
I had the same issue. The commands I used to create the tfrecords and to train were as follows.
python data_utils.py \
--bsz_per_host=8 \
--num_core_per_host=4 \
--seq_len=512 \
--reuse_len=256 \
--input_glob=../preprocessed_thaiwikitext/*.txt \
--save_dir=../tf_record_out \
--num_passes=20 \
--bi_data=True \
--sp_path=../thwiki.model \
--mask_alpha=6 \
--mask_beta=1 \
--num_predict=85
for making the tfrecords, and
python train_gpu.py \
--record_info_dir=../tf_record_out/tfrecords \
--model_dir=../model_dir \
--uncased=True \
--train_batch_size=8 \
--seq_len=512 \
--reuse_len=256 \
--mem_len=384 \
--perm_size=256 \
--n_layer=6 \
--d_model=768 \
--d_embed=768 \
--n_head=16 \
--d_head=64 \
--d_inner=2048 \
--mask_alpha=6 \
--mask_beta=1 \
--num_predict=85
for training. I think it may have something to do with the batch size being too small, because of the BatchMatMul op in the error, so I tried making the batch size bigger and compensated by lowering seq_len to fit on the Colab GPU. So now I have
python data_utils.py \
--bsz_per_host=32 \
--num_core_per_host=8 \
--seq_len=128 \
--reuse_len=64 \
--input_glob=../preprocessed_thaiwikitext/*.txt \
--save_dir=../tf_record_out \
--num_passes=20 \
--bi_data=True \
--sp_path=../thaiwiki.model \
--mask_alpha=6 \
--mask_beta=1 \
--num_predict=85
for making the tfrecords, and
python train_gpu.py \
--record_info_dir=../tf_record_out/tfrecords \
--model_dir=../model_dir \
--uncased=True \
--train_batch_size=32 \
--seq_len=128 \
--reuse_len=64 \
--mem_len=384 \
--perm_size=64 \
--n_layer=6 \
--d_model=768 \
--d_embed=768 \
--n_head=8 \
--d_head=64 \
--d_inner=2048 \
--mask_alpha=6 \
--mask_beta=1 \
--num_predict=85 \
--save_steps=5000
for training. It has now been training for 30 minutes without any errors.
The TensorFlow version is 1.13.1.
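In case it helps others, here is a small sanity-check sketch of the consistency rules I believe have to hold between the data_utils.py run and the train_gpu.py run (this rule set is my own reading of the code, not an official list): seq_len, reuse_len, mask_alpha, mask_beta and num_predict must match the values baked into the tfrecords, train_batch_size should match bsz_per_host and be divisible by num_core_per_host, and perm_size should not exceed reuse_len.

```python
def check_flags(data, train):
    """Return a list of inconsistencies between the flags used to build
    the tfrecords (data) and the flags passed to train_gpu.py (train)."""
    errors = []
    for key in ("seq_len", "reuse_len", "mask_alpha", "mask_beta", "num_predict"):
        if data[key] != train[key]:
            errors.append(f"{key}: tfrecords built with {data[key]}, "
                          f"training with {train[key]}")
    if data["bsz_per_host"] != train["train_batch_size"]:
        errors.append("train_batch_size should match the bsz_per_host "
                      "used to build the tfrecords")
    if train["train_batch_size"] % train["num_core_per_host"] != 0:
        errors.append("train_batch_size must be divisible by num_core_per_host")
    if train["perm_size"] > train["reuse_len"]:
        errors.append("perm_size should not exceed reuse_len")
    return errors

# The second (working) configuration from this comment; num_core_per_host=8
# is assumed here since the train command leaves it at its default.
data = dict(bsz_per_host=32, seq_len=128, reuse_len=64,
            mask_alpha=6, mask_beta=1, num_predict=85)
train = dict(train_batch_size=32, num_core_per_host=8, seq_len=128,
             reuse_len=64, perm_size=64, mask_alpha=6, mask_beta=1,
             num_predict=85)
print(check_flags(data, train))  # [] – consistent
```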