AssertionError when pretraining XLNet with train_gpu.py
python train_gpu.py --corpus_info_path=G:/XLNetData/tftest/corpus_info.json --record_info_dir="G:/XLNetData/tftest/tfrecords" --model_dir="" --train_batch_size=8 --seq_len=128 --reuse_len=64 --mem_len=96 --perm_size=32 --n_layer=6 --d_model=1024 --d_embed=1024 --n_head=16 --d_head=64 --d_inner=4096 --untie_r=True --mask_alpha=6 --mask_beta=1 --num_predict=21 --uncased=true --num_hosts=1 --num_core_per_host=1
I got this error immediately:
I0730 15:41:08.028456 8780 tf_logging.py:115] n_token 32000
I0730 15:41:08.029418 8780 tf_logging.py:115] Use the following tfrecord dirs: ['G:/XLNetData/tftest/tfrecords']
I0730 15:41:08.030418 8780 tf_logging.py:115] [0] Record glob: G:/XLNetData/tftest/tfrecords\record_info-train-*.bsz-8.seqlen-128.reuse-64.uncased.bi.alpha-6.beta-1.fnp-21.json
I0730 15:41:08.033409 8780 tf_logging.py:115] [0] Num of record info path: 1
I0730 15:41:08.034406 8780 tf_logging.py:115] [Dir 0] Number of chosen batches: 97922
I0730 15:41:08.034406 8780 tf_logging.py:115] [Dir 0] Number of chosen files: 1
I0730 15:41:08.034406 8780 tf_logging.py:115] ['G:/XLNetData/tftest/tfrecords\train-0-0.bsz-8.seqlen-128.reuse-64.uncased.bi.alpha-6.beta-1.fnp-21.tfrecords']
I0730 15:41:08.034406 8780 tf_logging.py:115] Total number of batches: 97922
I0730 15:41:08.035402 8780 tf_logging.py:115] Total number of files: 1
I0730 15:41:08.035402 8780 tf_logging.py:115] ['G:/XLNetData/tftest/tfrecords\train-0-0.bsz-8.seqlen-128.reuse-64.uncased.bi.alpha-6.beta-1.fnp-21.tfrecords']
I0730 15:41:08.035402 8780 tf_logging.py:115] num of batches 97922
I0730 15:41:08.035402 8780 tf_logging.py:115] Host 0 handles 1 files
I0730 15:41:08.245867 8780 tf_logging.py:115] label: Tensor("Cast_6:0", shape=(1,), dtype=int32)
I0730 15:41:08.245867 8780 tf_logging.py:115] seg_id: Tensor("Cast_7:0", shape=(128,), dtype=int32)
I0730 15:41:08.247835 8780 tf_logging.py:115] target_mapping: Tensor("Reshape_4:0", shape=(21, 128), dtype=float32)
I0730 15:41:08.248839 8780 tf_logging.py:115] target: Tensor("Cast_8:0", shape=(21,), dtype=int32)
I0730 15:41:08.257808 8780 tf_logging.py:115] target_mask: Tensor("Reshape_6:0", shape=(21,), dtype=float32)
I0730 15:41:08.259802 8780 tf_logging.py:115] perm_mask: Tensor("Reshape_7:0", shape=(128, 128), dtype=float32)
I0730 15:41:08.266821 8780 tf_logging.py:115] input_k: Tensor("Cast_9:0", shape=(128,), dtype=int32)
I0730 15:41:08.270774 8780 tf_logging.py:115] input_q: Tensor("Reshape_9:0", shape=(128,), dtype=float32)
I0730 15:41:08.414390 8780 tf_logging.py:115] memory input [<tf.Tensor 'Placeholder:0' shape=(96, 8, 1024) dtype=float32>, <tf.Tensor 'Placeholder_1:0' shape=(96, 8, 1024) dtype=float32>, <tf.Tensor 'Placeholder_2:0' shape=(96, 8, 1024) dtype=float32>, <tf.Tensor 'Placeholder_3:0' shape=(96, 8, 1024) dtype=float32>, <tf.Tensor 'Placeholder_4:0' shape=(96, 8, 1024) dtype=float32>, <tf.Tensor 'Placeholder_5:0' shape=(96, 8, 1024) dtype=float32>]
I0730 15:41:08.414390 8780 tf_logging.py:115] Use float type <dtype: 'float32'>
Traceback (most recent call last):
File "train_gpu.py", line 328, in
What should I do? Thank you!
Can you print out inp_k (and its shape) somewhere and post the output? Maybe in the two_stream_loss function on line 44 of function_builder.py?
@brendanxwhitaker
I added print(inp_k) and print(inp_k.shape) in the two_stream_loss function, but nothing was printed. All I got was this output:
I0730 15:41:08.266821 8780 tf_logging.py:115] input_k: Tensor("Cast_9:0", shape=(128,), dtype=int32)
I0730 15:41:08.270774 8780 tf_logging.py:115] input_q: Tensor("Reshape_9:0", shape=(128,), dtype=float32)
That’s interesting. Sorry, I probably should have specified, but you might want to print the evaluated tensor, or cast it to a numpy array so it prints nicely; something like the sketch below should work in graph mode.
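A sketch (the helper name and placement are mine, not the repo's): in TF 1.x graph mode a plain print() shows only the symbolic Tensor, so you need an op that prints at run time.

    import tensorflow as tf

    def debug_print(inp_k):
        # tf.Print is an identity op with a printing side effect (TF 1.x);
        # the returned tensor must be consumed downstream, or the print
        # never fires when the graph runs.
        return tf.Print(inp_k,
                        [tf.shape(inp_k), inp_k],
                        message="inp_k shape / values: ",
                        summarize=16)

Call it as inp_k = debug_print(inp_k) right where inp_k is unpacked in two_stream_loss, and the shape and values should show up on stderr during training.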
Is that function being called at all? Does a generic print statement execute? I was following your stack trace and I thought I saw a call to that function.
I’m trying to figure out if something funky is happening when the inputs are reshaped after being grabbed from the feature dict. I think they’re flattened before being written to the tfrecords, and then unflattened again for training. The unflattened shape of inp_k is used to set bsz, which would somehow have to be odd for that assert statement to fail (if it’s an integer).
But it doesn’t throw an error? So the variable exists. Perhaps it’s None.
When I was using TF 1.4 and Python 3.7, I ran into the same issue and fixed it as follows: in modeling.py, line 470, change "bsz = tf.shape(inp_k)[1]" to "bsz = inp_k.get_shape()[1]".
And it works.
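For reference, the edit in context (a sketch; verify the exact line number in your copy of modeling.py):

    # modeling.py, around line 470

    # before: tf.shape() returns the dynamic shape as a Tensor, so the
    # later "assert bsz % 2 == 0" compares a Tensor against 0 and fails
    bsz = tf.shape(inp_k)[1]

    # after: get_shape() returns the static shape, which is a concrete
    # dimension when the batch size is fixed at graph-construction time
    bsz = inp_k.get_shape()[1]

Note this only helps when the batch dimension is statically known; with a fully dynamic batch size, get_shape()[1] would be None and the assert would still misbehave.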
@ft3020997 Following your method, it works! Thank you very much.
You are getting this error because the assertion is not implemented properly. bsz in relative_positional_encoding is inferred from the shape of the input, which makes it a tensor; bsz % 2 is then itself a tensor, and comparing it to 0 in a Python assert is always False, which is why the assertion fails every single time. You can either delete the assertion, or rewrite it using tf.assert_equal and tf.control_dependencies. If you do delete it, make sure the per-core batch size is divisible by 2, or strange errors will start to show up.
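If you'd rather keep the check, a graph-mode rewrite might look like this (a sketch: bsz is assumed to be the batch-size tensor inside relative_positional_encoding, pos_emb the tensor it returns, and the helper name is mine):

    import tensorflow as tf

    def with_even_bsz_check(bsz, pos_emb):
        # Raises tf.errors.InvalidArgumentError at run time if bsz is odd,
        # unlike a Python assert, which can never pass on a tensor.
        assert_op = tf.assert_equal(bsz % 2, 0,
                                    message="per-core batch size must be even")
        # Attach the check to pos_emb so it isn't pruned from the graph.
        with tf.control_dependencies([assert_op]):
            return tf.identity(pos_emb)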