xlnet
Cache problem during pretraining
During pretraining, the following warning occurs right after a checkpoint is saved.
I0712 06:47:22.892611 140596004366080 tf_logging.py:115] [99000] | gnorm 0.71 lr 0.000001 | loss 7.25 | pplx 1408.25, bpc 10.4597
I0712 07:13:05.624328 140596004366080 tf_logging.py:115] [100000] | gnorm 1.03 lr 0.000000 | loss 7.25 | pplx 1406.88, bpc 10.4583
I0712 07:13:34.885596 140596004366080 tf_logging.py:115] Model saved in path: /home/xlnet_exam/models_wiki_ja/model.ckpt
2019-07-12 07:13:34.961923: W tensorflow/core/kernels/data/cache_dataset_ops.cc:770] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
In data_utils.py, the dataset caching order seems to match the pattern mentioned in the warning above (though a comment in the code explains why it is written this way):
...
# (zihang): since we are doing online preprocessing, the parsed result of
# the same input at each time will be different. Thus, cache processed data
# is not helpful. It will use a lot of memory and lead to container OOM.
# So, change to cache non-parsed raw data instead.
dataset = dataset.cache().map(parser).repeat()
dataset = dataset.batch(bsz_per_core, drop_remainder=True)
dataset = dataset.prefetch(num_core_per_host * bsz_per_core)
...
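For what it's worth, the warning itself seems easy to reproduce outside XLNet whenever an iterator over a cache()'d dataset is thrown away before one full pass has been read, which is roughly what happens when training stops at a checkpoint boundary. A minimal sketch of my own (TF 1.x style, not from the repo):

import tensorflow as tf

# Cache 100 elements, but read only 10 before the iterator is destroyed.
dataset = tf.data.Dataset.range(100).cache().repeat()
iterator = dataset.make_one_shot_iterator()
next_elem = iterator.get_next()

with tf.Session() as sess:
    for _ in range(10):
        sess.run(next_elem)
# Tearing down the session here discards the partially filled cache,
# which is what triggers the cache_dataset_ops warning shown above.

If that reading is right, the warning comes from discarding a partially filled cache when the iterator goes away, not from the cache()/map() ordering itself.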
My environment:
GPU: Tesla V100 32GB *4
CUDA_VERSION: 9.0.176
TENSORFLOW_VERSION: 1.11.0
My pretraining command:
python train_gpu.py \
--record_info_dir=${TFRECORD_DIR} \
--num_core_per_host=1 \
--train_batch_size=4 \
--save_steps=10000 \
--model_dir=${MODEL_DIR} \
--seq_len=512 \
--reuse_len=256 \
--mem_len=384 \
--perm_size=256 \
--n_layer=24 \
--d_model=1024 \
--d_embed=1024 \
--n_head=16 \
--d_head=64 \
--d_inner=4096 \
--untie_r=True \
--mask_alpha=6 \
--mask_beta=1 \
--num_predict=85 \
--uncased=True
Is this okay, or is it a problem?
Maybe you should try a newer TF. The official implementation used TensorFlow 1.13.1.
@ymcui Thanks a lot! I'll try with TensorFlow 1.13.1.
I tried with TensorFlow 1.13.1 and 1.4.0, but it still occurs, and the output files are broken.
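To narrow down whether the checkpoint files themselves are damaged (as opposed to the warning just being noisy), one quick check is to list the variables in the latest checkpoint. This is only a sketch, with model_dir assumed from the log path above:

import tensorflow as tf

# Assumed: model_dir is the directory shown in the "Model saved in path" log line.
model_dir = "/home/xlnet_exam/models_wiki_ja"
ckpt = tf.train.latest_checkpoint(model_dir)
print("latest checkpoint:", ckpt)
# list_variables raises an error if the checkpoint files are truncated or corrupted.
for name, shape in tf.train.list_variables(ckpt)[:10]:
    print(name, shape)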
Same problem. :(
Same problem... Did you solve it?