xlnet
Cache problem during pretraining
During pretraining, the following warning occurs right after a checkpoint is saved.
I0712 06:47:22.892611 140596004366080 tf_logging.py:115] [99000] | gnorm 0.71 lr 0.000001 | loss 7.25 | pplx 1408.25, bpc 10.4597
I0712 07:13:05.624328 140596004366080 tf_logging.py:115] [100000] | gnorm 1.03 lr 0.000000 | loss 7.25 | pplx 1406.88, bpc 10.4583
I0712 07:13:34.885596 140596004366080 tf_logging.py:115] Model saved in path: /home/xlnet_exam/models_wiki_ja/model.ckpt
2019-07-12 07:13:34.961923: W tensorflow/core/kernels/data/cache_dataset_ops.cc:770] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
In data_utils.py, the dataset caching order seems to match the pattern mentioned in the warning above (though a comment in the code explains why it is written this way):
...
# (zihang): since we are doing online preprocessing, the parsed result of
# the same input at each time will be different. Thus, cache processed data
# is not helpful. It will use a lot of memory and lead to container OOM.
# So, change to cache non-parsed raw data instead.
dataset = dataset.cache().map(parser).repeat()
dataset = dataset.batch(bsz_per_core, drop_remainder=True)
dataset = dataset.prefetch(num_core_per_host * bsz_per_core)
...
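For what it's worth, the warning itself seems easy to reproduce outside XLNet whenever an iterator over a cache()'d dataset is thrown away before one full pass has been read, which is roughly what happens when training stops at a checkpoint boundary. A minimal sketch of my own (TF 1.x style, not from the repo):

import tensorflow as tf

# Cache 100 elements, but read only 10 before the iterator is destroyed.
dataset = tf.data.Dataset.range(100).cache().repeat()
iterator = dataset.make_one_shot_iterator()
next_elem = iterator.get_next()

with tf.Session() as sess:
    for _ in range(10):
        sess.run(next_elem)
# Tearing down the session here discards the partially filled cache,
# which is what triggers the cache_dataset_ops warning shown above.

If that reading is right, the warning comes from discarding a partially filled cache when the iterator goes away, not from the cache()/map() ordering itself.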
My environment:
GPU: Tesla V100 32GB *4
CUDA_VERSION: 9.0.176
TENSORFLOW_VERSION: 1.11.0
My pretraining command:
python train_gpu.py \
--record_info_dir=${TFRECORD_DIR} \
--num_core_per_host=1 \
--train_batch_size=4 \
--save_steps=10000 \
--model_dir=${MODEL_DIR} \
--seq_len=512 \
--reuse_len=256 \
--mem_len=384 \
--perm_size=256 \
--n_layer=24 \
--d_model=1024 \
--d_embed=1024 \
--n_head=16 \
--d_head=64 \
--d_inner=4096 \
--untie_r=True \
--mask_alpha=6 \
--mask_beta=1 \
--num_predict=85 \
--uncased=True
Is this okay, or is it a problem?
Maybe you should try a newer TF. The official implementation used TensorFlow 1.13.1.
@ymcui Thanks a lot! I'll try with TensorFlow 1.13.1.
I tried with TensorFlow 1.13.1 and 1.4.0, but it still occurs, and the output files are broken.
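To narrow down whether the checkpoint files themselves are damaged (as opposed to the warning just being noisy), one quick check is to list the variables in the latest checkpoint. This is only a sketch, with model_dir assumed from the log path above:

import tensorflow as tf

# Assumed: model_dir is the directory shown in the "Model saved in path" log line.
model_dir = "/home/xlnet_exam/models_wiki_ja"
ckpt = tf.train.latest_checkpoint(model_dir)
print("latest checkpoint:", ckpt)
# list_variables raises an error if the checkpoint files are truncated or corrupted.
for name, shape in tf.train.list_variables(ckpt)[:10]:
    print(name, shape)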
Same problem. :(
Same problem... Did you solve it?