tensor2tensor

When following Magenta's Score2Perf README, the checkpoint is missing some keys

Open heyzude opened this issue 4 years ago • 2 comments

Environment: Ubuntu 18.04, Python 3.7.9, TensorFlow 2.3.1

I am following https://github.com/magenta/magenta/blob/master/magenta/models/score2perf/README.md. The problem occurs in the "Training" and "Sampling from the model" parts.

The training command given in the README is as follows.

DATA_DIR=/generated/tfrecords/dir
HPARAMS_SET=score2perf_transformer_base
MODEL=transformer
PROBLEM=score2perf_maestro_language_uncropped_aug
TRAIN_DIR=/training/dir

HPARAMS=\
"label_smoothing=0.0,"\
"max_length=0,"\
"max_target_seq_length=2048"

t2t_trainer \
  --data_dir="${DATA_DIR}" \
  --hparams=${HPARAMS} \
  --hparams_set=${HPARAMS_SET} \
  --model=${MODEL} \
  --output_dir=${TRAIN_DIR} \
  --problem=${PROBLEM} \
  --train_steps=1000000

When I run the training command exactly as the README says, I get the error below. It appears after 1000 epochs of training, when the Python script tries to load the 1000-epoch checkpoint and run evaluation.

Not found: Key transformer/parallel_0_3/transformer/transformer/body/decoder/layer_0/self_attention/multihead_attention/k/kernel not found in checkpoint
[[node save/RestoreV2_1 (defined at /.pyenv/versions/3.7.9/envs/tensor2tensor/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py:629) ]]
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

However, when I run the training command again, it surprisingly succeeds in loading the checkpoint, resumes training from epoch 1000, and saves the 2000-epoch weights. But when it then loads the 2000-epoch checkpoint and tries to run evaluation, it fails again.

Inference (sampling from the model) simply fails.
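One way to narrow this down might be to compare the key named in the error with the variable names actually stored in the checkpoint. The following is only a sketch, not something taken from the README: `list_checkpoint_keys` and `closest_keys` are hypothetical helper names, and the approach assumes the checkpoint in the training directory can be read with `tf.train.list_variables`. It only shows which stored names look similar to the missing key; it does not explain why they differ.

```python
import difflib


def list_checkpoint_keys(checkpoint_dir):
    """Return the variable names stored in the latest checkpoint of a directory."""
    # Imported lazily so the comparison helper below also works without TensorFlow.
    import tensorflow as tf

    ckpt_path = tf.train.latest_checkpoint(checkpoint_dir)
    if ckpt_path is None:
        raise FileNotFoundError(f"no checkpoint found in {checkpoint_dir}")
    # tf.train.list_variables yields (name, shape) pairs; keep only the names.
    return [name for name, _shape in tf.train.list_variables(ckpt_path)]


def closest_keys(missing_key, checkpoint_keys, n=3):
    """Return up to n stored names that most resemble the key the graph asked for."""
    # cutoff=0.0 keeps every candidate, so the best matches are always returned.
    return difflib.get_close_matches(missing_key, checkpoint_keys, n=n, cutoff=0.0)
```

With the training directory from the command above, calling `closest_keys(key_from_error, list_checkpoint_keys(train_dir))` would list the stored names nearest to the one reported as missing, which could show whether only a scope prefix differs.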

Could anyone help me? Thanks in advance.

heyzude, Dec 03 '20 06:12

I hit the same problem with TensorFlow 2.4.0. When I try to load a checkpoint downloaded from Magenta, as in the Colab, it fails. When I run the training commands, the checkpoint saved at 1000 epochs cannot be loaded.

dongmingli-Ben, Jun 04 '21 13:06

Also hitting this issue right now, training fresh.

almostimplemented, May 17 '22 10:05