
How do I use the pretrained checkpoint to continue training on my own corpus?

Open RyanHuangNLP opened this issue 5 years ago • 7 comments

I want to load the pretrained checkpoint and continue training on my own corpus. I use the run_pretraining.py code and set init_checkpoint to the pretrained model directory, but when I run the code, it raises this error:

ERROR:tensorflow:Error recorded from training_loop: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

From /job:worker/replica:0/task:0:
Key bert/embeddings/LayerNorm/beta/adam_m not found in checkpoint
	 [[node save/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

I know that when training finishes, it is better to remove the adam_m and adam_v parameters to reduce the size of the checkpoint file, but since I want to continue training from the pretrained checkpoint, how do I solve this problem? Maybe I can recover the Adam variable names in the checkpoint file? Thank you.
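One way to try that last idea is to rebuild the checkpoint with zero-initialised Adam slots so the restore finds every key it expects. This is only a sketch under TF 1.x: the paths are placeholders, and it assumes the missing keys follow BERT's "<variable>/adam_m" and "<variable>/adam_v" naming, which may not match your exact setup.

```python
import tensorflow as tf  # TF 1.x

# Placeholder paths -- replace with your own.
ckpt_path = "/path/to/bert_model.ckpt"
out_path = "/path/to/bert_model_with_adam/bert_model.ckpt"

reader = tf.train.load_checkpoint(ckpt_path)
var_to_shape = reader.get_variable_to_shape_map()

with tf.Graph().as_default():
    new_vars = []
    for name, shape in var_to_shape.items():
        # Copy every tensor that is already stored in the checkpoint.
        new_vars.append(tf.Variable(reader.get_tensor(name), name=name))
        # Add zero-initialised Adam slots for model parameters, assuming the
        # optimizer expects them under "<name>/adam_m" and "<name>/adam_v".
        if name != "global_step" and not name.endswith(("/adam_m", "/adam_v")):
            new_vars.append(tf.Variable(tf.zeros(shape), name=name + "/adam_m"))
            new_vars.append(tf.Variable(tf.zeros(shape), name=name + "/adam_v"))

    saver = tf.train.Saver(new_vars)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver.save(sess, out_path)
```

Whether this is the right fix depends on where the failing restore comes from; a stale checkpoint left in output_dir can also make the Estimator try to restore optimizer slots, so using a fresh output directory is worth checking first.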

RyanHuangNLP avatar Oct 25 '19 13:10 RyanHuangNLP


I ran into a similar issue while trying to load a model trained under TensorFlow 1.x into code upgraded to TensorFlow 2.0. If you have solved the issue, please share your approach.

ibrahimishag avatar Jan 17 '20 01:01 ibrahimishag

@ibrahimishag The TensorFlow 2.0 variable names are different from the TensorFlow 1.x ones; you can use the reference here.
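If it helps to see exactly how the names differ, a minimal way to inspect a checkpoint is to list what it actually stores; the path below is a placeholder.

```python
import tensorflow as tf

# Works under both TF 1.x and TF 2.x: prints every variable name and shape
# stored in the checkpoint, so naming mismatches with the graph are easy to spot.
ckpt_path = "/path/to/bert_model.ckpt"  # placeholder path
for name, shape in tf.train.list_variables(ckpt_path):
    print(name, shape)
```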

RyanHuangNLP avatar Jan 17 '20 08:01 RyanHuangNLP

Hi all! I'm trying to initialize from the mBERT checkpoint, but it is missing the "bert/embeddings/LayerNorm/beta/adam_m" key in the list of variables (just like you described). I'm using TF 1.14 and have not found a solution via checkpoint conversion in TF > 2. Did you find a solution?

manueltonneau avatar Feb 05 '20 09:02 manueltonneau

Hi @RyanHuangNLP, if you have found a solution for this problem, would you mind sharing it? :)

manueltonneau avatar Feb 12 '20 14:02 manueltonneau

Hi, I am also facing the same issue:
While trying to train from the mBERT checkpoint: Key bert/embeddings/LayerNorm/beta/adam_m not found in checkpoint
While trying to predict from the mBERT checkpoint: Key global_step not found in checkpoint
@RyanHuangNLP Did you find a solution for this? Thanks in advance!
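For the prediction-time error, a variant of the slot-adding sketch above that only adds a missing global_step might work; again this is just a sketch for a TF 1.x checkpoint, and the paths are placeholders.

```python
import tensorflow as tf  # TF 1.x

ckpt_path = "/path/to/bert_model.ckpt"       # placeholder path
out_path = "/path/to/bert_model_fixed.ckpt"  # placeholder path

reader = tf.train.load_checkpoint(ckpt_path)

with tf.Graph().as_default():
    # Copy every existing tensor, then add a zero global_step if it is missing.
    new_vars = [tf.Variable(reader.get_tensor(name), name=name)
                for name, _ in tf.train.list_variables(ckpt_path)]
    if "global_step" not in reader.get_variable_to_shape_map():
        new_vars.append(tf.Variable(0, dtype=tf.int64, name="global_step"))

    saver = tf.train.Saver(new_vars)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver.save(sess, out_path)
```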

AakritiBudhraja avatar Aug 05 '20 10:08 AakritiBudhraja

Kindly share the solution if someone knows. Thanks

geo47 avatar Nov 16 '20 10:11 geo47

Did anyone figure out a solution for this? I'm facing the same problem. Kindly share if someone does know.

nikhildurgam95 avatar Nov 10 '21 03:11 nikhildurgam95