Condenser
Unable to resume CoCondenser pretraining
The model checkpoints seem to be saved as plain BertForMaskedLM and cannot be loaded back into the CoCondenser class. Adding the following attributes in the initialization gets past the exceptions, but not all of the weights are loaded.
self._keys_to_ignore_on_save = None
self._keys_to_ignore_on_load_missing = None
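For reference, here is a minimal sketch of where these go (the constructor signature is abbreviated; the real __init__ in modeling.py differs):

import torch.nn as nn

class CoCondenserForPretraining(nn.Module):  # base class and arguments abbreviated
    def __init__(self, *args, **kwargs):
        super().__init__()
        # ... existing initialization from modeling.py ...
        # Attributes the HF Trainer expects on the model when it reloads a checkpoint;
        # setting them to None gets past the AttributeError but not the missing keys.
        self._keys_to_ignore_on_save = None
        self._keys_to_ignore_on_load_missing = None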
Is there a way to resume training after interruptions? Thanks!
Please elaborate on the issue. Include what you did, what worked and what did not, error messages, etc.
Here is how to reproduce the exception.
I first start training from the model downloaded from Hugging Face:
HF_DATASETS_CACHE="/expscratch/eyang/cache/datasets" TOKENIZERS_PARALLELISM="false" \
python run_co_pre_training.py \
--output_dir ./test/bert-base-cased/ \
--model_name_or_path bert-base-cased \
--do_train \
--fp16 \
--save_steps 1 \
--save_total_limit 10 \
--model_type bert \
--per_device_train_batch_size 256 \
--cache_chunk_size 12 \
--gradient_accumulation_steps 1 \
--warmup_ratio 0.1 \
--learning_rate 1e-5 \
--num_train_epochs 8 \
--dataloader_drop_last \
--overwrite_output_dir \
--dataloader_num_workers 10 \
--n_head_layers 2 \
--skip_from 6 \
--max_seq_length 180 \
--train_path ./processed_text/msmarco-document.span-90.tokenized-bert-base_incomplete.jsonl \
--weight_decay 0.01 \
--late_mlm
Then I tried to resume from the checkpoint saved at the first step:
HF_DATASETS_CACHE="/expscratch/eyang/cache/datasets" TOKENIZERS_PARALLELISM="false" \
python run_co_pre_training.py \
--output_dir ./test/bert-base-cased/ \
--model_name_or_path ./test/bert-base-cased/checkpoint-1 \
--do_train \
--fp16 \
--save_steps 100 \
--save_total_limit 10 \
--model_type bert \
--per_device_train_batch_size 256 \
--cache_chunk_size 12 \
--gradient_accumulation_steps 1 \
--warmup_ratio 0.1 \
--learning_rate 1e-5 \
--num_train_epochs 8 \
--dataloader_drop_last \
--overwrite_output_dir \
--dataloader_num_workers 10 \
--n_head_layers 2 \
--skip_from 6 \
--max_seq_length 180 \
--train_path ./processed_text/msmarco-document.span-90.tokenized-bert-base_incomplete.jsonl \
--weight_decay 0.01 \
--late_mlm
And here is the exception:
[INFO|tokenization_utils_base.py:1671] 2021-12-13 16:30:45,404 >> Didn't find file ./test/bert-base-cased/checkpoint-1/added_tokens.json. We won't load it.
[INFO|tokenization_utils_base.py:1740] 2021-12-13 16:30:45,404 >> loading file ./test/bert-base-cased/checkpoint-1/vocab.txt
[INFO|tokenization_utils_base.py:1740] 2021-12-13 16:30:45,404 >> loading file ./test/bert-base-cased/checkpoint-1/tokenizer.json
[INFO|tokenization_utils_base.py:1740] 2021-12-13 16:30:45,404 >> loading file None
[INFO|tokenization_utils_base.py:1740] 2021-12-13 16:30:45,404 >> loading file ./test/bert-base-cased/checkpoint-1/special_tokens_map.json
[INFO|tokenization_utils_base.py:1740] 2021-12-13 16:30:45,404 >> loading file ./test/bert-base-cased/checkpoint-1/tokenizer_config.json
[INFO|modeling_utils.py:1350] 2021-12-13 16:30:45,426 >> loading weights file ./test/bert-base-cased/checkpoint-1/pytorch_model.bin
[INFO|modeling_utils.py:1619] 2021-12-13 16:30:47,089 >> All model checkpoint weights were used when initializing BertForMaskedLM.
[INFO|modeling_utils.py:1627] 2021-12-13 16:30:47,089 >> All the weights of BertForMaskedLM were initialized from the model checkpoint at ./test/bert-base-cased/checkpoint-1.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertForMaskedLM for predictions without further training.
12/13/2021 16:30:47 - INFO - modeling - loading extra weights from local files
12/13/2021 16:30:47 - INFO - trainer - Initializing Gradient Cache Trainer
[INFO|trainer.py:439] 2021-12-13 16:30:51,616 >> Using amp half precision backend
/home/hltcoe/eyang/.conda/envs/pretrain/lib/python3.8/site-packages/transformers/trainer.py:1059: FutureWarning: `model_path` is deprecated and will be removed in a future version. Use `resume_from_checkpoint` instead.
warnings.warn(
[INFO|trainer.py:1089] 2021-12-13 16:30:51,618 >> Loading model from ./test/bert-base-cased/checkpoint-1).
Traceback (most recent call last):
File "run_co_pre_training.py", line 227, in <module>
main()
File "run_co_pre_training.py", line 217, in main
trainer.train(model_path=model_path)
File "/home/hltcoe/eyang/.conda/envs/pretrain/lib/python3.8/site-packages/transformers/trainer.py", line 1108, in train
self._load_state_dict_in_model(state_dict)
File "/home/hltcoe/eyang/.conda/envs/pretrain/lib/python3.8/site-packages/transformers/trainer.py", line 1484, in _load_state_dict_in_model
if self.model._keys_to_ignore_on_save is not None and set(load_result.missing_keys) == set(
File "/home/hltcoe/eyang/.conda/envs/pretrain/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1177, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'CoCondenserForPretraining' object has no attribute '_keys_to_ignore_on_save'
We can get past this exception by adding the two attributes here: https://github.com/luyug/Condenser/blob/main/modeling.py#L177
After adding these and executing the same command as above, here are the warnings (clipped, but they cover basically all the layers):
[INFO|tokenization_utils_base.py:1671] 2021-12-13 16:34:43,748 >> Didn't find file ./test/bert-base-cased/checkpoint-1/added_tokens.json. We won't load it.
[INFO|tokenization_utils_base.py:1740] 2021-12-13 16:34:43,749 >> loading file ./test/bert-base-cased/checkpoint-1/vocab.txt
[INFO|tokenization_utils_base.py:1740] 2021-12-13 16:34:43,749 >> loading file ./test/bert-base-cased/checkpoint-1/tokenizer.json
[INFO|tokenization_utils_base.py:1740] 2021-12-13 16:34:43,749 >> loading file None
[INFO|tokenization_utils_base.py:1740] 2021-12-13 16:34:43,749 >> loading file ./test/bert-base-cased/checkpoint-1/special_tokens_map.json
[INFO|tokenization_utils_base.py:1740] 2021-12-13 16:34:43,749 >> loading file ./test/bert-base-cased/checkpoint-1/tokenizer_config.json
[INFO|modeling_utils.py:1350] 2021-12-13 16:34:43,770 >> loading weights file ./test/bert-base-cased/checkpoint-1/pytorch_model.bin
[INFO|modeling_utils.py:1619] 2021-12-13 16:34:45,435 >> All model checkpoint weights were used when initializing BertForMaskedLM.
[INFO|modeling_utils.py:1627] 2021-12-13 16:34:45,435 >> All the weights of BertForMaskedLM were initialized from the model checkpoint at ./test/bert-base-cased/checkpoint-1.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertForMaskedLM for predictions without further training.
12/13/2021 16:34:45 - INFO - modeling - loading extra weights from local files
12/13/2021 16:34:45 - INFO - trainer - Initializing Gradient Cache Trainer
[INFO|trainer.py:439] 2021-12-13 16:34:49,899 >> Using amp half precision backend
/home/hltcoe/eyang/.conda/envs/pretrain/lib/python3.8/site-packages/transformers/trainer.py:1059: FutureWarning: `model_path` is deprecated and will be removed in a future version. Use `resume_from_checkpoint` instead.
warnings.warn(
[INFO|trainer.py:1089] 2021-12-13 16:34:49,901 >> Loading model from ./test/bert-base-cased/checkpoint-1).
[WARNING|trainer.py:1489] 2021-12-13 16:34:50,315 >> There were missing keys in the checkpoint model loaded: ['co_target', 'lm.bert.embeddings.position_ids', 'lm.bert.embeddings.word_embeddings.weight', 'lm.bert.embeddings.position_embeddings.weight', 'lm.bert.embeddings.token_type_embeddings.weight', 'lm.bert.embeddings.LayerNorm.weight', 'lm.bert.embeddings.LayerNorm.bias', 'lm
.bert.encoder.layer.0.attention.self.query.weight', 'lm.bert.encoder.layer.0.attention.self.query.bias', 'lm.bert.encoder.layer.0.attention.self.key.weight', 'lm.bert.encoder.layer.0.attention.self.key.bias', 'lm.bert.encoder.layer.0.attention.self.value.weight', 'lm.bert.encoder.layer.0.attention.self.value.bias', 'lm.bert.encoder.layer.0.attention.output.dense.weight', 'lm.bert
.encoder.layer.0.attention.output.dense.bias', 'lm.bert.encoder.layer.0.attention.output.LayerNorm.weight', 'lm.bert.encoder.layer.0.attention.output.LayerNorm.bias', 'lm.bert.encoder.layer.0.intermediate.dense.weight', 'lm.bert.encoder.layer.0.intermediate.dense.bias', 'lm.bert.encoder.layer.0.output.dense.weight', 'lm.bert.encoder.layer.0.output.dense.bias', 'lm.bert.encoder.la
yer.0.output.LayerNorm.weight', 'lm.bert.encoder.layer.0.output.LayerNorm.bias', 'lm.bert.encoder.layer.1.attention.self.query.weight', 'lm.bert.encoder.layer.1.attention.self.query.bias', 'lm.bert.encoder.layer.1.attention.self.key.weight', 'lm.bert.encoder.layer.1.attention.self.key.bias', 'lm.bert.encoder.layer.1.attention.self.value.weight', 'lm.bert.encoder.layer.1.attention
.self.value.bias', 'lm.bert.encoder.layer.1.attention.output.dense.weight', 'lm.bert.encoder.layer.1.attention.output.dense.bias', 'lm.bert.encoder.layer.1.attention.output.LayerNorm.weight', 'lm.bert.encoder.layer.1.attention.output.LayerNorm.bias', 'lm.bert.encoder.layer.1.intermediate.dense.weight', 'lm.bert.encoder.layer.1.intermediate.dense.bias', 'lm.bert.encoder.layer.1.ou
tput.dense.weight', 'lm.bert.encoder.layer.1.output.dense.bias', 'lm.bert.encoder.layer.1.output.LayerNorm.weight', 'lm.bert.encoder.layer.1.output.LayerNorm.bias', 'lm.bert.encoder.layer.2.attention.self.query.weight', 'lm.bert.encoder.layer.2.attention.self.query.bias', 'lm.bert.encoder.layer.2.attention.self.key.weight', 'lm.bert.encoder.layer.2.attention.self.key.bias', 'lm.b
ert.encoder.layer.2.attention.self.value.weight', 'lm.bert.encoder.layer.2.attention.self.value.bias', 'lm.bert.encoder.layer.2.attention.output.dense.weight', 'lm.bert.encoder.layer.2.attention.output.dense.bias', 'lm.bert.encoder.layer.2.attention.output.LayerNorm.weight', 'lm.bert.encoder.layer.2.attention.output.LayerNorm.bias', 'lm.bert.encoder.layer.2.intermediate.dense.wei
ght', 'lm.bert.encoder.layer.2.intermediate.dense.bias', 'lm.bert.encoder.layer.2.output.dense.weight', 'lm.bert.encoder.layer.2.output.dense.bias', 'lm.bert.encoder.layer.2.output.LayerNorm.weight', 'lm.bert.encoder.layer.2.output.LayerNorm.bias', 'lm.bert.encoder.layer.3.attention.self.query.weight', 'lm.bert.encoder.layer.3.attention.self.query.bias', 'lm.bert.encoder.layer.3.
attention.self.key.weight', 'lm.bert.encoder.layer.3.attention.self.key.bias', 'lm.bert.encoder.layer.3.attention.self.value.weight', 'lm.bert.encoder.layer.3.attention.self.value.bias', ...
The attribute _keys_to_ignore_on_save was introduced in a relatively recent release of HF transformers. Maybe I should patch the repo, but for now there are a few easy things you can do:
- Get an earlier version of transformers. I used 4.2.0 in my experiments.
- Set model_path=None here.
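For concreteness, the second option amounts to changing only the argument of the trainer.train call that appears in the traceback above (a sketch; the surrounding code in run_co_pre_training.py is omitted):

# In run_co_pre_training.py, pass None so the HF Trainer skips its own state-dict
# reload; the CoCondenser class then loads the weights from --model_name_or_path
# itself (the "loading extra weights from local files" log line above).
trainer.train(model_path=None)  # instead of trainer.train(model_path=model_path)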
Thank you for the reply!
Isn't setting model_path=None basically telling the trainer to start from scratch and ignore the checkpoint?
Would it make more sense to put the path of the checkpoint we want to resume from here (like ./test/bert-base-cased/checkpoint-1 in the example) and leave model_name_or_path as the original model (bert-base-cased in the example)?
Thank you for the reply! Isn't setting model_path=None basically telling the trainer to start from scratch and ignore the checkpoint?
Yes, and the CoCondenser object will do the loading. You will see a log when it does so. Letting the CoCondenser class do the loading makes sure that we can handle multiple load scenarios.
This is more or less a workaround. Eventually, I probably need to patch the CondenserPreTrainer class so that it no longer loads model weights.
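As a rough, hypothetical illustration of what the class-level loading looks like (this is not the repo's actual code; the class is stripped down and the head-weights file name is made up):

import os

import torch
from torch import nn
from transformers import BertForMaskedLM


class CoCondenserForPretraining(nn.Module):
    """Hypothetical, stripped-down stand-in for the class in modeling.py."""

    def __init__(self, lm: BertForMaskedLM):
        super().__init__()
        self.lm = lm  # wrapped MLM; the extra head layers are omitted for brevity

    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        # The inner BertForMaskedLM goes through the HF loader, so a plain
        # bert-base-cased and a saved checkpoint directory are both handled.
        lm = BertForMaskedLM.from_pretrained(*args, **kwargs)
        model = cls(lm)
        # If extra (head) weights were saved alongside the checkpoint, load them
        # non-strictly as well; the file name here is made up.
        head_file = os.path.join(args[0], 'head_weights.pt')
        if os.path.isfile(head_file):
            model.load_state_dict(torch.load(head_file, map_location='cpu'), strict=False)
        return model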
Maybe I am missing something, but from what I can read, using model_path=None and loading the model through CondenserForPretraining actually does exactly the same thing as trainer.train(resume_from_checkpoint=model_args.model_name_or_path). You just get rid of the warning, but the loading should be exactly the same. If you print missing_keys from the custom from_pretrained classmethod of CondenserForPretraining, you'll see it contains the same keys that are logged in the warning.
Maybe ignoring those keys on save is a cleaner solution, but in the end it should not change anything in the training.
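For reference, the check described above is just a non-strict load plus a print inside that classmethod (a hypothetical two-liner; model and state_dict stand for the wrapper instance and the checkpoint weights available at that point):

# A non-strict load reports the keys the wrapper expects (prefixed with 'lm.')
# but that the flat BertForMaskedLM checkpoint does not contain.
load_result = model.load_state_dict(state_dict, strict=False)
print(load_result.missing_keys)  # the same lm.bert.* keys as in the Trainer warning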