Error in loading checkpoint shards of TextDiffuser-2.
Describe the bug
Model I am using: TextDiffuser-2
The problem arises when using the official example scripts with the code:
python3 -m fastchat.serve.cli --model-path /textdiffuser-2/experiment_result/checkpoint-50/ --device cpu
I tried to replicate the layout planning model using the script train_layout_planner.sh. During inference with the fine-tuned checkpoint, I got an error while loading the third checkpoint shard.
The Error:
python3 -m fastchat.serve.cli --model-path /textdiffuser-2/experiment_result/checkpoint-50/ --device cpu
Loading checkpoint shards:  67%|█████████████▎     | 2/3 [00:13<00:06, 6.99s/it]
Traceback (most recent call last):
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/transformers/modeling_utils.py", line 442, in load_state_dict
return torch.load(checkpoint_file, map_location="cpu")
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/torch/serialization.py", line 1014, in load
return _load(opened_zipfile,
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/torch/serialization.py", line 1422, in _load
result = unpickler.load()
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/torch/_utils.py", line 202, in _rebuild_tensor_v2
tensor = _rebuild_tensor(storage, storage_offset, size, stride)
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/torch/_utils.py", line 181, in _rebuild_tensor
return t.set_(storage._untyped_storage, storage_offset, size, stride)
RuntimeError: Trying to resize storage that is not resizable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/transformers/modeling_utils.py", line 446, in load_state_dict
if f.read(7) == "version":
File "/miniconda3/envs/textdiffuser2/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 128: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/miniconda3/envs/textdiffuser2/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/miniconda3/envs/textdiffuser2/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/fastchat/serve/cli.py", line 280, in <module>
main(args)
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/fastchat/serve/cli.py", line 206, in main
chat_loop(
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/fastchat/serve/inference.py", line 307, in chat_loop
model, tokenizer = load_model(
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/fastchat/model/model_adapter.py", line 278, in load_model
model, tokenizer = adapter.load_model(model_path, kwargs)
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/fastchat/model/model_adapter.py", line 73, in load_model
model = AutoModelForCausalLM.from_pretrained(
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
return model_class.from_pretrained(
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2795, in from_pretrained
) = cls._load_pretrained_model(
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3109, in _load_pretrained_model
state_dict = load_state_dict(shard_file)
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/transformers/modeling_utils.py", line 458, in load_state_dict
raise OSError(
OSError: Unable to load weights from pytorch checkpoint file for '/textdiffuser-2/experiment_result/checkpoint-50/pytorch_model-00003-of-00003.bin' at '/textdiffuser-2/experiment_result/checkpoint-50/pytorch_model-00003-of-00003.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
I also tried CPU-only inference and got the same error.
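
For reference, one way to confirm which shard file is actually corrupted, independent of fastchat, is to torch.load each shard directly. This is only a minimal diagnostic sketch; the checkpoint path is the one from the command above:

```python
import glob
import torch

ckpt_dir = "/textdiffuser-2/experiment_result/checkpoint-50"

# Try to deserialize each shard on its own; a healthy shard loads into a dict of tensors.
for shard in sorted(glob.glob(f"{ckpt_dir}/pytorch_model-*-of-*.bin")):
    try:
        state_dict = torch.load(shard, map_location="cpu")
        print(f"{shard}: OK ({len(state_dict)} tensors)")
    except Exception as exc:
        print(f"{shard}: FAILED ({exc})")
```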
Hello, thanks for your interest in TextDiffuser-2! May I ask which version of the Transformers library you are using?
Thanks for replying. I am using transformers 4.28.1. I found that the bug only appears when training with flash attention (train_mem.py); when using train.py, the checkpoints are saved correctly. In that case the third shard is ~7 GB, whereas with train_mem.py the corrupted third shard is only ~5 GB.
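
For anyone comparing, the on-disk shard sizes can be checked against the total recorded in pytorch_model.bin.index.json, which transformers writes next to the shards. A small sketch, assuming the same checkpoint directory as above:

```python
import json
import os

ckpt_dir = "/textdiffuser-2/experiment_result/checkpoint-50"

# The index file records the expected total size and which tensor lives in which shard.
with open(os.path.join(ckpt_dir, "pytorch_model.bin.index.json")) as f:
    index = json.load(f)

expected = index["metadata"]["total_size"]
on_disk = sum(
    os.path.getsize(os.path.join(ckpt_dir, shard))
    for shard in set(index["weight_map"].values())
)
print(f"expected ~{expected / 1e9:.2f} GB, found {on_disk / 1e9:.2f} GB on disk")
# A gap like the one described above (~7 GB vs ~5 GB for the third shard)
# suggests the save of that shard was truncated.
```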