Error in loading checkpoint shards of TextDiffuser-2.
Describe the bug
Model I am using: TextDiffuser-2
The problem arises when using the official example scripts with the code:
python3 -m fastchat.serve.cli --model-path /textdiffuser-2/experiment_result/checkpoint-50/ --device cpu
I tried to replicate the layout planning model using the script train_layout_planner.sh. During inference with the fine-tuned checkpoint, I got an error while loading the third checkpoint shard.
The Error:
python3 -m fastchat.serve.cli --model-path /textdiffuser-2/experiment_result/checkpoint-50/ --device cpu
Loading checkpoint shards:  67%|█████████████▎     | 2/3 [00:13<00:06, 6.99s/it]
Traceback (most recent call last):
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/transformers/modeling_utils.py", line 442, in load_state_dict
return torch.load(checkpoint_file, map_location="cpu")
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/torch/serialization.py", line 1014, in load
return _load(opened_zipfile,
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/torch/serialization.py", line 1422, in _load
result = unpickler.load()
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/torch/_utils.py", line 202, in _rebuild_tensor_v2
tensor = _rebuild_tensor(storage, storage_offset, size, stride)
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/torch/_utils.py", line 181, in _rebuild_tensor
return t.set_(storage._untyped_storage, storage_offset, size, stride)
RuntimeError: Trying to resize storage that is not resizable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/transformers/modeling_utils.py", line 446, in load_state_dict
if f.read(7) == "version":
File "/miniconda3/envs/textdiffuser2/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 128: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/miniconda3/envs/textdiffuser2/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/miniconda3/envs/textdiffuser2/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/fastchat/serve/cli.py", line 280, in <module>
main(args)
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/fastchat/serve/cli.py", line 206, in main
chat_loop(
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/fastchat/serve/inference.py", line 307, in chat_loop
model, tokenizer = load_model(
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/fastchat/model/model_adapter.py", line 278, in load_model
model, tokenizer = adapter.load_model(model_path, kwargs)
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/fastchat/model/model_adapter.py", line 73, in load_model
model = AutoModelForCausalLM.from_pretrained(
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
return model_class.from_pretrained(
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2795, in from_pretrained
) = cls._load_pretrained_model(
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3109, in _load_pretrained_model
state_dict = load_state_dict(shard_file)
File "/miniconda3/envs/textdiffuser2/lib/python3.8/site-packages/transformers/modeling_utils.py", line 458, in load_state_dict
raise OSError(
OSError: Unable to load weights from pytorch checkpoint file for '/textdiffuser-2/experiment_result/checkpoint-50/pytorch_model-00003-of-00003.bin' at '/textdiffuser-2/experiment_result/checkpoint-50/pytorch_model-00003-of-00003.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
I also tried CPU-only inference and got the same error.
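
For reference, one way to confirm which shard file is actually corrupted, independent of fastchat, is to torch.load each shard directly. This is only a minimal diagnostic sketch; the checkpoint path is the one from the command above:

```python
import glob
import torch

ckpt_dir = "/textdiffuser-2/experiment_result/checkpoint-50"

# Try to deserialize each shard on its own; a healthy shard loads into a dict of tensors.
for shard in sorted(glob.glob(f"{ckpt_dir}/pytorch_model-*-of-*.bin")):
    try:
        state_dict = torch.load(shard, map_location="cpu")
        print(f"{shard}: OK ({len(state_dict)} tensors)")
    except Exception as exc:
        print(f"{shard}: FAILED ({exc})")
```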
Hello, thanks for your interest in TextDiffuser-2! May I ask which version of the Transformers library you are using?
Thanks for replying. I am using transformers 4.28.1. I found that the bug only appears when training with flash attention (train_mem.py); when using train.py, the checkpoints are saved correctly. In that case the third shard is ~7 GB, whereas with train_mem.py the corrupted third shard is only ~5 GB.
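
For anyone comparing, the on-disk shard sizes can be checked against the total recorded in pytorch_model.bin.index.json, which transformers writes next to the shards. A small sketch, assuming the same checkpoint directory as above:

```python
import json
import os

ckpt_dir = "/textdiffuser-2/experiment_result/checkpoint-50"

# The index file records the expected total size and which tensor lives in which shard.
with open(os.path.join(ckpt_dir, "pytorch_model.bin.index.json")) as f:
    index = json.load(f)

expected = index["metadata"]["total_size"]
on_disk = sum(
    os.path.getsize(os.path.join(ckpt_dir, shard))
    for shard in set(index["weight_map"].values())
)
print(f"expected ~{expected / 1e9:.2f} GB, found {on_disk / 1e9:.2f} GB on disk")
# A gap like the one described above (~7 GB vs ~5 GB for the third shard)
# suggests the save of that shard was truncated.
```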