stanford_alpaca

Checkpoint fails to load after training

Open ssemeniuta opened this issue 2 years ago • 3 comments

When trying to load a model from the output directory of train.py, I get:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 442, in load_state_dict
    return torch.load(checkpoint_file, map_location="cpu")
  File "/usr/local/lib64/python3.8/site-packages/torch/serialization.py", line 712, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/usr/local/lib64/python3.8/site-packages/torch/serialization.py", line 1049, in _load
    result = unpickler.load()
  File "/usr/local/lib64/python3.8/site-packages/torch/_utils.py", line 138, in _rebuild_tensor_v2
    tensor = _rebuild_tensor(storage, storage_offset, size, stride)
  File "/usr/local/lib64/python3.8/site-packages/torch/_utils.py", line 134, in _rebuild_tensor
    return t.set_(storage._untyped(), storage_offset, size, stride)
RuntimeError: Trying to resize storage that is not resizable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 446, in load_state_dict
    if f.read(7) == "version":
  File "/usr/lib64/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "generate.py", line 102, in <module>
    inference()
  File "generate.py", line 39, in inference
    model = transformers.AutoModelForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2795, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3109, in _load_pretrained_model
    state_dict = load_state_dict(shard_file)
  File "/usr/local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 458, in load_state_dict
    raise OSError(
OSError: Unable to load weights from pytorch checkpoint file for '../alpaca_output/pytorch_model-00003-of-00003.bin' at '../alpaca_output/pytorch_model-00003-of-00003.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

Do I need to postprocess the checkpoint, e.g. convert it using a script?
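A minimal way to narrow this down is to try deserializing each shard on its own. This is only a diagnostic sketch (not a fix), assuming the sharded pytorch_model-*.bin layout and the output path from the traceback above:

# Diagnostic sketch: torch.load each shard on CPU to find the unreadable one.
# The directory is the one from the traceback; adjust to your output_dir.
import glob
import torch

for shard in sorted(glob.glob("../alpaca_output/pytorch_model-*.bin")):
    try:
        torch.load(shard, map_location="cpu")
        print(f"OK      {shard}")
    except Exception as exc:
        print(f"BROKEN  {shard}: {exc}")

A shard that fails here despite having a plausible file size may point to an interrupted or out-of-memory save rather than a loading bug.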

ssemeniuta avatar May 02 '23 13:05 ssemeniuta

Same problem.

xulangping avatar May 04 '23 09:05 xulangping

With 7B there is no problem, but with 13B I hit the same error on shard pytorch_model-00005-of-00006.bin. Maybe OOM? @ssemeniuta @thashim

xulangping avatar May 04 '23 10:05 xulangping

Loading checkpoint shards: 67%|████████████ | 4/6 [00:27<00:13, 6.94s/it]

Traceback (most recent call last):
  File "/data/langping/miniconda3/envs/coati/lib/python3.9/site-packages/transformers/modeling_utils.py", line 415, in load_state_dict
    return torch.load(checkpoint_file, map_location="cpu")
  File "/data/langping/miniconda3/envs/coati/lib/python3.9/site-packages/torch/serialization.py", line 789, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/data/langping/miniconda3/envs/coati/lib/python3.9/site-packages/torch/serialization.py", line 1131, in _load
    result = unpickler.load()
  File "/data/langping/miniconda3/envs/coati/lib/python3.9/site-packages/torch/_utils.py", line 153, in _rebuild_tensor_v2
    tensor = _rebuild_tensor(storage, storage_offset, size, stride)
  File "/data/langping/miniconda3/envs/coati/lib/python3.9/site-packages/torch/_utils.py", line 147, in _rebuild_tensor
    return t.set_(storage._untyped(), storage_offset, size, stride)
RuntimeError: Trying to resize storage that is not resizable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/langping/miniconda3/envs/coati/lib/python3.9/site-packages/transformers/modeling_utils.py", line 419, in load_state_dict
    if f.read(7) == "version":
  File "/data/langping/miniconda3/envs/coati/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 128: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/langping/stanford_alpaca/chat.py", line 44, in <module>
    load_model("output_13b/checkpoint-46000")
  File "/data/langping/stanford_alpaca/chat.py", line 28, in load_model
    model = transformers.AutoModelForCausalLM.from_pretrained(
  File "/data/langping/miniconda3/envs/coati/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
    return model_class.from_pretrained(
  File "/data/langping/miniconda3/envs/coati/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2643, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/data/langping/miniconda3/envs/coati/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2952, in _load_pretrained_model
    state_dict = load_state_dict(shard_file)
  File "/data/langping/miniconda3/envs/coati/lib/python3.9/site-packages/transformers/modeling_utils.py", line 431, in load_state_dict
    raise OSError(
OSError: Unable to load weights from pytorch checkpoint file for 'output_13b/checkpoint-46000/pytorch_model-00005-of-00006.bin' at 'output_13b/checkpoint-46000/pytorch_model-00005-of-00006.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

xulangping avatar May 04 '23 10:05 xulangping

@xulangping Thanks for the feedback! Yes, it could be OOM. How much VRAM does your GPU have?

ssemeniuta avatar May 10 '23 08:05 ssemeniuta

The problem is fixed by adding --save_safetensors to train.sh.
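For anyone wondering where the flag goes: --save_safetensors is an ordinary train.py command-line flag that HfArgumentParser parses into TrainingArguments in recent transformers releases (it does not exist in older ones). A hedged sketch of the in-code equivalent, with the output path as a placeholder:

# Sketch: the --save_safetensors flag maps onto TrainingArguments.save_safetensors
# (recent transformers releases only). With it set, checkpoints are written as
# safetensors files instead of pickle-based pytorch_model-*.bin shards.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="../alpaca_output",  # placeholder path
    save_safetensors=True,
)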

ssemeniuta avatar May 12 '23 12:05 ssemeniuta

The problem is fixed by adding --save_safetensors to train.sh.

Hi, I encountered the same issue. However, I cannot find the right place to add the --save_safetensors flag. Could you share your training script?

zlkqlyc avatar Jun 03 '23 12:06 zlkqlyc

The problem is fixed by adding --save_safetensors to train.sh.

I have the same issue, and adding --save_safetensors raises an error: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--save_safetensors']
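That ValueError usually means the installed transformers predates the save_safetensors training argument, so HfArgumentParser rejects the flag. A hedged way to check (the argument was added around transformers 4.29; verify against the release notes):

# Check whether this transformers build exposes save_safetensors.
import dataclasses
import transformers
from transformers import TrainingArguments

print(transformers.__version__)
print(any(f.name == "save_safetensors" for f in dataclasses.fields(TrainingArguments)))

If it prints False, upgrading transformers and retraining should make the flag, and safetensors checkpoints, available.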

foreverpiano avatar Jun 12 '23 14:06 foreverpiano

I am having the same problem. Any luck solving it?

Mehrnoom avatar Jun 16 '23 07:06 Mehrnoom

Has anyone solved this problem?

alphanlp avatar Jun 16 '23 11:06 alphanlp