stanford_alpaca
Checkpoint fails to load after training
When trying to load a model from the output directory of train.py, I get:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 442, in load_state_dict
return torch.load(checkpoint_file, map_location="cpu")
File "/usr/local/lib64/python3.8/site-packages/torch/serialization.py", line 712, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/usr/local/lib64/python3.8/site-packages/torch/serialization.py", line 1049, in _load
result = unpickler.load()
File "/usr/local/lib64/python3.8/site-packages/torch/_utils.py", line 138, in _rebuild_tensor_v2
tensor = _rebuild_tensor(storage, storage_offset, size, stride)
File "/usr/local/lib64/python3.8/site-packages/torch/_utils.py", line 134, in _rebuild_tensor
return t.set_(storage._untyped(), storage_offset, size, stride)
RuntimeError: Trying to resize storage that is not resizable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 446, in load_state_dict
if f.read(7) == "version":
File "/usr/lib64/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "generate.py", line 102, in <module>
inference()
File "generate.py", line 39, in inference
model = transformers.AutoModelForCausalLM.from_pretrained(
File "/usr/local/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
return model_class.from_pretrained(
File "/usr/local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2795, in from_pretrained
) = cls._load_pretrained_model(
File "/usr/local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3109, in _load_pretrained_model
state_dict = load_state_dict(shard_file)
File "/usr/local/lib/python3.8/site-packages/transformers/modeling_utils.py", line 458, in load_state_dict
raise OSError(
OSError: Unable to load weights from pytorch checkpoint file for '../alpaca_output/pytorch_model-00003-of-00003.bin' at '../alpaca_output/pytorch_model-00003-of-00003.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
Do I need to postprocess the checkpoint, e.g. convert it using a script?
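Before looking for a conversion script, it may help to check whether the shard files on disk are intact. The sketch below is only a diagnostic under stated assumptions, not a confirmed fix: it assumes the output directory contains the usual sharded-checkpoint index file pytorch_model.bin.index.json, and that the shards were written by torch.save in its default zip-based format. The function name check_shards is hypothetical. A shard that is missing, is not a readable zip archive, or is much smaller than the index expects was probably not written completely during saving.

```python
import json
import os
import zipfile


def check_shards(checkpoint_dir, index_name="pytorch_model.bin.index.json"):
    """Return (shard_name, problem) pairs for shards that look damaged.

    Hypothetical helper: it only inspects the files on disk and never
    loads any tensors, so it needs neither torch nor a GPU.
    """
    with open(os.path.join(checkpoint_dir, index_name)) as f:
        index = json.load(f)

    problems = []
    on_disk = 0
    # weight_map maps each parameter name to the shard file that stores it.
    for shard in sorted(set(index["weight_map"].values())):
        path = os.path.join(checkpoint_dir, shard)
        if not os.path.isfile(path):
            problems.append((shard, "missing"))
            continue
        on_disk += os.path.getsize(path)
        if not zipfile.is_zipfile(path):
            # Assumption: shards use torch.save's default zip container.
            problems.append((shard, "not a valid zip archive (truncated?)"))
            continue
        with zipfile.ZipFile(path) as zf:
            bad = zf.testzip()  # first member failing its CRC check, or None
        if bad is not None:
            problems.append((shard, f"corrupt member: {bad}"))

    # total_size is the byte count of all tensors recorded in the index.
    expected = index.get("metadata", {}).get("total_size")
    if expected is not None and on_disk < expected:
        problems.append(("(all shards)", f"{on_disk} bytes on disk, index expects {expected}"))
    return problems


if __name__ == "__main__":
    import sys

    for shard, problem in check_shards(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(f"{shard}: {problem}")
```

Run it with the output directory as the only argument, e.g. `python check_shards.py ../alpaca_output`. Note that testzip() reads every byte of each shard, so it can take a while on multi-gigabyte files. An empty result means only that the files are structurally sound, not that the weights inside are correct.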
same problem
With 7B there is no problem, but with 13B I get the same error for 00005.bin. Maybe OOM? @ssemeniuta @thashim
Loading checkpoint shards: 67%|████████████ | 4/6 [00:27<00:13, 6.94s/it]
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /data/langping/miniconda3/envs/coati/lib/python3.9/site-packages/transformer │
│ s/modeling_utils.py:415 in load_state_dict │
│ │
│ 412 │ │ │ ) │
│ 413 │ │ return safe_load_file(checkpoint_file) │
│ 414 │ try: │
│ ❱ 415 │ │ return torch.load(checkpoint_file, map_location="cpu") │
│ 416 │ except Exception as e: │
│ 417 │ │ try: │
│ 418 │ │ │ with open(checkpoint_file) as f: │
│ │
│ /data/langping/miniconda3/envs/coati/lib/python3.9/site-packages/torch/seria │
│ lization.py:789 in load │
│ │
│ 786 │ │ │ │ │ │ return _load(opened_zipfile, map_location, _w │
│ 787 │ │ │ │ │ except RuntimeError as e: │
│ 788 │ │ │ │ │ │ raise pickle.UnpicklingError(UNSAFE_MESSAGE + │
│ ❱ 789 │ │ │ │ return _load(opened_zipfile, map_location, pickle_mod │
│ 790 │ │ if weights_only: │
│ 791 │ │ │ try: │
│ 792 │ │ │ │ return _legacy_load(opened_file, map_location, _weigh │
│ │
│ /data/langping/miniconda3/envs/coati/lib/python3.9/site-packages/torch/seria │
│ lization.py:1131 in _load │
│ │
│ 1128 │ │
│ 1129 │ unpickler = UnpicklerWrapper(data_file, **pickle_load_args) │
│ 1130 │ unpickler.persistent_load = persistent_load │
│ ❱ 1131 │ result = unpickler.load() │
│ 1132 │ │
│ 1133 │ torch._utils._validate_loaded_sparse_tensors() │
│ 1134 │ │
│ │
│ /data/langping/miniconda3/envs/coati/lib/python3.9/site-packages/torch/_util │
│ s.py:153 in _rebuild_tensor_v2 │
│ │
│ 150 def _rebuild_tensor_v2( │
│ 151 │ storage, storage_offset, size, stride, requires_grad, backward_hoo │
│ 152 ): │
│ ❱ 153 │ tensor = _rebuild_tensor(storage, storage_offset, size, stride) │
│ 154 │ tensor.requires_grad = requires_grad │
│ 155 │ # NB: This line exists only for backwards compatibility; the │
│ 156 │ # general expectation is that backward_hooks is an empty │
│ │
│ /data/langping/miniconda3/envs/coati/lib/python3.9/site-packages/torch/_util │
│ s.py:147 in _rebuild_tensor │
│ │
│ 144 def _rebuild_tensor(storage, storage_offset, size, stride): │
│ 145 │ # first construct a tensor with the correct dtype/device │
│ 146 │ t = torch.tensor([], dtype=storage.dtype, device=storage.untyped() │
│ ❱ 147 │ return t.set_(storage.untyped(), storage_offset, size, stride) │
│ 148 │ │
│ 149 │ │
│ 150 def _rebuild_tensor_v2( │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Trying to resize storage that is not resizable
During handling of the above exception, another exception occurred:
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /data/langping/miniconda3/envs/coati/lib/python3.9/site-packages/transformer │
│ s/modeling_utils.py:419 in load_state_dict │
│ │
│ 416 │ except Exception as e: │
│ 417 │ │ try: │
│ 418 │ │ │ with open(checkpoint_file) as f: │
│ ❱ 419 │ │ │ │ if f.read(7) == "version": │
│ 420 │ │ │ │ │ raise OSError( │
│ 421 │ │ │ │ │ │ "You seem to have cloned a repository without │
│ 422 │ │ │ │ │ │ "git-lfs and run git lfs install followed b │
│ │
│ /data/langping/miniconda3/envs/coati/lib/python3.9/codecs.py:322 in decode │
│ │
│ 319 │ def decode(self, input, final=False): │
│ 320 │ │ # decode input (taking the buffer into account) │
│ 321 │ │ data = self.buffer + input │
│ ❱ 322 │ │ (result, consumed) = self._buffer_decode(data, self.errors, f │
│ 323 │ │ # keep undecoded input until the next call │
│ 324 │ │ self.buffer = data[consumed:] │
│ 325 │ │ return result │
╰──────────────────────────────────────────────────────────────────────────────╯
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 128:
invalid start byte
During handling of the above exception, another exception occurred:
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /data/langping/stanford_alpaca/chat.py:44 in
@xulangping thanks for the feedback! yes, it could be OOM. How much VRAM does your GPU have?
The problem is fixed by adding --save_safetensors to train.sh.
Hi, I encountered the same issue. However, I cannot find the right place to add the safetensor code. Could you share your training script?
I have the same issue, and adding --save_safetensors returns an error: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--save_safetensors']
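One possible reading of this ValueError: HfArgumentParser builds its command-line flags from the fields of the dataclasses it is given, so a flag is rejected when no dataclass defines a field with that name. If the installed transformers version's TrainingArguments has no save_safetensors field, the flag cannot be used without upgrading. The sketch below checks for the field; accepts_flag is a hypothetical helper name, and the check against transformers.TrainingArguments only runs if transformers is importable.

```python
import dataclasses


def accepts_flag(args_class, flag):
    """Return True if the dataclass defines the field behind a flag like '--save_safetensors'.

    Hypothetical helper: it compares the flag name (without leading dashes)
    against the dataclass field names, including inherited ones.
    """
    name = flag.lstrip("-")
    return any(f.name == name for f in dataclasses.fields(args_class))


if __name__ == "__main__":
    try:
        import transformers
    except ImportError:
        print("transformers is not installed")
    else:
        print("transformers version:", transformers.__version__)
        print(
            "--save_safetensors accepted:",
            accepts_flag(transformers.TrainingArguments, "--save_safetensors"),
        )
```

If this prints False, the installed TrainingArguments does not know the option, which would explain the error. Since dataclasses.fields also reports inherited fields, the same check works on a TrainingArguments subclass defined in a training script.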
I am having the same problem. Any luck in solving it?
Has anyone solved the problem?