convert-pth-to-ggml.py failed with RuntimeError

Open KevinXuxuxu opened this issue 1 year ago • 3 comments

Hi there, I downloaded my LLaMA weights via BitTorrent and tried to convert the 7B model to ggml FP16 format:

$ python convert-pth-to-ggml.py models/7B/ 1
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
{'dim': 4096, 'multiple_of': 256, 'n_heads': 32, 'n_layers': 32, 'norm_eps': 1e-06, 'vocab_size': 32000}
n_parts =  1
Processing part  0
Traceback (most recent call last):
  File "/Users/fzxu/Documents/code/llama.cpp/convert-pth-to-ggml.py", line 89, in <module>
    model = torch.load(fname_model, map_location="cpu")
  File "/opt/anaconda3/envs/llama.cpp/lib/python3.10/site-packages/torch/serialization.py", line 712, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/opt/anaconda3/envs/llama.cpp/lib/python3.10/site-packages/torch/serialization.py", line 1049, in _load
    result = unpickler.load()
  File "/opt/anaconda3/envs/llama.cpp/lib/python3.10/site-packages/torch/serialization.py", line 1019, in persistent_load
    load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "/opt/anaconda3/envs/llama.cpp/lib/python3.10/site-packages/torch/serialization.py", line 997, in load_tensor
    storage = zip_file.get_storage_from_record(name, numel, torch._UntypedStorage).storage()._untyped()
RuntimeError: PytorchStreamReader failed reading file data/27: invalid header or archive is corrupted

Does this mean my downloaded model weights are corrupted, or am I missing something? I have filed a request with Meta, and hopefully I can try again with data from the official download source.

KevinXuxuxu avatar Mar 12 '23 05:03 KevinXuxuxu

What is the "data/27" file? Is it inside your models/7B folder? You downloaded the wrong thing.

G2G2G2G avatar Mar 12 '23 06:03 G2G2G2G

Here's the file structure of my downloaded model:

$ ls ./models 
7B                      tokenizer.model         tokenizer_checklist.chk
$ ls ./models/7B
checklist.chk       consolidated.00.pth params.json

There isn't a directory called data, and the layout looks normal to me. As for data/27, it appears to be a record inside the .pth file, which seems to be a zip archive (a guess based on the PyTorch serialization code: https://github.com/pytorch/pytorch/blob/master/torch/serialization.py#L1112)
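
One way to test that guess without loading the model is to treat the checkpoint as an ordinary zip archive and let Python's standard zipfile module verify every record. This is only a sketch: it assumes the .pth file really is a zip container (the opened_zipfile argument in the traceback suggests so), and first_bad_record is a made-up helper name, not part of llama.cpp or PyTorch.

```python
import zipfile


def first_bad_record(path):
    """Return the name of the first corrupt record in a zip-based
    checkpoint, None if every record passes its CRC check, or a
    message string if the file cannot be opened as a zip at all."""
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() reads every member and returns the first name
            # whose CRC or header is bad, or None if all are fine.
            return zf.testzip()
    except zipfile.BadZipFile as exc:
        # The central directory itself is unreadable.
        return f"<unreadable archive: {exc}>"


if __name__ == "__main__":
    # Path taken from the listing above; adjust to your own layout.
    print(first_bad_record("models/7B/consolidated.00.pth"))
```

If this prints a record name such as one ending in data/27, the download is damaged and has to be fetched again. A result of None only means the zip-level checksums are consistent, so comparing file hashes against a known-good copy is still worthwhile.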

KevinXuxuxu avatar Mar 12 '23 06:03 KevinXuxuxu

@KevinXuxuxu Can you post the hashes of the downloaded files?

on Linux:

sha256sum ./models/7B/*

on macOS:

shasum -a 256 ./models/7B/*

My hashes are:

7935c843a25ae265d60bf4543b90bfd91c4911b728412b5c1d5cff42a3cd5645  ./models/7B/checklist.chk
700df0d3013b703a806d2ae7f1bfb8e59814e3d06ae78be0c66368a50059f33d  ./models/7B/consolidated.00.pth
7e89e242ddc0dd6f060b43ca219ce8b3e8f08959a72cb3c0855df8bb04d46265  ./models/7B/params.json
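
If you would rather script the comparison than eyeball it, here is a rough Python sketch using only the standard library. The EXPECTED table simply repeats the three hashes listed above; sha256_of and mismatched_files are illustrative names, not anything shipped with llama.cpp.

```python
import hashlib
from pathlib import Path

# SHA-256 hashes quoted in this comment for the 7B files.
EXPECTED = {
    "checklist.chk": "7935c843a25ae265d60bf4543b90bfd91c4911b728412b5c1d5cff42a3cd5645",
    "consolidated.00.pth": "700df0d3013b703a806d2ae7f1bfb8e59814e3d06ae78be0c66368a50059f33d",
    "params.json": "7e89e242ddc0dd6f060b43ca219ce8b3e8f08959a72cb3c0855df8bb04d46265",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so multi-gigabyte weights
    never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_files(model_dir, expected=EXPECTED):
    """Return the names of files that are missing or whose hash differs."""
    bad = []
    for name, want in expected.items():
        path = Path(model_dir) / name
        if not path.is_file() or sha256_of(path) != want:
            bad.append(name)
    return bad


if __name__ == "__main__":
    print(mismatched_files("./models/7B"))
```

An empty list means all three files match the hashes above; any name in the list points at a file to download again.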

prusnak avatar Mar 12 '23 10:03 prusnak

@prusnak Thanks for providing the shasum for my validation!

$ shasum -a 256 ./models/7B/*
7935c843a25ae265d60bf4543b90bfd91c4911b728412b5c1d5cff42a3cd5645  ./models/7B/checklist.chk
008cfbd68936367b15a311494c8c8259c4902dbb461896ae767084372cdfa3fc  ./models/7B/consolidated.00.pth
7e89e242ddc0dd6f060b43ca219ce8b3e8f08959a72cb3c0855df8bb04d46265  ./models/7B/params.json

Indeed, my consolidated.00.pth file is corrupted: its hash differs from yours. May I ask how you got the data? From the official Meta download or BitTorrent? Closing this comment while I try to get a correct version of the model weights.

KevinXuxuxu avatar Mar 12 '23 20:03 KevinXuxuxu

@prusnak Can you provide hashes for the 13B files?

prettydeep avatar Mar 14 '23 01:03 prettydeep

For anyone who has doubts about their data, try https://github.com/cocktailpeanut/dalai, which downloads the weights for you; they seem to come from a reliable source.

KevinXuxuxu avatar Mar 14 '23 06:03 KevinXuxuxu

Can you please provide a link to download the LLaMA files?

tanishhshahh avatar Mar 17 '23 22:03 tanishhshahh