text-generation-inference
Safetensors conversion fails for LLaMa 13B and 30B
System Info
2023-06-15T04:27:53.010592Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: 5ce89059f8149eaf313c63e9ded4199670cd74bb
Docker label: sha-5ce8905
nvidia-smi:
Thu Jun 15 04:27:51 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.07 Driver Version: 515.65.07 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:10:00.0 Off | Off |
| N/A 34C P0 86W / 400W | 25302MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:16:00.0 Off | Off |
| N/A 30C P0 64W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:49:00.0 Off | Off |
| N/A 31C P0 73W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... On | 00000000:4D:00.0 Off | Off |
| N/A 31C P0 71W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... On | 00000000:C5:00.0 Off | Off |
| N/A 34C P0 91W / 400W | 32900MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM... On | 00000000:CA:00.0 Off | Off |
| N/A 34C P0 92W / 400W | 33044MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM... On | 00000000:E3:00.0 Off | Off |
| N/A 33C P0 96W / 400W | 33044MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM... On | 00000000:E7:00.0 Off | Off |
| N/A 35C P0 89W / 400W | 32900MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
I used the official script to convert the LLaMa weights (7B–65B) and got the following error when launching 13B and 30B only.
I ran this command
CUDA_VISIBLE_DEVICES=2,3 TRANSFORMERS_CACHE=~/ckpts/cache/ FLASH_ATTENTION=1 text-generation-launcher --model-id ~/ckpts/llama-hf/30b/ --num-shard 2 --port XXXXX
2023-06-15T02:42:46.121030Z INFO text_generation_launcher: Args { model_id: "~/ckpts/llama-hf/30b/", revision: None, sharded: None, num_shard: Some(2), quantize: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 10208, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: false }
2023-06-15T02:42:46.121084Z INFO text_generation_launcher: Sharding model on 2 processes
2023-06-15T02:42:46.121251Z INFO text_generation_launcher: Starting download process.
2023-06-15T02:42:49.059320Z WARN download: text_generation_launcher: No safetensors weights found for model ~/ckpts/llama-hf/30b/ at revision None. Converting PyTorch weights to safetensors.
2023-06-15T02:42:49.059564Z INFO download: text_generation_launcher: Convert ~/ckpts/llama-hf/30b/pytorch_model-00001-of-00007.bin to ~/ckpts/llama-hf/30b/model-00001-of-00007.safetensors.
2023-06-15T02:46:40.369256Z INFO download: text_generation_launcher: Convert: [1/7] -- Took: 0:03:51.309267
2023-06-15T02:46:40.369344Z INFO download: text_generation_launcher: Convert ~/ckpts/llama-hf/30b/pytorch_model-00002-of-00007.bin to ~/ckpts/llama-hf/30b/model-00002-of-00007.safetensors.
2023-06-15T02:50:24.190971Z INFO download: text_generation_launcher: Convert: [2/7] -- Took: 0:03:43.820986
2023-06-15T02:50:24.191183Z INFO download: text_generation_launcher: Convert ~/ckpts/llama-hf/30b/pytorch_model-00003-of-00007.bin to ~/ckpts/llama-hf/30b/model-00003-of-00007.safetensors.
2023-06-15T02:54:06.621353Z INFO download: text_generation_launcher: Convert: [3/7] -- Took: 0:03:42.429557
2023-06-15T02:54:06.621511Z INFO download: text_generation_launcher: Convert ~/ckpts/llama-hf/30b/pytorch_model-00004-of-00007.bin to ~/ckpts/llama-hf/30b/model-00004-of-00007.safetensors.
2023-06-15T02:57:45.265631Z INFO download: text_generation_launcher: Convert: [4/7] -- Took: 0:03:38.643727
2023-06-15T02:57:45.265740Z INFO download: text_generation_launcher: Convert ~/ckpts/llama-hf/30b/pytorch_model-00005-of-00007.bin to ~/ckpts/llama-hf/30b/model-00005-of-00007.safetensors.
2023-06-15T03:01:38.281861Z INFO download: text_generation_launcher: Convert: [5/7] -- Took: 0:03:53.015641
2023-06-15T03:01:38.281982Z INFO download: text_generation_launcher: Convert ~/ckpts/llama-hf/30b/pytorch_model-00006-of-00007.bin to ~/ckpts/llama-hf/30b/model-00006-of-00007.safetensors.
2023-06-15T03:03:43.722396Z INFO download: text_generation_launcher: Convert: [6/7] -- Took: 0:02:05.440018
2023-06-15T03:03:43.722865Z INFO download: text_generation_launcher: Convert ~/ckpts/llama-hf/30b/pytorch_model-00007-of-00007.bin to ~/ckpts/llama-hf/30b/model-00007-of-00007.safetensors.
2023-06-15T03:04:36.163221Z ERROR text_generation_launcher: Download encountered an error: Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 151, in download_weights
utils.convert_files(local_pt_files, local_st_files)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 84, in convert_files
convert_file(pt_file, sf_file)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 65, in convert_file
check_file_size(pt_file, sf_file)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 20, in check_file_size
raise RuntimeError(
RuntimeError: The file size different is more than 1%:
- ~/ckpts/llama-hf/30b/pytorch_model-00007-of-00007.bin: 5900895281
- ~/ckpts/llama-hf/30b/model-00007-of-00007.safetensors: 5687891896
Error: DownloadError
Expected behavior
The launch works well for LLaMa 7B and 65B, so it's confusing why it won't work for 13B and 30B.
Facing the same issue as of yesterday.
Hmm, the check_file_size is a pretty rough sanity check; the file might actually be OK, but it's hard to tell without looking at it.
You can try deactivating the check? Remove line 20?
Can you point to the actual model on the Hub you're using too? Because we don't have any issue with "official" checkpoints.
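For context, the check in question is essentially a size comparison between the original pickle shard and the converted safetensors file. The sketch below is an approximation reconstructed from the error message in the log above, not the exact code in convert.py (the tolerance argument and exact formula are assumptions):
from pathlib import Path

def check_file_size(source_file: Path, target_file: Path, tolerance: float = 0.01):
    # Compare the original .bin shard with the converted .safetensors file
    # and give up if the sizes diverge by more than the tolerance.
    source_size = source_file.stat().st_size
    target_size = target_file.stat().st_size
    if abs(source_size - target_size) / source_size > tolerance:
        raise RuntimeError(
            f"The file size different is more than {int(tolerance * 100)}%:\n"
            f" - {source_file}: {source_size}\n"
            f" - {target_file}: {target_size}"
        )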
In my case, we're using https://huggingface.co/ausboss/llama-30b-supercot. Initially we had the issue with https://huggingface.co/OpenAssistant/oasst-sft-6-llama-30b-xor, however after a restart it miraculously worked.
Line 65 is the line that actually needs to be removed in order to skip the check for the safetensor file size. And here is a temporary patch for Docker for anyone that wants to disable it:
Dockerfile:
FROM ghcr.io/huggingface/text-generation-inference:0.8.2
RUN sed -i '65d' /opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py
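To apply it, build the patched image (for example, docker build -t tgi-patched . from the directory containing that Dockerfile, where tgi-patched is just an example tag) and run that tag instead of the upstream image.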
@Narsil
I'm using the official LLaMa checkpoints that I converted using the official script.
The docker image I'm using is ghcr.io/huggingface/text-generation-inference:latest
and here's the environment I used for converting the llama ckpts:
- python 3.10
- torch 2.1
- transformers 4.30.1
- accelerate 0.20.3
- sentencepiece 0.1.99
- protobuf 3.20.0
The funny thing is the code works well with 7B and 65B models but fails for 13B and 30B
Hmm, the check_file_size is a pretty rough sanity check; the file might actually be OK, but it's hard to tell without looking at it. You can try deactivating the check? Remove line 20?
Throws an InvalidHeaderDeserialization error. Loading the same files using LlamaForCausalLM works fine in a notebook.
The funny thing is the code works well with 7B and 65B models but fails for 13B and 30B
I converted 30B like 3 times today for quantization purposes without a hitch (I'm not on Docker, though).
Throws an InvalidHeaderDeserialization error. Loading the same files using LlamaForCausalLM works fine in a notebook.
You're trying to load pickle files (which would work for LlamaForCausalLM), it seems.
To reproduce the safetensors error with supercot:
git lfs install
git clone https://huggingface.co/ausboss/llama-30b-supercot
docker run -v ./llama-30b-supercot:/usr/src/llama-30b-supercot --gpus all --rm ghcr.io/huggingface/text-generation-inference:sha-5ce8905 --model-id "/usr/src/llama-30b-supercot" --quantize bitsandbytes --trust-remote-code
Note that the same behavior happens with the current latest tag as well as 0.8.2. It always fails for us on the first tensor, which is the smallest and is a pickle.
Interesting news. When I use a custom image (one that I built on my own instead of using the ghcr ones) I get the same sanity check error (which is obvious). However, in my custom-built image, I could easily disable L65, and got no error when loading the model.
We just tried the latest docker image on the llama-30b-supercot model and we still get this error on the very first bin, stopping the conversion:
2023-06-19T08:20:19.093178Z INFO download: text_generation_launcher: Convert /usr/src/llama-30b-supercot/pytorch_model-00001-of-00243.bin to /usr/src/llama-30b-supercot/model-00001-of-00243.safetensors.
2023-06-19T08:20:19.513324Z ERROR text_generation_launcher: Download encountered an error: Traceback (most recent call last):
Error: DownloadError
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 151, in download_weights
utils.convert_files(local_pt_files, local_st_files)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 84, in convert_files
convert_file(pt_file, sf_file)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 65, in convert_file
check_file_size(pt_file, sf_file)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/convert.py", line 20, in check_file_size
raise RuntimeError(
RuntimeError: The file size different is more than 5%:
- /usr/src/llama-30b-supercot/pytorch_model-00001-of-00243.bin: 537
- /usr/src/llama-30b-supercot/model-00001-of-00243.safetensors: 48
After applying the patch to remove L65 and skip the difference check, the safetensors get created but then this error occurs when the model is loading:
2023-06-19T09:12:29.311089Z INFO text_generation_launcher: Waiting for shard 1 to be ready...
2023-06-19T09:12:37.267433Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 237, in get_model
return FlashLlamaSharded(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 185, in __init__
self.load_weights(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 216, in load_weights
with safe_open(
safetensors_rust.SafetensorError: Error while deserializing header: InvalidHeaderDeserialization
rank=0
2023-06-19T09:12:37.267510Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 237, in get_model
return FlashLlamaSharded(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 185, in __init__
self.load_weights(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 216, in load_weights
with safe_open(
safetensors_rust.SafetensorError: Error while deserializing header: InvalidHeaderDeserialization
rank=1
2023-06-19T09:12:38.394993Z ERROR text_generation_launcher: Shard 0 failed to start:
Error: ShardCannotStart
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 237, in get_model
return FlashLlamaSharded(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 185, in __init__
self.load_weights(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_llama.py", line 216, in load_weights
with safe_open(
safetensors_rust.SafetensorError: Error while deserializing header: InvalidHeaderDeserialization
This behavior is the same as we had with the previous docker images. Any idea what it could be and what we can try?
Can you clean the cache and re-try? Maybe the file was corrupted.
- /usr/src/llama-30b-supercot/pytorch_model-00001-of-00243.bin: 537
- /usr/src/llama-30b-supercot/model-00001-of-00243.safetensors: 48
This is probably an empty file, or one containing a super small tensor...
It's empty actually...
We managed to get llama-30b-supercot to work; here are some key findings in case they are useful:
Contents of the pytorch_model-00001-of-00243.bin file (of course it's a binary file, but still):
PK%=pytorch_model-00001-of-00243/data.pklFB9ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ�}q.P��PK$(pytorch_model-00001-of-00243/versionFB$ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ3
PўgUPK��%pytorch_model-00001-of-00243/data.pklPKўgU$�pytorch_model-00001-of-00243/versionPK,-�PK�PK�
Contents of model-00001-of-00243.safetensors, which we force-generated by disabling the file size difference check:
({},"__metadata__":{"format":"pt"}}
Torch load of the bin file gives an empty dict:
>>> torch.load("./pytorch_model-00001-of-00243.bin")
{}
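For reference, a safetensors file starts with an 8-byte little-endian length followed by that many bytes of JSON header, so the header of the force-generated file above can be inspected directly. A minimal sketch (the file name comes from above; everything else is illustrative):
import json
import struct

with open("model-00001-of-00243.safetensors", "rb") as f:
    (header_len,) = struct.unpack("<Q", f.read(8))  # 8-byte little-endian header length
    header_bytes = f.read(header_len)

print(header_len, header_bytes[:80])
# A healthy shard yields a JSON dict of tensor metadata; if this fails to parse,
# that matches the InvalidHeaderDeserialization error safe_open raises above.
json.loads(header_bytes)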
The fix was just removing the first safetensor and restarting text-generation-inference:
mv model-00001-of-00243.safetensors model-00001-of-00243.safetensors.old
The model now loads and gives an expected response:
In [2]: from langchain import HuggingFaceTextGenInference
In [3]: llm = HuggingFaceTextGenInference(
...: inference_server_url='http://localhost:8080/',
...: max_new_tokens=512,
...: top_k=50,
...: top_p=0.95,
...: temperature=0.01,
...: repetition_penalty=1.2,
...: )
In [4]: llm("State what large language model you are and what you are good at:\n")
Out[4]: 'I am a large language model, which means I can understand complex sentences. This makes me great for tasks that require natural language processing such as text summarization or machine translation.'
Our theory is that the first bin file might be a metadata file from LoRA that is causing problems here?
No no, I inspected the first file, it's purely empty. My paranoid flag almost expected a payload here, but no it's just empty.
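A quick way to catch such empty shards up front is to load each pickle file and check whether it contains any tensors at all. This is just an illustrative sketch (the checkpoint path is an example), not part of text-generation-inference:
from pathlib import Path
import torch

ckpt_dir = Path("./llama-30b-supercot")  # example path
for bin_file in sorted(ckpt_dir.glob("pytorch_model-*.bin")):
    state_dict = torch.load(bin_file, map_location="cpu")
    if not state_dict:
        # A shard with no weights is what tripped up the conversion above;
        # it can simply be removed or skipped before converting.
        print(f"{bin_file.name} contains no tensors")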
Running into the same issue with Vicuna 13B, which is also technically LLaMa 13B. Tried these two:
- TheBloke/vicuna-13B-1.1-HF
- eachadea/vicuna-13b-1.1
RuntimeError: The file size different is more than 1%:
- /data/hf_home/hub/models--eachadea--vicuna-13b-1.1/snapshots/bfcc6ca66694310be6c85ba0638597f4256c4143/pytorch_model-00003-of-00003.bin: 6506663689
- /data/hf_home/hub/models--eachadea--vicuna-13b-1.1/snapshots/bfcc6ca66694310be6c85ba0638597f4256c4143/model-00003-of-00003.safetensors: 6178962272
Running into the same issue with Vicuna 13B, which is also technically LLaMa 13B. Tried these two: TheBloke/vicuna-13B-1.1-HF, eachadea/vicuna-13b-1.1
RuntimeError: The file size different is more than 1%:
- /data/hf_home/hub/models--eachadea--vicuna-13b-1.1/snapshots/bfcc6ca66694310be6c85ba0638597f4256c4143/pytorch_model-00003-of-00003.bin: 6506663689
- /data/hf_home/hub/models--eachadea--vicuna-13b-1.1/snapshots/bfcc6ca66694310be6c85ba0638597f4256c4143/model-00003-of-00003.safetensors: 6178962272
With this commit the threshold was increased to 5%, but manually calculating the difference with your files it comes out to ~5.3%, so that won't fix it. To disable the check and force the safetensors to be generated, try removing line 65 in server/text_generation_server/utils/convert.py if you are running it directly, or, if you are using Docker, use the Dockerfile I posted above to extend and patch the image.
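If you want to check your own shards against the threshold before patching anything, the relative difference can be computed directly. A small sketch (paths are examples; the exact formula convert.py uses may differ slightly):
from pathlib import Path

pt = Path("pytorch_model-00003-of-00003.bin")      # example paths
sf = Path("model-00003-of-00003.safetensors")
pt_size, sf_size = pt.stat().st_size, sf.stat().st_size
# Relative difference between the original pickle shard and the converted file.
diff = abs(pt_size - sf_size) / pt_size
print(f"relative size difference: {diff:.1%}")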