Custom model: RuntimeError: weight shared.weight does not exist
System Info
Tue Jul 4 16:51:59 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03 Driver Version: 530.41.03 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 Off| 00000000:21:00.0 On | N/A |
| 0% 51C P8 51W / 390W| 1047MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1762 G /usr/lib/xorg/Xorg 24MiB |
| 0 N/A N/A 2178 G /usr/bin/gnome-shell 83MiB |
| 0 N/A N/A 3994 G /usr/lib/xorg/Xorg 451MiB |
| 0 N/A N/A 4140 G /usr/bin/gnome-shell 50MiB |
| 0 N/A N/A 4827 G ...,WinRetrieveSuggestionsOnlyOnDemand 65MiB |
| 0 N/A N/A 5061 G ...9470975,14709274054277858675,262144 96MiB |
| 0 N/A N/A 35735 G /snap/thunderbird/339/thunderbird-bin 87MiB |
| 0 N/A N/A 36507 G ...sion,SpareRendererForSitePerProcess 40MiB |
| 0 N/A N/A 42817 G ...ures=SpareRendererForSitePerProcess 36MiB |
| 0 N/A N/A 47573 G ...ures=SpareRendererForSitePerProcess 92MiB |
| 0 N/A N/A 67787 G /usr/lib/firefox/firefox 11MiB |
+---------------------------------------------------------------------------------------+
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
When launching TGI on a custom model derived from lmsys/fastchat-t5-3b-v1.0 with the following command:
docker run --rm --network none --gpus 0 -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-generation-inference:latest --model-id /data/fastchat-t5-3b-v1.0
I got the following error message:
latest: Pulling from huggingface/text-generation-inference
Digest: sha256:29019a087e64ce951a6c9ca3b17a6823dfd9d25eeb56ec06c08150516fd60f0b
Status: Image is up to date for ghcr.io/huggingface/text-generation-inference:latest
2023-07-04T14:50:00.870189Z INFO text_generation_launcher: Args { model_id: "/data/fastchat-t5-3b-v1.0", revision: None, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: 16000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_domain: None, ngrok_username: None, ngrok_password: None, env: false }
2023-07-04T14:50:00.870282Z INFO text_generation_launcher: Starting download process.
2023-07-04T14:50:02.000718Z INFO download: text_generation_launcher: Files are already present on the host. Skipping download.
2023-07-04T14:50:02.371986Z INFO text_generation_launcher: Successfully downloaded weights.
2023-07-04T14:50:02.372146Z INFO text_generation_launcher: Starting shard 0
2023-07-04T14:50:04.072895Z WARN shard-manager: text_generation_launcher: We're not using custom kernels.
rank=0
2023-07-04T14:50:04.214047Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py", line 1005, in __init__
self.shared = TensorParallelEmbedding(prefix="shared", weights=weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 280, in __init__
weight = weights.get_sharded(f"{prefix}.weight", dim=0)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 73, in get_sharded
filename, tensor_name = self.get_filename(tensor_name)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight shared.weight does not exist
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
server.serve(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 166, in serve
asyncio.run(
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 133, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 279, in get_model
return T5Sharded(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/t5.py", line 61, in __init__
model = T5ForConditionalGeneration(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py", line 1007, in __init__
self.shared = TensorParallelEmbedding(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 280, in __init__
weight = weights.get_sharded(f"{prefix}.weight", dim=0)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 73, in get_sharded
filename, tensor_name = self.get_filename(tensor_name)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight encoder.embed_tokens.weight does not exist
rank=0
2023-07-04T14:50:04.673754Z ERROR text_generation_launcher: Shard 0 failed to start
2023-07-04T14:50:04.673779Z ERROR text_generation_launcher: Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py", line 1005, in __init__
self.shared = TensorParallelEmbedding(prefix="shared", weights=weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 280, in __init__
weight = weights.get_sharded(f"{prefix}.weight", dim=0)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 73, in get_sharded
filename, tensor_name = self.get_filename(tensor_name)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight shared.weight does not exist
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
server.serve(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 166, in serve
asyncio.run(
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 133, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 279, in get_model
return T5Sharded(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/t5.py", line 61, in __init__
model = T5ForConditionalGeneration(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py", line 1007, in __init__
self.shared = TensorParallelEmbedding(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 280, in __init__
weight = weights.get_sharded(f"{prefix}.weight", dim=0)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 73, in get_sharded
filename, tensor_name = self.get_filename(tensor_name)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight encoder.embed_tokens.weight does not exist
2023-07-04T14:50:04.673806Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
Expected behavior
I'd like to run TGI on my custom model on my RTX-3090 GPU.
Same problem here: when I use BLOOM combined with a LoRA adapter, I receive this error:
RuntimeError: weight word_embeddings.weight does not exist
I've tried with the original BLOOM and it does not happen.
I got a similar error when loading WizardCoder with the quantize flag; without quantization everything works fine.
RuntimeError: weight transformer.h.0.attn.c_attn.qweight does not exist
run:
text-generation-launcher --model-id wizardcoder --sharded false --port 8080 --quantize gptq
Same with a LoRA-merged Falcon.
Happened to me as well; I "fixed it" by reverting to the 0.8 version of the Docker container, so it seems specific to version 0.9.
@ckanaar thanks for the advice. It works for me too.
Hi @PitchboyDev - following up on this for deploying LoRA-merged Falcon models to TGI. How did you manage to deploy the model by downgrading TGI to 0.8? When I deploy using 0.8 or 0.8.2 I get this error:
AttributeError: FlashRWForCausalLM has no attribute 'model'
However, when I use 0.9.2 or 0.9.3 I get the same error as you:
RuntimeError: weight lm_head.weight does not exist.
Any insight on how you solved this? Thanks!
@rohan-pradhan which version of Falcon are you using? For us, we used the 7B version and downgrading to version 0.8 did the trick. Maybe it's a configuration problem?
@PitchboyDev - yes, we are using Falcon 7B too!
It may be a problem related to safetensors and torch shared tensors: https://huggingface.co/docs/safetensors/torch_shared_tensors
Because I had a similar error when trying to manually save my model in safetensors using the save_file method:
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'transformer.word_embeddings.weight', 'lm_head.weight'}].
A potential way to correctly save your model is to use `save_model`.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
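The duplicate-memory check the error describes essentially groups tensor names by the storage they point at. A minimal sketch of that idea (assuming torch is installed; the tied names and module are illustrative, not the exact safetensors implementation):

```python
import torch
from collections import defaultdict

# Tie two names to one tensor, as happens with word embeddings and lm_head
# when the output head is weight-tied to the input embeddings.
emb = torch.nn.Linear(8, 4, bias=False)
state = {"word_embeddings.weight": emb.weight, "lm_head.weight": emb.weight}

# Group names by underlying storage; any group with more than one name is a
# shared tensor of the kind save_file refuses to serialize.
by_storage = defaultdict(list)
for name, tensor in state.items():
    by_storage[tensor.data_ptr()].append(name)

shared = [names for names in by_storage.values() if len(names) > 1]
print(shared)
```

As the linked docs explain, save_model handles such groups (writing only one copy), which is why it succeeds where save_file raises.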
Now I save it with save_model, and TGI gives me this kind of error:
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 53, in get_filename
raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight transformer.word_embeddings.weight does not exist
EDIT: Solved it for me by generating the safetensors with the transformers save_pretrained function, adding the parameter safe_serialization=True:
model.save_pretrained(OUTPUTS_PATH, safe_serialization=True)
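When debugging these "weight X does not exist" errors, it can help to check which tensor names actually ended up in the .safetensors file. A pure-stdlib sketch that reads only the file's header (the format starts with an 8-byte little-endian header length, followed by a JSON header); the path is whatever checkpoint you are inspecting:

```python
import json
import struct

def safetensors_keys(path):
    """Return the tensor names stored in a .safetensors file.

    The file begins with an 8-byte little-endian header length,
    followed by that many bytes of JSON mapping tensor names to
    dtype/shape/offset metadata (plus an optional "__metadata__" entry).
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return [name for name in header if name != "__metadata__"]
```

Comparing this list against the name in the error (e.g. shared.weight vs. encoder.embed_tokens.weight) shows immediately whether the checkpoint's names or the loader's expected prefix is at fault.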
Thanks for sharing your solution!
Hello,
Same issue here: we are trying to run our custom model with TGI (https://huggingface.co/cmarkea/bloomz-560m-sft-chat). The model runs fine with TGI up to version 0.8.*. Starting from 0.9.0 we get the same error:
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 147, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 147, in get_model
return BLOOMSharded(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/bloom.py", line 82, in __init__
model = BloomForCausalLM(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/bloom_modeling.py", line 818, in __init__
self.transformer = BloomModel(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/bloom_modeling.py", line 609, in __init__
self.word_embeddings = TensorParallelEmbedding(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 375, in __init__
weight = weights.get_partial_sharded(f"{prefix}.weight", dim=0)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 77, in get_partial_sharded
filename, tensor_name = self.get_filename(tensor_name)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 53, in get_filename
raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight word_embeddings.weight does not exist
rank=0
Error: ShardCannotStart
2023-09-27T12:59:44.433561Z ERROR text_generation_launcher: Shard 0 failed to start
2023-09-27T12:59:44.433595Z INFO text_generation_launcher: Shutting down shards
Our weights have the format "transformer.word_embeddings.weight", not "word_embeddings.weight" as the error suggests.
So it looks like the base_model_prefix is not configured properly.
Would it be possible to set base_model_prefix="transformer" by default for BloomModel, as is done for BloomPreTrainedModel? Or could a CLI argument be added to specify the weight prefix?
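Until the prefix is configurable, one hypothetical offline workaround is to rewrite the checkpoint's tensor names so they match what the 0.9.x loader asks for. A sketch operating on a plain name-to-tensor mapping (the function name and the prefix-stripping idea are mine, not a TGI feature; tied weights such as lm_head may need separate handling):

```python
def strip_base_model_prefix(state_dict, prefix="transformer."):
    """Drop a leading prefix from tensor names, leaving other names intact."""
    return {
        (name[len(prefix):] if name.startswith(prefix) else name): tensor
        for name, tensor in state_dict.items()
    }

# "transformer.word_embeddings.weight" becomes "word_embeddings.weight",
# matching the name the loader reports as missing.
renamed = strip_base_model_prefix({"transformer.word_embeddings.weight": None})
print(sorted(renamed))
```

The renamed mapping can then be re-saved (e.g. with save_pretrained and safe_serialization=True, as above) before pointing TGI at it.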
Looking forward to testing the latest versions' features :rocket: Thanks!