Custom model: RuntimeError: weight shared.weight does not exist
System Info
Tue Jul 4 16:51:59 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03 Driver Version: 530.41.03 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 Off| 00000000:21:00.0 On | N/A |
| 0% 51C P8 51W / 390W| 1047MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1762 G /usr/lib/xorg/Xorg 24MiB |
| 0 N/A N/A 2178 G /usr/bin/gnome-shell 83MiB |
| 0 N/A N/A 3994 G /usr/lib/xorg/Xorg 451MiB |
| 0 N/A N/A 4140 G /usr/bin/gnome-shell 50MiB |
| 0 N/A N/A 4827 G ...,WinRetrieveSuggestionsOnlyOnDemand 65MiB |
| 0 N/A N/A 5061 G ...9470975,14709274054277858675,262144 96MiB |
| 0 N/A N/A 35735 G /snap/thunderbird/339/thunderbird-bin 87MiB |
| 0 N/A N/A 36507 G ...sion,SpareRendererForSitePerProcess 40MiB |
| 0 N/A N/A 42817 G ...ures=SpareRendererForSitePerProcess 36MiB |
| 0 N/A N/A 47573 G ...ures=SpareRendererForSitePerProcess 92MiB |
| 0 N/A N/A 67787 G /usr/lib/firefox/firefox 11MiB |
+---------------------------------------------------------------------------------------+
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
When launching TGI on a custom model derived from lmsys/fastchat-t5-3b-v1.0 with the following command:
docker run --rm --network none --gpus 0 -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-generation-inference:latest --model-id /data/fastchat-t5-3b-v1.0
I got the following error message:
latest: Pulling from huggingface/text-generation-inference
Digest: sha256:29019a087e64ce951a6c9ca3b17a6823dfd9d25eeb56ec06c08150516fd60f0b
Status: Image is up to date for ghcr.io/huggingface/text-generation-inference:latest
2023-07-04T14:50:00.870189Z INFO text_generation_launcher: Args { model_id: "/data/fastchat-t5-3b-v1.0", revision: None, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: 16000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_domain: None, ngrok_username: None, ngrok_password: None, env: false }
2023-07-04T14:50:00.870282Z INFO text_generation_launcher: Starting download process.
2023-07-04T14:50:02.000718Z INFO download: text_generation_launcher: Files are already present on the host. Skipping download.
2023-07-04T14:50:02.371986Z INFO text_generation_launcher: Successfully downloaded weights.
2023-07-04T14:50:02.372146Z INFO text_generation_launcher: Starting shard 0
2023-07-04T14:50:04.072895Z WARN shard-manager: text_generation_launcher: We're not using custom kernels.
rank=0
2023-07-04T14:50:04.214047Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py", line 1005, in __init__
self.shared = TensorParallelEmbedding(prefix="shared", weights=weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 280, in __init__
weight = weights.get_sharded(f"{prefix}.weight", dim=0)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 73, in get_sharded
filename, tensor_name = self.get_filename(tensor_name)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight shared.weight does not exist
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
server.serve(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 166, in serve
asyncio.run(
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 133, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 279, in get_model
return T5Sharded(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/t5.py", line 61, in __init__
model = T5ForConditionalGeneration(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py", line 1007, in __init__
self.shared = TensorParallelEmbedding(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 280, in __init__
weight = weights.get_sharded(f"{prefix}.weight", dim=0)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 73, in get_sharded
filename, tensor_name = self.get_filename(tensor_name)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight encoder.embed_tokens.weight does not exist
rank=0
2023-07-04T14:50:04.673754Z ERROR text_generation_launcher: Shard 0 failed to start
2023-07-04T14:50:04.673779Z ERROR text_generation_launcher: Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py", line 1005, in __init__
self.shared = TensorParallelEmbedding(prefix="shared", weights=weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 280, in __init__
weight = weights.get_sharded(f"{prefix}.weight", dim=0)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 73, in get_sharded
filename, tensor_name = self.get_filename(tensor_name)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight shared.weight does not exist
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 78, in serve
server.serve(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 166, in serve
asyncio.run(
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 133, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 279, in get_model
return T5Sharded(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/t5.py", line 61, in __init__
model = T5ForConditionalGeneration(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/t5_modeling.py", line 1007, in __init__
self.shared = TensorParallelEmbedding(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 280, in __init__
weight = weights.get_sharded(f"{prefix}.weight", dim=0)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 73, in get_sharded
filename, tensor_name = self.get_filename(tensor_name)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 49, in get_filename
raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight encoder.embed_tokens.weight does not exist
2023-07-04T14:50:04.673806Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
Expected behavior
I'd like to run TGI on my custom model on my RTX-3090 GPU.
Same problem here: when I use BLOOM combined with a LoRA adapter, I receive this error:
RuntimeError: weight word_embeddings.weight does not exist
I've tried with the original BLOOM and it does not happen.
I got a similar error when loading WizardCoder with the quantize flag; without quantization everything works fine.
RuntimeError: weight transformer.h.0.attn.c_attn.qweight does not exist
run:
text-generation-launcher --model-id wizardcoder --sharded false --port 8080 --quantize gptq
Same with a LoRA-merged Falcon.
Happened to me as well; I "fixed it" by reverting to the 0.8 version of the Docker container, so it seems specific to version 0.9.
@ckanaar thanks for the advice. It works for me too.
Hi @PitchboyDev - following up on this for deploying LoRA-merged Falcon models to TGI. How did you manage to deploy the model by downgrading TGI to 0.8? When I deploy using 0.8 or 0.8.2 I get this error:
AttributeError: FlashRWForCausalLM has no attribute 'model'
However, when I use 0.9.2 or 0.9.3 I get the same error as you:
RuntimeError: weight lm_head.weight does not exist.
Any insight on how you solved this? Thanks!
@rohan-pradhan which version of Falcon are you using? For us, we used the 7B version and downgrading to version 0.8 did the trick. Maybe it's a configuration problem?
@PitchboyDev - yes, we are using Falcon 7B too!
It may be a problem related to safetensors and torch shared tensors: https://huggingface.co/docs/safetensors/torch_shared_tensors
Because I had a similar error when trying to manually save my model in safetensors using the save_file method:
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'transformer.word_embeddings.weight', 'lm_head.weight'}].
A potential way to correctly save your model is to use `save_model`.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
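The duplicate-memory check the error describes essentially groups tensor names by the storage they point at. A minimal sketch of that idea (assuming torch is installed; the tied names and module are illustrative, not the exact safetensors implementation):

```python
import torch
from collections import defaultdict

# Tie two names to one tensor, as happens with word embeddings and lm_head
# when the output head is weight-tied to the input embeddings.
emb = torch.nn.Linear(8, 4, bias=False)
state = {"word_embeddings.weight": emb.weight, "lm_head.weight": emb.weight}

# Group names by underlying storage; any group with more than one name is a
# shared tensor of the kind save_file refuses to serialize.
by_storage = defaultdict(list)
for name, tensor in state.items():
    by_storage[tensor.data_ptr()].append(name)

shared = [names for names in by_storage.values() if len(names) > 1]
print(shared)
```

As the linked docs explain, save_model handles such groups (writing only one copy), which is why it succeeds where save_file raises.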
Now I save it with save_model, and TGI gives me this kind of error:
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 53, in get_filename
raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight transformer.word_embeddings.weight does not exist
EDIT: Solved it for me by generating the safetensors with the transformers save_pretrained function, adding the parameter safe_serialization=True:
model.save_pretrained(OUTPUTS_PATH, safe_serialization=True)
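When debugging these "weight X does not exist" errors, it can help to check which tensor names actually ended up in the .safetensors file. A pure-stdlib sketch that reads only the file's header (the format starts with an 8-byte little-endian header length, followed by a JSON header); the path is whatever checkpoint you are inspecting:

```python
import json
import struct

def safetensors_keys(path):
    """Return the tensor names stored in a .safetensors file.

    The file begins with an 8-byte little-endian header length,
    followed by that many bytes of JSON mapping tensor names to
    dtype/shape/offset metadata (plus an optional "__metadata__" entry).
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return [name for name in header if name != "__metadata__"]
```

Comparing this list against the name in the error (e.g. shared.weight vs. encoder.embed_tokens.weight) shows immediately whether the checkpoint's names or the loader's expected prefix is at fault.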
Thanks for sharing your solution!
Hello,
Same issue here: we are trying to run our custom model with TGI (https://huggingface.co/cmarkea/bloomz-560m-sft-chat). The model runs fine with TGI up to version 0.8.*. Starting from 0.9.0 we get the same error:
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 147, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 147, in get_model
return BLOOMSharded(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/bloom.py", line 82, in __init__
model = BloomForCausalLM(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/bloom_modeling.py", line 818, in __init__
self.transformer = BloomModel(config, weights)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/bloom_modeling.py", line 609, in __init__
self.word_embeddings = TensorParallelEmbedding(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 375, in __init__
weight = weights.get_partial_sharded(f"{prefix}.weight", dim=0)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 77, in get_partial_sharded
filename, tensor_name = self.get_filename(tensor_name)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/weights.py", line 53, in get_filename
raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight word_embeddings.weight does not exist
rank=0
Error: ShardCannotStart
2023-09-27T12:59:44.433561Z ERROR text_generation_launcher: Shard 0 failed to start
2023-09-27T12:59:44.433595Z INFO text_generation_launcher: Shutting down shards
Our weights have the format "transformer.word_embeddings.weight", not "word_embeddings.weight" as the error suggests.
So it looks like the base_model_prefix is not configured properly.
Would it be possible to set base_model_prefix="transformer" by default for BloomModel, as is done for BloomPreTrainedModel? Or could a CLI argument be added to specify the weight prefix?
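Until the prefix is configurable, one hypothetical offline workaround is to rewrite the checkpoint's tensor names so they match what the 0.9.x loader asks for. A sketch operating on a plain name-to-tensor mapping (the function name and the prefix-stripping idea are mine, not a TGI feature; tied weights such as lm_head may need separate handling):

```python
def strip_base_model_prefix(state_dict, prefix="transformer."):
    """Drop a leading prefix from tensor names, leaving other names intact."""
    return {
        (name[len(prefix):] if name.startswith(prefix) else name): tensor
        for name, tensor in state_dict.items()
    }

# "transformer.word_embeddings.weight" becomes "word_embeddings.weight",
# matching the name the loader reports as missing.
renamed = strip_base_model_prefix({"transformer.word_embeddings.weight": None})
print(sorted(renamed))
```

The renamed mapping can then be re-saved (e.g. with save_pretrained and safe_serialization=True, as above) before pointing TGI at it.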
Looking forward to testing the latest versions' features :rocket: Thanks!