Latest Docker image fails while initializing gemma2
System Info
I tried the following setups; both fail with the same exception:
- ghcr.io/huggingface/text-generation-inference:sha-6aebf44 locally with Docker on an NVIDIA RTX 3600
- ghcr.io/huggingface/text-generation-inference:sha-6aebf44 on a Kubernetes cluster with an NVIDIA A40
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
docker run --gpus all --shm-size 1g -p 8080:80 ghcr.io/huggingface/text-generation-inference:sha-6aebf44 --model-id google/gemma-2-9b-it
2024-07-22T15:30:59.895904Z INFO download: text_generation_launcher: Successfully downloaded weights for google/gemma-2-9b-it
2024-07-22T15:30:59.897225Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-07-22T15:31:09.917300Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-07-22T15:31:10.682538Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 951, in from_pretrained
config_class = CONFIG_MAPPING[config_dict["model_type"]]
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 653, in __getitem__
raise KeyError(key)
KeyError: 'gemma2'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 749, in get_model
return FlashCausalLM(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 878, in __init__
config = config_class.from_pretrained(
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 953, in from_pretrained
raise ValueError(
ValueError: The checkpoint you are trying to load has model type `gemma2` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
2024-07-22T15:31:11.520122Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2024-07-22 15:31:01.561 | INFO | text_generation_server.utils.import_utils:<module>:75 - Detected system cuda
/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 951, in from_pretrained
config_class = CONFIG_MAPPING[config_dict["model_type"]]
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 653, in __getitem__
raise KeyError(key)
KeyError: 'gemma2'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 749, in get_model
return FlashCausalLM(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 878, in __init__
config = config_class.from_pretrained(
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 953, in from_pretrained
raise ValueError(
ValueError: The checkpoint you are trying to load has model type `gemma2` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
rank=0
Error: ShardCannotStart
2024-07-22T15:31:11.616687Z ERROR text_generation_launcher: Shard 0 failed to start
2024-07-22T15:31:11.616777Z INFO text_generation_launcher: Shutting down shards
Using text-generation-inference:2.1.1, the model initializes correctly, even though both images ship the same transformers version.
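To sanity-check that, here is a minimal diagnostic sketch (not something TGI runs itself) that can be executed inside each image, e.g. by overriding the entrypoint with `python3`, to see whether the bundled transformers build actually registers the `gemma2` model type:

```python
# Diagnostic sketch only: verify whether the transformers build inside the image
# registers the `gemma2` model type (the KeyError above means it does not).
import transformers
from transformers.models.auto.configuration_auto import CONFIG_MAPPING

print("transformers version:", transformers.__version__)
print("gemma2 registered:", "gemma2" in CONFIG_MAPPING)
```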
Expected behavior
The model initializes correctly.
Thanks for reporting this 👍
There were some issues with Gemma; I think this patch might address this one as well.
Could you confirm if this is the case?
I now get a different error, both in the latest image and in 2.2.0.
2024-08-08T14:13:11.002710Z INFO text_generation_launcher: Args {
model_id: "google/gemma-2-9b-it",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: None,
quantize: None,
speculate: None,
dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: None,
max_total_tokens: None,
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: None,
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "text-generation-inference2-755f5778bf-k9b86",
port: 8080,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: Some(
"/data",
),
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
lora_adapters: None,
disable_usage_stats: false,
disable_crash_reports: false,
}
2024-08-08T14:13:11.002898Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-08-08T14:13:11.175329Z INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-08-08T14:13:11.175401Z INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-08-08T14:13:11.175410Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-08-08T14:13:11.175416Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-08-08T14:13:11.175944Z INFO download: text_generation_launcher: Starting check and download process for google/gemma-2-9b-it
2024-08-08T14:13:14.195643Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-08-08T14:13:15.387056Z INFO download: text_generation_launcher: Successfully downloaded weights for google/gemma-2-9b-it
2024-08-08T14:13:15.387857Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-08-08T14:13:25.483545Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:13:35.578015Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:13:45.585828Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:13:55.588556Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:14:05.686872Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:14:15.778758Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:14:25.780629Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:14:35.785662Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:14:40.223288Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-08-08T14:14:40.277449Z INFO shard-manager: text_generation_launcher: Shard ready in 84.886492494s rank=0
2024-08-08T14:14:40.283759Z INFO text_generation_launcher: Starting Webserver
2024-08-08T14:14:40.320297Z INFO text_generation_router: router/src/main.rs:228: Using the Hugging Face API
2024-08-08T14:14:40.320346Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-08-08T14:14:40.839350Z INFO text_generation_router: router/src/main.rs:577: Serving revision 4efc01a1a58107f8c7f68027f5d8e475dfc34a6f of model google/gemma-2-9b-it
2024-08-08T14:14:41.433827Z INFO text_generation_router: router/src/main.rs:357: Using config Some(Gemma2)
2024-08-08T14:14:41.433853Z WARN text_generation_router: router/src/main.rs:384: Invalid hostname, defaulting to 0.0.0.0
2024-08-08T14:14:41.439120Z INFO text_generation_router::server: router/src/server.rs:1572: Warming up model
2024-08-08T14:14:42.767141Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
2024-08-08T14:14:42.772236Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 125, in Warmup
max_supported_total_tokens = self.model.warmup(batch)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1196, in warmup
self.cuda_graph_warmup(bs, max_s, max_bt)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1051, in cuda_graph_warmup
self.model.forward(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_gemma2_modeling.py", line 490, in forward
hidden_states = self.model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_gemma2_modeling.py", line 427, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_gemma2_modeling.py", line 355, in forward
attn_output = self.self_attn(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_gemma2_modeling.py", line 254, in forward
attn_output = paged_attention(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/attention/cuda.py", line 115, in paged_attention
raise RuntimeError("Paged attention doesn't support softcapping")
RuntimeError: Paged attention doesn't support softcapping
2024-08-08T14:14:42.995889Z ERROR warmup{max_input_length=4095 max_prefill_tokens=4145 max_total_tokens=4096 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
Error: WebServer(Warmup(Generation("CANCELLED")))
2024-08-08T14:14:43.119606Z ERROR text_generation_launcher: Webserver Crashed
2024-08-08T14:14:43.119690Z INFO text_generation_launcher: Shutting down shards
Ah, it seems this version doesn't have softcapping support yet (https://github.com/huggingface/text-generation-inference/pull/2273). I'd recommend using the latest TGI version.
Would that work for you?
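For context, the softcapping mentioned above is Gemma 2's attention logit soft-capping, which the paged-attention path rejects in the traceback. A minimal sketch of the math (illustrative only, not TGI's actual kernel; the cap value of 50.0 is the `attn_logit_softcapping` from the Gemma 2 config):

```python
# Illustrative sketch of Gemma 2's attention logit soft-capping, the feature
# that paged attention rejects in the traceback above. Not TGI's actual kernel.
import torch

def softcap(scores: torch.Tensor, cap: float = 50.0) -> torch.Tensor:
    # Squash raw attention logits into (-cap, cap) before the softmax;
    # `cap` corresponds to `attn_logit_softcapping` in the Gemma 2 config.
    return cap * torch.tanh(scores / cap)

print(softcap(torch.tensor([1.0, 10.0, 100.0])))  # large logits are compressed to just under 50
```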
Can you take a look at #2763?