Latest Docker image fails while initializing gemma2
System Info
I tried the following setups; both fail with the same exception:
- ghcr.io/huggingface/text-generation-inference:sha-6aebf44 locally with Docker on an NVIDIA RTX 3600
- ghcr.io/huggingface/text-generation-inference:sha-6aebf44 on a Kubernetes cluster with an NVIDIA A40
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
docker run --gpus all --shm-size 1g -p 8080:80 ghcr.io/huggingface/text-generation-inference:sha-6aebf44 --model-id google/gemma-2-9b-it
2024-07-22T15:30:59.895904Z INFO download: text_generation_launcher: Successfully downloaded weights for google/gemma-2-9b-it
2024-07-22T15:30:59.897225Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-07-22T15:31:09.917300Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-07-22T15:31:10.682538Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 951, in from_pretrained
config_class = CONFIG_MAPPING[config_dict["model_type"]]
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 653, in __getitem__
raise KeyError(key)
KeyError: 'gemma2'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 749, in get_model
return FlashCausalLM(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 878, in __init__
config = config_class.from_pretrained(
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 953, in from_pretrained
raise ValueError(
ValueError: The checkpoint you are trying to load has model type `gemma2` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
2024-07-22T15:31:11.520122Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2024-07-22 15:31:01.561 | INFO | text_generation_server.utils.import_utils:<module>:75 - Detected system cuda
/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 951, in from_pretrained
config_class = CONFIG_MAPPING[config_dict["model_type"]]
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 653, in __getitem__
raise KeyError(key)
KeyError: 'gemma2'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 749, in get_model
return FlashCausalLM(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 878, in __init__
config = config_class.from_pretrained(
File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 953, in from_pretrained
raise ValueError(
ValueError: The checkpoint you are trying to load has model type `gemma2` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
rank=0
Error: ShardCannotStart
2024-07-22T15:31:11.616687Z ERROR text_generation_launcher: Shard 0 failed to start
2024-07-22T15:31:11.616777Z INFO text_generation_launcher: Shutting down shards
Using text-generation-inference:2.1.1, the model initializes correctly, even though both images ship the same transformers version.
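To sanity-check that, here is a minimal diagnostic sketch (not something TGI runs itself) that can be executed inside each image, e.g. by overriding the entrypoint with `python3`, to see whether the bundled transformers build actually registers the `gemma2` model type:

```python
# Diagnostic sketch only: verify whether the transformers build inside the image
# registers the `gemma2` model type (the KeyError above means it does not).
import transformers
from transformers.models.auto.configuration_auto import CONFIG_MAPPING

print("transformers version:", transformers.__version__)
print("gemma2 registered:", "gemma2" in CONFIG_MAPPING)
```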
Expected behavior
The model initializes correctly.
Thanks for reporting this 👍
There were some issues with Gemma; I think this patch might address this one as well.
Could you confirm if this is the case?
I now get a different error, both in the latest image and in 2.2.0.
2024-08-08T14:13:11.002710Z INFO text_generation_launcher: Args {
model_id: "google/gemma-2-9b-it",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: None,
quantize: None,
speculate: None,
dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: None,
max_total_tokens: None,
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: None,
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "text-generation-inference2-755f5778bf-k9b86",
port: 8080,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: Some(
"/data",
),
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
lora_adapters: None,
disable_usage_stats: false,
disable_crash_reports: false,
}
2024-08-08T14:13:11.002898Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-08-08T14:13:11.175329Z INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-08-08T14:13:11.175401Z INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-08-08T14:13:11.175410Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-08-08T14:13:11.175416Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-08-08T14:13:11.175944Z INFO download: text_generation_launcher: Starting check and download process for google/gemma-2-9b-it
2024-08-08T14:13:14.195643Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-08-08T14:13:15.387056Z INFO download: text_generation_launcher: Successfully downloaded weights for google/gemma-2-9b-it
2024-08-08T14:13:15.387857Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-08-08T14:13:25.483545Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:13:35.578015Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:13:45.585828Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:13:55.588556Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:14:05.686872Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:14:15.778758Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:14:25.780629Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:14:35.785662Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-08-08T14:14:40.223288Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-08-08T14:14:40.277449Z INFO shard-manager: text_generation_launcher: Shard ready in 84.886492494s rank=0
2024-08-08T14:14:40.283759Z INFO text_generation_launcher: Starting Webserver
2024-08-08T14:14:40.320297Z INFO text_generation_router: router/src/main.rs:228: Using the Hugging Face API
2024-08-08T14:14:40.320346Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-08-08T14:14:40.839350Z INFO text_generation_router: router/src/main.rs:577: Serving revision 4efc01a1a58107f8c7f68027f5d8e475dfc34a6f of model google/gemma-2-9b-it
2024-08-08T14:14:41.433827Z INFO text_generation_router: router/src/main.rs:357: Using config Some(Gemma2)
2024-08-08T14:14:41.433853Z WARN text_generation_router: router/src/main.rs:384: Invalid hostname, defaulting to 0.0.0.0
2024-08-08T14:14:41.439120Z INFO text_generation_router::server: router/src/server.rs:1572: Warming up model
2024-08-08T14:14:42.767141Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
2024-08-08T14:14:42.772236Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 125, in Warmup
max_supported_total_tokens = self.model.warmup(batch)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1196, in warmup
self.cuda_graph_warmup(bs, max_s, max_bt)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1051, in cuda_graph_warmup
self.model.forward(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_gemma2_modeling.py", line 490, in forward
hidden_states = self.model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_gemma2_modeling.py", line 427, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_gemma2_modeling.py", line 355, in forward
attn_output = self.self_attn(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_gemma2_modeling.py", line 254, in forward
attn_output = paged_attention(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/attention/cuda.py", line 115, in paged_attention
raise RuntimeError("Paged attention doesn't support softcapping")
RuntimeError: Paged attention doesn't support softcapping
2024-08-08T14:14:42.995889Z ERROR warmup{max_input_length=4095 max_prefill_tokens=4145 max_total_tokens=4096 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
Error: WebServer(Warmup(Generation("CANCELLED")))
2024-08-08T14:14:43.119606Z ERROR text_generation_launcher: Webserver Crashed
2024-08-08T14:14:43.119690Z INFO text_generation_launcher: Shutting down shards
Ah, it seems this version doesn't have softcapping support yet (https://github.com/huggingface/text-generation-inference/pull/2273). I'd recommend using the latest TGI version.
Would that work for you?
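For context, the softcapping mentioned above is Gemma 2's attention logit soft-capping, which the paged-attention path rejects in the traceback. A minimal sketch of the math (illustrative only, not TGI's actual kernel; the cap value of 50.0 is the `attn_logit_softcapping` from the Gemma 2 config):

```python
# Illustrative sketch of Gemma 2's attention logit soft-capping, the feature
# that paged attention rejects in the traceback above. Not TGI's actual kernel.
import torch

def softcap(scores: torch.Tensor, cap: float = 50.0) -> torch.Tensor:
    # Squash raw attention logits into (-cap, cap) before the softmax;
    # `cap` corresponds to `attn_logit_softcapping` in the Gemma 2 config.
    return cap * torch.tanh(scores / cap)

print(softcap(torch.tensor([1.0, 10.0, 100.0])))  # large logits are compressed to just under 50
```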
Can you take a look at #2763?