
gemma-7b warmup encountered an error

Open Amanda-Barbara opened this issue 1 year ago • 2 comments

System Info

Hi, I encountered a warmup error when using the newest main branch to build and start the gemma-7b model. The error looks like this:

    Traceback (most recent call last):
      File "/usr/local//bin/text-generation-server", line 8, in <module>
        sys.exit(app())
      File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in __call__
        return get_command(self)(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
        return self.main(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 778, in main
        return _main(
      File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main
        rv = self.invoke(ctx)
      File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
        return __callback(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper
        return callback(**use_params)  # type: ignore
      File "/usr/src/text-generation-inference-main/server/text_generation_server/cli.py", line 118, in serve
        server.serve(
      File "/usr/src/text-generation-inference-main/server/text_generation_server/server.py", line 297, in serve
        asyncio.run(
      File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
        return loop.run_until_complete(main)
      File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
        self.run_forever()
      File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
        self._run_once()
      File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
        handle._run()
      File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
        self._context.run(self._callback, *self._args)
      File "/usr/local/lib/python3.10/dist-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
        return await self.intercept(
      File "/usr/src/text-generation-inference-main/server/text_generation_server/interceptor.py", line 21, in intercept
        return await response
      File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
        raise error
      File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
        return await behavior(request_or_iterator, context)
      File "/usr/src/text-generation-inference-main/server/text_generation_server/server.py", line 125, in Warmup
        max_supported_total_tokens = self.model.warmup(batch)
      File "/usr/src/text-generation-inference-main/server/text_generation_server/models/flash_causal_lm.py", line 1096, in warmup
        _, batch, _ = self.generate_token(batch)
      File "/usr/lib/python3.10/contextlib.py", line 79, in inner
        return func(*args, **kwds)
      File "/usr/src/text-generation-inference-main/server/text_generation_server/models/flash_causal_lm.py", line 1371, in generate_token
        out, speculative_logits = self.forward(batch, adapter_data)
      File "/usr/src/text-generation-inference-main/server/text_generation_server/models/flash_causal_lm.py", line 1296, in forward
        logits, speculative_logits = self.model.forward(
      File "/usr/src/text-generation-inference-main/server/text_generation_server/models/custom_modeling/flash_gemma_modeling.py", line 474, in forward
        logits, speculative_logits = self.lm_head(hidden_states)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
        return forward_call(*args, **kwargs)
      File "/usr/src/text-generation-inference-main/server/text_generation_server/layers/speculative.py", line 51, in forward
        logits = self.head(input)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
        return forward_call(*args, **kwargs)
      File "/usr/src/text-generation-inference-main/server/text_generation_server/layers/tensor_parallel.py", line 87, in forward
        return super().forward(input)
      File "/usr/src/text-generation-inference-main/server/text_generation_server/layers/tensor_parallel.py", line 37, in forward
        return self.linear.forward(x)
      File "/usr/src/text-generation-inference-main/server/text_generation_server/layers/linear.py", line 37, in forward
        return F.linear(input, self.weight, self.bias)
    RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)

The launcher then shuts down:

    2024-07-21T12:44:09.788954Z ERROR warmup{max_input_length=4096 max_prefill_tokens=20000 max_total_tokens=8192 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
    Error: WebServer(Warmup(Generation("CANCELLED")))
    2024-07-21T12:44:14.909514Z ERROR text_generation_launcher: Webserver Crashed
    2024-07-21T12:44:14.909530Z INFO text_generation_launcher: Shutting down shards
    2024-07-21T12:44:14.993505Z INFO shard-manager: text_generation_launcher: Terminating shard rank=0
    2024-07-21T12:44:14.993672Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
    2024-07-21T12:44:15.494334Z INFO shard-manager: text_generation_launcher: shard terminated rank=0
    Error: WebserverFailed
    text_generation_launcher exit 1

How can this be solved? Thanks.

Information

  • [ ] Docker
  • [X] The CLI directly

Tasks

  • [X] An officially supported command
  • [ ] My own modifications

Reproduction

    text_generation_launcher_pid=591
    2024-07-21T12:43:52.574813Z INFO text_generation_launcher: Args {
        model_id: "/dataset/model/gemma-7b-it/",
        revision: None,
        validation_workers: 2,
        sharded: None,
        num_shard: Some(1),
        quantize: None,
        speculate: None,
        dtype: None,
        trust_remote_code: false,
        max_concurrent_requests: 5000,
        max_best_of: 1,
        max_stop_sequences: 4,
        max_top_n_tokens: 5,
        max_input_tokens: None,
        max_input_length: Some(4096),
        max_total_tokens: Some(8192),
        waiting_served_ratio: 1.2,
        max_batch_prefill_tokens: Some(20000),
        max_batch_total_tokens: None,
        max_waiting_tokens: 20,
        max_batch_size: None,
        cuda_graphs: None,
        hostname: "chat-tianrui-medusa2-master-0",
        port: 31471,
        shard_uds_path: "/tmp/text-generation-server",
        master_addr: "chat-tianrui-medusa2-master-0",
        master_port: 23456,
        huggingface_hub_cache: None,
        weights_cache_override: None,
        disable_custom_kernels: false,
        cuda_memory_fraction: 0.95,
        rope_scaling: None,
        rope_factor: None,
        json_output: false,
        otlp_endpoint: None,
        otlp_service_name: "text-generation-inference.router",
        cors_allow_origin: [],
        watermark_gamma: None,
        watermark_delta: None,
        ngrok: false,
        ngrok_authtoken: None,
        ngrok_edge: None,
        tokenizer_config_path: None,
        disable_grammar_support: false,
        env: false,
        max_client_batch_size: 4,
        lora_adapters: None,
        disable_usage_stats: false,
        disable_crash_reports: false,
    }

Expected behavior

Expected the inference service to start normally.

Amanda-Barbara avatar Jul 21 '24 12:07 Amanda-Barbara

Hi @Amanda-Barbara, thanks for reporting the issue 👍

This error line:

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)

makes me suspect that there might be something off with the installation 🤔

Could you confirm whether it works when running in a Docker container? For example:

model=google/gemma-7b
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.1.1 --model-id $model
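If Docker isn't convenient, another quick sanity check is to run the same kind of bf16 GEMM that failed in your traceback (`F.linear` in `layers/linear.py`, which dispatches to `cublasGemmEx`) directly in PyTorch. This is just a sketch to isolate the install from TGI itself; the shapes are illustrative, not taken from the model, and it falls back to CPU when no GPU is visible:

```python
# Minimal sanity check for the bf16 linear/GEMM path that raised
# CUBLAS_STATUS_EXECUTION_FAILED in the traceback. On a healthy
# CUDA + cuBLAS install the GPU branch should complete without error.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(4, 256, dtype=torch.bfloat16, device=device)     # stand-in hidden states
w = torch.randn(1024, 256, dtype=torch.bfloat16, device=device)  # stand-in lm_head weight

y = F.linear(x, w)  # same op as layers/linear.py line 37
if device == "cuda":
    torch.cuda.synchronize()  # surface async CUDA errors here, not later

print(y.shape)  # torch.Size([4, 1024])
```

If this snippet fails on GPU too, the problem is in the CUDA/cuBLAS/PyTorch installation rather than in TGI's main branch.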

ErikKaum avatar Jul 22 '24 09:07 ErikKaum

@ErikKaum Hi, I will try it. However, I found that text-generation-inference v2.1.1 starts and runs fine, while the newest main branch reports the above error in the same environment.

Amanda-Barbara avatar Jul 22 '24 14:07 Amanda-Barbara

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Aug 22 '24 01:08 github-actions[bot]