gemma-7b warmup encountered an error
## System Info
Hi, I encountered a warmup error when using the newest main branch to compile and start the gemma-7b model. The error looks like this:
```
Traceback (most recent call last):
  File "/usr/local//bin/text-generation-server", line 8, in <module>
  File "/usr/src/text-generation-inference-main/server/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/usr/src/text-generation-inference-main/server/text_generation_server/server.py", line 125, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/usr/src/text-generation-inference-main/server/text_generation_server/models/flash_causal_lm.py", line 1096, in warmup
    _, batch, _ = self.generate_token(batch)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/src/text-generation-inference-main/server/text_generation_server/models/flash_causal_lm.py", line 1371, in generate_token
    out, speculative_logits = self.forward(batch, adapter_data)
  File "/usr/src/text-generation-inference-main/server/text_generation_server/models/flash_causal_lm.py", line 1296, in forward
    logits, speculative_logits = self.model.forward(
  File "/usr/src/text-generation-inference-main/server/text_generation_server/models/custom_modeling/flash_gemma_modeling.py", line 474, in forward
    logits, speculative_logits = self.lm_head(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/src/text-generation-inference-main/server/text_generation_server/layers/speculative.py", line 51, in forward
    logits = self.head(input)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/src/text-generation-inference-main/server/text_generation_server/layers/tensor_parallel.py", line 87, in forward
    return super().forward(input)
  File "/usr/src/text-generation-inference-main/server/text_generation_server/layers/tensor_parallel.py", line 37, in forward
    return self.linear.forward(x)
  File "/usr/src/text-generation-inference-main/server/text_generation_server/layers/linear.py", line 37, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

2024-07-21T12:44:09.788954Z ERROR warmup{max_input_length=4096 max_prefill_tokens=20000 max_total_tokens=8192 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
Error: WebServer(Warmup(Generation("CANCELLED")))
2024-07-21T12:44:14.909514Z ERROR text_generation_launcher: Webserver Crashed
2024-07-21T12:44:14.909530Z  INFO text_generation_launcher: Shutting down shards
2024-07-21T12:44:14.993505Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2024-07-21T12:44:14.993672Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2024-07-21T12:44:15.494334Z  INFO shard-manager: text_generation_launcher: shard terminated rank=0
Error: WebserverFailed
text_generation_launcher exit 1
```

How can I solve this? Thanks.
## Information
- [ ] Docker
- [X] The CLI directly
## Tasks
- [X] An officially supported command
- [ ] My own modifications
## Reproduction

```
text_generation_launcher_pid=591
2024-07-21T12:43:52.574813Z  INFO text_generation_launcher: Args {
    model_id: "/dataset/model/gemma-7b-it/",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: Some(1),
    quantize: None,
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 5000,
    max_best_of: 1,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: Some(4096),
    max_total_tokens: Some(8192),
    waiting_served_ratio: 1.2,
    max_batch_prefill_tokens: Some(20000),
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "chat-tianrui-medusa2-master-0",
    port: 31471,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "chat-tianrui-medusa2-master-0",
    master_port: 23456,
    huggingface_hub_cache: None,
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 0.95,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
    disable_usage_stats: false,
    disable_crash_reports: false,
}
```
## Expected behavior
Expected the inference service to start normally.
Hi @Amanda-Barbara, thanks for reporting the issue 👍
This error line:
```
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
```
makes me suspect that there might be something off with the installation 🤔
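One quick way to test that hypothesis is to run the same bf16 GEMM that failed in the traceback directly, outside of TGI. This is only a sketch and assumes `python` and `torch` come from the same environment the server was started in:

```shell
# Exercise the bf16 matmul path (CUDA_R_16BF in the cuBLAS call above) in
# isolation. If this also fails with CUBLAS_STATUS_EXECUTION_FAILED, the
# problem is in the torch/CUDA installation rather than in TGI itself.
python -c "
import torch
a = torch.randn(128, 128, device='cuda', dtype=torch.bfloat16)
# F.linear computes a @ a.T here, the same op as the failing lm_head call
print(torch.nn.functional.linear(a, a).shape)
"
```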
Could you confirm if it works when running in a docker container? For example running:
```shell
model=google/gemma-7b
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.1.1 --model-id $model
```
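If the container shows the same failure, that would point at the host driver rather than the Python environment. A quick sanity check, assuming the NVIDIA container toolkit is installed on the host:

```shell
# Confirm the driver and GPU are healthy on the host...
nvidia-smi

# ...and that the same GPU is visible from inside the TGI image
# (the nvidia container runtime injects nvidia-smi when --gpus is used).
docker run --gpus all --rm --entrypoint nvidia-smi \
    ghcr.io/huggingface/text-generation-inference:2.1.1
```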
@ErikKaum Hi, I will give it a try. Note, however, that text-generation-server v2.1.1 starts and runs fine, while the newest main branch of text-generation-server reports the above error in the same environment.
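Since 2.1.1 works and main fails in the same environment, bisecting between the two can pinpoint the commit that introduced the regression. A sketch, assuming the release is tagged `v2.1.1` in the repository:

```shell
# Bisect between the known-good release and main; rebuild and relaunch
# the server at each step, then mark the commit good or bad.
git clone https://github.com/huggingface/text-generation-inference
cd text-generation-inference
git bisect start
git bisect bad main
git bisect good v2.1.1   # tag name assumed
# ...build, run warmup with the gemma-7b model, then:
#   git bisect good   # warmup succeeds
#   git bisect bad    # warmup crashes as above
```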
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.