
Error during warmup of JAIS model

Open • AIvashov opened this issue 3 months ago • 4 comments

System Info

I am running on an AWS g5.12xlarge instance (4× NVIDIA A10G GPUs).

Information

  • [X] Docker
  • [ ] The CLI directly

Tasks

  • [ ] An officially supported command
  • [ ] My own modifications

Reproduction

sudo docker run --gpus all --shm-size 64g -p 8080:80 -v $volume:/data  ghcr.io/huggingface/text-generation-inference:1.4 --model-id brainiac-origin/jais-chat-30b-8bit --trust-remote-code --quantize bitsandbytes
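
The $volume variable above follows the TGI README convention; a minimal setup sketch (the path is illustrative, any writable host directory works):

# Share a volume with the container so weights are not re-downloaded each run.
volume=$PWD/data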

Expected behavior

2024-03-25T13:21:54.376266Z  INFO text_generation_launcher: Args { model_id: "brainiac-origin/jais-chat-30b-8bit", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Bitsandbytes), speculate: None, dtype: None, trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, enable_cuda_graphs: false, hostname: "6af26331ef70", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false }
2024-03-25T13:21:54.376302Z  WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model `brainiac-origin/jais-chat-30b-8bit` do not contain malicious code.
2024-03-25T13:21:54.376399Z  INFO download: text_generation_launcher: Starting download process.
2024-03-25T13:21:59.102180Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-03-25T13:21:59.781784Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-03-25T13:21:59.782082Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-03-25T13:22:09.791734Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-25T13:22:19.800415Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-25T13:22:29.810400Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-25T13:22:39.820932Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-03-25T13:22:44.077307Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0

2024-03-25T13:22:44.125779Z  INFO shard-manager: text_generation_launcher: Shard ready in 44.342725238s rank=0
2024-03-25T13:22:44.224503Z  INFO text_generation_launcher: Starting Webserver
2024-03-25T13:22:44.692725Z  INFO text_generation_router: router/src/main.rs:181: Using the Hugging Face API
2024-03-25T13:22:44.692765Z  INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"    
2024-03-25T13:22:45.127789Z  INFO text_generation_router: router/src/main.rs:443: Serving revision 5200b63dc1ac35793b5158fdc2f33445203d8251 of model brainiac-origin/jais-chat-30b-8bit
2024-03-25T13:22:45.127814Z  INFO text_generation_router: router/src/main.rs:242: Using the Hugging Face API to retrieve tokenizer config
2024-03-25T13:22:45.133481Z  INFO text_generation_router: router/src/main.rs:291: Warming up model
2024-03-25T13:22:56.130420Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/logits_process.py", line 60, in __call__
    local_scores = warper(None, local_scores)
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/logits_process.py", line 449, in __call__
    sorted_logits, sorted_indices = torch.sort(scores, descending=False)
RuntimeError: CUDA error: operation not permitted when stream is capturing
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 235, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 95, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/model.py", line 74, in warmup
    self.generate_token(batch)
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/causal_lm.py", line 638, in generate_token
    next_token_id, logprobs = next_token_chooser(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/tokens.py", line 95, in __call__
    scores, next_logprob = self.static_warper(scores)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/logits_process.py", line 57, in __call__
    with torch.cuda.graph(self.cuda_graph, pool=mempool):
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/graphs.py", line 197, in __exit__
    self.cuda_graph.capture_end()
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/graphs.py", line 88, in capture_end
    super().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


2024-03-25T13:22:56.131417Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096 max_total_tokens=2048 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: Unexpected <class 'RuntimeError'>: captures_underway == 0 INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1699449201336/work/c10/cuda/CUDACachingAllocator.cpp":2939, please report a bug to PyTorch. 
Error: Warmup(Generation("Unexpected <class 'RuntimeError'>: captures_underway == 0 INTERNAL ASSERT FAILED at \"/opt/conda/conda-bld/pytorch_1699449201336/work/c10/cuda/CUDACachingAllocator.cpp\":2939, please report a bug to PyTorch. "))
2024-03-25T13:22:56.192490Z ERROR text_generation_launcher: Webserver Crashed
2024-03-25T13:22:56.192518Z  INFO text_generation_launcher: Shutting down shards
2024-03-25T13:22:57.678429Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=0
Error: WebserverFailed

AIvashov avatar Mar 25 '24 13:03 AIvashov

I am seeing the same error with four other models. Any updates on these CUDA and Torch errors?

giyaseddin avatar Mar 25 '24 16:03 giyaseddin

Make sure your GPU drivers are installed properly. In my case, the problem was that the NVIDIA Container Toolkit wasn't configured correctly. A quick sanity check is shown below.
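
A standard way to verify that Docker can see the GPUs (the CUDA image tag is illustrative; any recent nvidia/cuda base tag works):

# If the NVIDIA Container Toolkit is set up correctly, this prints the
# same GPU table as running nvidia-smi directly on the host.
sudo docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi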

giyaseddin avatar Apr 10 '24 22:04 giyaseddin

@AIvashov Text Generation Inference does not support pre-quantized bitsandbytes models. The same problem is discussed here: https://github.com/huggingface/text-generation-inference/issues/1728. Use the original (unquantized) model and it works fine: the --quantize flag tells TGI to quantize the model itself at load time, so it should not be pointed at a checkpoint that is already quantized. See the launcher docs for more information: https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher. A sketch of the corrected launch follows.
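
A sketch of the corrected command, assuming the unquantized source checkpoint is core42/jais-30b-chat-v1 (an assumption; substitute whatever base repo brainiac-origin/jais-chat-30b-8bit was derived from):

# Point --model-id at the unquantized source checkpoint and let TGI
# apply bitsandbytes quantization itself at load time.
sudo docker run --gpus all --shm-size 64g -p 8080:80 -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:1.4 \
  --model-id core42/jais-30b-chat-v1 --trust-remote-code --quantize bitsandbytes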

Mohammad-Faris avatar Apr 14 '24 10:04 Mohammad-Faris