
ERROR shard-manager when running bigcode/starcoder

Open · wpjscc opened this issue 1 year ago • 3 comments

System Info

docker exec -it text-generation-inference text-generation-launcher --env
(base) ➜  huggingface-text-generation-inference docker exec -it 401ba897d58aa498e6fffa0e717144c47fea4cf56c0578fbb4b384b42bcf6040 text-generation-launcher --env
2023-06-03T03:36:08.324157Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: e7248fe90e27c7c8e39dd4cac5874eb9f96ab182
Docker label: sha-e7248fe
nvidia-smi:
Sat Jun  3 03:36:08 2023       
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.8     |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
   |  0%   37C    P8    13W / 310W |    693MiB /  8192MiB |      0%      Default |
   |                               |                      |                  N/A |
   +-------------------------------+----------------------+----------------------+
                                                                                  
   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   +-----------------------------------------------------------------------------+
2023-06-03T03:36:08.324179Z  INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, sharded: None, num_shard: None, quantize: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: true }
(base) ➜  huggingface-text-generation-inference curl 127.0.0.1:8080/info | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   455  100   455    0     0   444k      0 --:--:-- --:--:-- --:--:--  444k
{
  "model_id": "/data/bigcode/starcoder",
  "model_sha": null,
  "model_dtype": "torch.float32",
  "model_device_type": "cpu",
  "model_pipeline_tag": null,
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_length": 1000,
  "max_total_tokens": 1512,
  "waiting_served_ratio": 1.2,
  "max_batch_total_tokens": 32000,
  "max_waiting_tokens": 20,
  "validation_workers": 2,
  "version": "0.8.2",
  "sha": "e7248fe90e27c7c8e39dd4cac5874eb9f96ab182",
  "docker_label": "sha-e7248fe"
}
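
Note that the /info response above reports the model on CPU in float32, even though nvidia-smi inside the container sees the GPU. A minimal check of whether PyTorch itself can initialize CUDA in that container (a sketch, reusing the container id from the docker exec above):

docker exec -it 401ba897d58a python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
# If this prints False, text-generation-inference silently falls back to CPU/float32,
# which would match "model_device_type": "cpu" above.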

Information

  • [x] Docker
  • [ ] The CLI directly

Tasks

  • [x] An officially supported command
  • [ ] My own modifications

Reproduction

1. Ubuntu 20.04

2. Start text-generation-inference with Docker

model=/data/bigcode/starcoder
num_shard=1
volume=$PWD/data
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id $model --num-shard $num_shard --disable-custom-kernels

3. Request with the VS Code extension

4. I get the following errors:

➜  huggingface-text-generation-inference docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id $model --num-shard $num_shard --disable-custom-kernels
2023-06-03T03:33:15.272607Z  INFO text_generation_launcher: Args { model_id: "/data/bigcode/starcoder", revision: None, sharded: None, num_shard: Some(1), quantize: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: true, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: false }
2023-06-03T03:33:15.272886Z  INFO text_generation_launcher: Starting download process.
2023-06-03T03:33:16.389565Z  INFO download: text_generation_launcher: Files are already present on the host. Skipping download.

2023-06-03T03:33:16.775719Z  INFO text_generation_launcher: Successfully downloaded weights.
2023-06-03T03:33:16.776087Z  INFO text_generation_launcher: Starting shard 0
2023-06-03T03:33:26.786743Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-03T03:33:36.797049Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-03T03:33:46.807792Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-03T03:33:56.818618Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-03T03:34:06.830109Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-03T03:34:16.839934Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-03T03:34:26.850552Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-03T03:34:36.861382Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-03T03:34:46.873280Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-03T03:34:56.885746Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-03T03:35:06.896503Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-03T03:35:12.065627Z  INFO shard-manager: text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
 rank=0
2023-06-03T03:35:12.103705Z  INFO text_generation_launcher: Shard 0 ready in 115.326268544s
2023-06-03T03:35:12.191281Z  INFO text_generation_launcher: Starting Webserver
2023-06-03T03:35:12.271308Z  WARN text_generation_router: router/src/main.rs:158: no pipeline tag found for model /data/bigcode/starcoder
2023-06-03T03:35:12.276164Z  INFO text_generation_router: router/src/main.rs:178: Connected
2023-06-03T03:43:43.852322Z ERROR shard-manager: text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
    server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.9/site-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/interceptor.py", line 20, in intercept
    return await response
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 61, in Prefill
    generations, next_batch = self.model.generate_token(batch)
  File "/opt/conda/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/causal_lm.py", line 575, in generate_token
    next_token_id, logprobs = next_token_chooser(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/tokens.py", line 71, in __call__
    scores, next_logprob = self.static_warper(scores)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/logits_process.py", line 47, in __call__
    self.cuda_graph = torch.cuda.CUDAGraph()
RuntimeError: CUDA error: forward compatibility was attempted on non supported HW
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Device-side assertions were explicitly omitted for this error check; the error probably arose while initializing the DSA handlers.
 rank=0
2023-06-03T03:43:43.852597Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: CUDA error: forward compatibility was attempted on non supported HW
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Device-side assertions were explicitly omitted for this error check; the error probably arose while initializing the DSA handlers.
2023-06-03T03:43:43.853127Z ERROR HTTP request{otel.name=POST /generate http.client_ip= http.flavor=1.1 http.host=192.168.1.9:8080 http.method=POST http.route=/generate http.scheme=HTTP http.target=/generate http.user_agent=node-fetch otel.kind=server trace_id=92dbf3a1bfd4c5408c7350b41e793129}:generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.2), repetition_penalty: None, top_k: None, top_p: Some(0.95), typical_p: None, do_sample: true, max_new_tokens: 60, return_full_text: None, stop: ["<|endoftext|>"], truncate: None, watermark: false, details: false, seed: None }}:generate{request=GenerateRequest { inputs: "<?php\n\necho \"hello world\";\n", parameters: GenerateParameters { best_of: None, temperature: Some(0.2), repetition_penalty: None, top_k: None, top_p: Some(0.95), typical_p: None, do_sample: true, max_new_tokens: 60, return_full_text: None, stop: ["<|endoftext|>"], truncate: None, watermark: false, details: false, seed: None } }}:generate_stream{request=GenerateRequest { inputs: "<?php\n\necho \"hello world\";\n", parameters: GenerateParameters { best_of: None, temperature: Some(0.2), repetition_penalty: None, top_k: None, top_p: Some(0.95), typical_p: None, do_sample: true, max_new_tokens: 60, return_full_text: None, stop: ["<|endoftext|>"], truncate: None, watermark: false, details: false, seed: None } }}:infer:send_error: text_generation_router::infer: router/src/infer.rs:533: Request failed during generation: Server error: CUDA error: forward compatibility was attempted on non supported HW
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Device-side assertions were explicitly omitted for this error check; the error probably arose while initializing the DSA handlers.
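
The failing call is torch.cuda.CUDAGraph() in logits_process.py, i.e. the first point where CUDA is actually initialized for sampling. A minimal sketch to isolate it from text-generation-inference (the container id is a placeholder; any CUDA tensor allocation should hit the same initialization path):

docker exec -it <container-id> python -c "import torch; torch.zeros(1, device='cuda')"
# On this setup this likely fails with the same "forward compatibility was
# attempted on non supported HW" error, independent of the server code.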

Expected behavior

Expected the request to complete without errors.

wpjscc avatar Jun 03 '23 03:06 wpjscc

The model loaded on CPU for some reason: "model_device_type": "cpu" in the info output.

Can you run docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id $model --num-shard $num_shard --env directly?

OlivierDehaene avatar Jun 03 '23 09:06 OlivierDehaene

The model loaded on CPU for some reason: "model_device_type": "cpu" in the info output.

Can you run docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id $model --num-shard $num_shard --env directly?

Output:

➜  huggingface-text-generation-inference docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id $model --num-shard $num_shard --env
2023-06-06T04:49:23.305612Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: e7248fe90e27c7c8e39dd4cac5874eb9f96ab182
Docker label: sha-e7248fe
nvidia-smi:
Tue Jun  6 04:49:23 2023
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.8     |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
   |  0%   35C    P8    12W / 310W |    693MiB /  8192MiB |      0%      Default |
   |                               |                      |                  N/A |
   +-------------------------------+----------------------+----------------------+

   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   +-----------------------------------------------------------------------------+
2023-06-06T04:49:23.305628Z  INFO text_generation_launcher: Args { model_id: "/data/bigcode/starcoder", revision: None, sharded: None, num_shard: Some(1), quantize: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: true }
2023-06-06T04:49:23.305682Z  INFO text_generation_launcher: Starting download process.
2023-06-06T04:49:24.863179Z  INFO download: text_generation_launcher: Files are already present on the host. Skipping download.

2023-06-06T04:49:25.208559Z  INFO text_generation_launcher: Successfully downloaded weights.
2023-06-06T04:49:25.208691Z  INFO text_generation_launcher: Starting shard 0
2023-06-06T04:49:35.221147Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-06T04:49:45.232376Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-06T04:49:55.242637Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-06T04:50:05.252233Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-06T04:50:15.263142Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-06T04:50:25.274029Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-06T04:50:35.285107Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-06T04:50:45.296919Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-06T04:50:55.310651Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-06T04:51:05.327063Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-06T04:51:15.338597Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-06T04:51:17.484955Z  INFO shard-manager: text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
 rank=0
2023-06-06T04:51:17.541604Z  INFO text_generation_launcher: Shard 0 ready in 112.33250655s
2023-06-06T04:51:17.620150Z  INFO text_generation_launcher: Starting Webserver
2023-06-06T04:51:17.695192Z  WARN text_generation_router: router/src/main.rs:158: no pipeline tag found for model /data/bigcode/starcoder
2023-06-06T04:51:17.701264Z  INFO text_generation_router: router/src/main.rs:178: Connected
2023-06-06T04:51:27.852421Z ERROR shard-manager: text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
    server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.9/site-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/interceptor.py", line 20, in intercept
    return await response
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 61, in Prefill
    generations, next_batch = self.model.generate_token(batch)
  File "/opt/conda/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/causal_lm.py", line 575, in generate_token
    next_token_id, logprobs = next_token_chooser(
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/tokens.py", line 71, in __call__
    scores, next_logprob = self.static_warper(scores)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/logits_process.py", line 47, in __call__
    self.cuda_graph = torch.cuda.CUDAGraph()
RuntimeError: CUDA error: forward compatibility was attempted on non supported HW
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Device-side assertions were explicitly omitted for this error check; the error probably arose while initializing the DSA handlers.
 rank=0
2023-06-06T04:51:27.852917Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: CUDA error: forward compatibility was attempted on non supported HW
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Device-side assertions were explicitly omitted for this error check; the error probably arose while initializing the DSA handlers.
2023-06-06T04:51:27.854001Z ERROR HTTP request{otel.name=POST /generate http.client_ip= http.flavor=1.1 http.host=10.8.0.9:8080 http.method=POST http.route=/generate http.scheme=HTTP http.target=/generate http.user_agent=node-fetch otel.kind=server trace_id=2db5d9a0888127e8a0c4bcf9c769fc9b}:generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.2), repetition_penalty: None, top_k: None, top_p: Some(0.95), typical_p: None, do_sample: true, max_new_tokens: 60, return_full_text: None, stop: ["<|endoftext|>"], truncate: None, watermark: false, details: false, seed: None }}:generate{request=GenerateRequest { inputs: "<?php\n\necho \"hello world\";\n", parameters: GenerateParameters { best_of: None, temperature: Some(0.2), repetition_penalty: None, top_k: None, top_p: Some(0.95), typical_p: None, do_sample: true, max_new_tokens: 60, return_full_text: None, stop: ["<|endoftext|>"], truncate: None, watermark: false, details: false, seed: None } }}:generate_stream{request=GenerateRequest { inputs: "<?php\n\necho \"hello world\";\n", parameters: GenerateParameters { best_of: None, temperature: Some(0.2), repetition_penalty: None, top_k: None, top_p: Some(0.95), typical_p: None, do_sample: true, max_new_tokens: 60, return_full_text: None, stop: ["<|endoftext|>"], truncate: None, watermark: false, details: false, seed: None } }}:infer:send_error: text_generation_router::infer: router/src/infer.rs:533: Request failed during generation: Server error: CUDA error: forward compatibility was attempted on non supported HW
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Device-side assertions were explicitly omitted for this error check; the error probably arose while initializing the DSA handlers.
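
One way to narrow down a forward-compatibility failure is to compare the CUDA version the host driver supports (11.8 in the nvidia-smi output above) with the CUDA runtime the container's PyTorch build expects; a minimal sketch, assuming the container is still running:

# On the host: the installed driver version (the nvidia-smi header above already reports CUDA Version: 11.8).
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Inside the container: the CUDA runtime PyTorch was built against.
docker exec -it <container-id> python -c "import torch; print(torch.version.cuda)"
# Forward compatibility (running a newer CUDA runtime than the driver supports)
# is generally only available on data-center GPUs; on GeForce cards it fails
# with exactly this error.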

wpjscc avatar Jun 06 '23 04:06 wpjscc

I have the same problem here, did you manage to find a solution?

theoeiferman avatar Nov 16 '23 16:11 theoeiferman

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Jul 31 '24 01:07 github-actions[bot]