text-generation-inference
Server error when running without GPUs: "attention_scores_2d must be a CUDA tensor"
System Info
Deploying the server as a Docker image on a machine without GPUs. Invoking the generation endpoint produces this error:

{"error":"Request failed during generation: Server error: attention_scores_2d must be a CUDA tensor","error_type":"generation"}

Deployed as a Docker image (tested with several models):

model=${1:-bigscience/bloom-560m}
num_shard=2
volume=/data/hf
docker run --name hf-server -d --shm-size 1g -p 80:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.7 --model-id $model --num-shard $num_shard

curl 127.0.0.1:80/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' -H 'Content-Type: application/json'

Response:

{"error":"Request failed during generation: Server error: attention_scores_2d must be a CUDA tensor","error_type":"generation"}
Logs:
2023-05-29T17:09:02.268570Z ERROR shard-manager: text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 157, in serve
asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.9/site-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/interceptor.py", line 20, in intercept
return await response
File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 63, in Prefill
generations, next_batch = self.model.generate_token(batch)
File "/opt/conda/lib/python3.9/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/causal_lm.py", line 542, in generate_token
logits, past = self.forward(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/gpt_neox.py", line 231, in forward
outputs = self.model.forward(
File "/usr/src/transformers/src/transformers/models/gpt_neox/modeling_gpt_neox.py", line 771, in forward
outputs = self.gpt_neox(
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/src/transformers/src/transformers/models/gpt_neox/modeling_gpt_neox.py", line 653, in forward
outputs = layer(
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/src/transformers/src/transformers/models/gpt_neox/modeling_gpt_neox.py", line 410, in forward
attention_layer_outputs = self.attention(
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/src/transformers/src/transformers/models/gpt_neox/modeling_gpt_neox.py", line 213, in forward
attn_output, present, attn_weights = fused_attention_cuda.forward(
RuntimeError: attention_scores_2d must be a CUDA tensor
rank=1
2023-05-29T17:09:02.268916Z ERROR batch{batch_size=1}:prefill:prefill{id=1 size=1}:prefill{id=1 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: attention_scores_2d must be a CUDA tensor
2023-05-29T17:09:02.270918Z ERROR HTTP request{otel.name=POST /generate http.client_ip= http.flavor=1.1 http.host=127.0.0.1 http.method=POST http.route=/generate http.scheme=HTTP http.target=/generate http.user_agent=curl/7.29.0 otel.kind=server trace_id=28d951851ea1436f963a8e4fc9ba754d}:generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: 17, return_full_text: None, stop: [], truncate: None, watermark: false, details: false, seed: None }}:generate{request=GenerateRequest { inputs: "What is Deep Learning?", parameters: GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: 17, return_full_text: None, stop: [], truncate: None, watermark: false, details: false, seed: None } }}:generate_stream{request=GenerateRequest { inputs: "What is Deep Learning?", parameters: GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: 17, return_full_text: None, stop: [], truncate: None, watermark: false, details: false, seed: None } }}:infer:send_error: text_generation_router::infer: router/src/infer.rs:533: Request failed during generation: Server error: attention_scores_2d must be a CUDA tensor
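The root cause is visible in the last frames of the traceback: `fused_attention_cuda.forward` is a custom CUDA kernel that asserts its inputs are CUDA tensors, so on a CPU-only host the prefill call fails. A minimal sketch of the guard pattern that avoids this class of crash (illustrative only — `attention_forward`, `fused_forward`, and `reference_forward` are hypothetical names, not TGI's actual code):

```python
def attention_forward(scores, fused_forward, reference_forward):
    """Dispatch to a fused CUDA kernel only when the input lives on a GPU.

    `fused_forward` and `reference_forward` are hypothetical callables
    standing in for a custom CUDA kernel and a portable fallback.
    """
    if getattr(scores, "is_cuda", False):
        # Custom kernel path: valid only for CUDA tensors.
        return fused_forward(scores)
    # CPU (or unknown-device) path: plain implementation, works anywhere.
    return reference_forward(scores)
```

Without such a guard (or a launch-time switch like `--disable-custom-kernels`), the fused kernel is reached unconditionally and raises the `RuntimeError` shown above.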
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
Deploy the server:

model=bigscience/bloom-560m
num_shard=2
volume=/data/hf
docker run --name hf-server -d --shm-size 1g -p 80:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.7 --model-id $model --num-shard $num_shard

Call the endpoint:

curl 127.0.0.1:80/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' -H 'Content-Type: application/json'
Expected behavior
Expect a 200 OK status and a JSON response.
Fixed by adding the --disable-custom-kernels option:

docker run --name hf-server -d --shm-size 1g -p 80:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.7 --model-id $model --num-shard $num_shard --disable-custom-kernels

See the launcher help for --disable-custom-kernels: "For some models (like bloom), text-generation-inference implemented custom cuda kernels to speed up inference. Those kernels were only tested on A100. Use this flag to disable them if you're running on different hardware and encounter issues."
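When the same launch script runs on mixed hardware, the workaround can be automated: probe for a GPU and append the flag only when none is found. A hedged sketch (the `nvidia-smi` probe is an assumption about the host; adapt to your environment):

```python
import shutil
import subprocess

def needs_disable_custom_kernels() -> bool:
    """Return True when no usable NVIDIA GPU is detected, i.e. when
    --disable-custom-kernels should be appended to the docker run line.

    Probing via nvidia-smi is an assumption; other checks (e.g. the
    presence of /dev/nvidia0) would work equally well.
    """
    if shutil.which("nvidia-smi") is None:
        return True  # driver tooling absent: assume no GPU
    try:
        # nvidia-smi exits non-zero when no GPU/driver is usable
        return subprocess.run(["nvidia-smi"], capture_output=True).returncode != 0
    except OSError:
        return True

extra_flags = "--disable-custom-kernels" if needs_disable_custom_kernels() else ""
```

`extra_flags` can then be interpolated into the `docker run ... --model-id $model --num-shard $num_shard` invocation shown above.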