
Parallel requests are failing (index 1 is out of bounds for dimension 0 with size 1)

Open borisrevzin opened this issue 1 year ago • 5 comments

System Info

Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: a9ea60684b6445b2507e147c6aeed0edb0b25eb7
Docker label: N/A

nvidia-smi: Tue Jan 30 02:57:01 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2      |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 PCIe               On  | 00000000:17:00.0 Off |                  Off |
| N/A   46C    P0              80W / 350W | 62254MiB /  81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 PCIe               On  | 00000000:2A:00.0 Off |                  Off |
| N/A   37C    P0              43W / 350W |     7MiB /  81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 PCIe               On  | 00000000:3D:00.0 Off |                  Off |
| N/A   56C    P0              98W / 350W |   623MiB /  81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 PCIe               On  | 00000000:63:00.0 Off |                  Off |
| N/A   38C    P0              76W / 350W |   623MiB /  81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 PCIe               On  | 00000000:AB:00.0 Off |                    0 |
| N/A   40C    P0              52W / 350W |     7MiB /  81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 PCIe               On  | 00000000:BD:00.0 Off |                    0 |
| N/A   42C    P0              75W / 350W |   623MiB /  81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 PCIe               On  | 00000000:CF:00.0 Off |                    1 |
| N/A   39C    P0              50W / 350W |     7MiB /  81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 PCIe               On  | 00000000:E1:00.0 Off |                    0 |
| N/A   51C    P0              96W / 350W |   623MiB /  81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=========================================================================================|
|    0   N/A  N/A      45620     C   python                                     31366MiB |
|    0   N/A  N/A     121941     C   python                                     30870MiB |
+---------------------------------------------------------------------------------------+

OS version: CentOS Linux release 7.9
Rust version: cargo 1.75.0 (1d8b05cdd 2023-11-20)
Model being used: octocoder
Deployment specificities: N/A

Information

  • [ ] Docker
  • [X] The CLI directly (Docker was also tested and works as expected)

Tasks

  • [X] An officially supported command
  • [ ] My own modifications

Reproduction

We cannot use the launcher alone; we must run the server and the router separately, since we compile the server into an executable.

  1. Invoking the server:
     cd /text-generation-inference-latest/
     python ./server/text_generation_server/cli.py serve --no-sharded bigcode/octocoder
  2. Invoking the router:
     cd ./router
     cargo run -- --port=8283 --tokenizer-name=bigcode/octocoder
  3. Inference (a Python sketch of the concurrent case follows this list):
     A. Providing a single inference request works as expected:
        curl -i -H "Accept: application/json" -H "Content-Type: application/json" -X POST -d '{"inputs":"<fim_prefix><fim_suffix><fim_middle>","parameters":{"stop":["endmodule"],"temperature":1.0,"top_p":0.5,"top_k":40,"do_sample":true,"num_beams":8,"num_beam_groups":4,"repetition_penalty":1.1,"max_new_tokens":500,"min_new_tokens":3,"num_return_sequences":3}}' http://localhost:8283/generate
     B. Providing 2 inference requests does not work, see the error below:
        curl -i -H "Accept: application/json" -H "Content-Type: application/json" -X POST -d '{"inputs":"<fim_prefix><fim_suffix><fim_middle>","parameters":{"stop":["endmodule"],"temperature":1.0,"top_p":0.5,"top_k":40,"do_sample":true,"num_beams":8,"num_beam_groups":4,"repetition_penalty":1.1,"max_new_tokens":500,"min_new_tokens":3,"num_return_sequences":3}}' http://localhost:8283/generate &
        curl -i -H "Accept: application/json" -H "Content-Type: application/json" -X POST -d '{"inputs":"<fim_prefix><fim_suffix><fim_middle>","parameters":{"stop":["endmodule"],"temperature":1.0,"top_p":0.5,"top_k":40,"do_sample":true,"num_beams":8,"num_beam_groups":4,"repetition_penalty":1.1,"max_new_tokens":500,"min_new_tokens":3,"num_return_sequences":3}}' http://localhost:8283/generate &
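For convenience, the same two concurrent requests can also be fired from Python instead of backgrounded curl processes. This is a minimal sketch that mirrors the payload and URL from the curl commands above; everything else in it (thread pool size, error handling) is only illustrative:

    import json
    import urllib.request
    import urllib.error
    from concurrent.futures import ThreadPoolExecutor

    URL = "http://localhost:8283/generate"
    PAYLOAD = {
        "inputs": "<fim_prefix><fim_suffix><fim_middle>",
        "parameters": {
            "stop": ["endmodule"], "temperature": 1.0, "top_p": 0.5, "top_k": 40,
            "do_sample": True, "num_beams": 8, "num_beam_groups": 4,
            "repetition_penalty": 1.1, "max_new_tokens": 500,
            "min_new_tokens": 3, "num_return_sequences": 3,
        },
    }

    def post(_):
        # POST one generate request and return the status code plus body,
        # including the 424 error body when the batch fails.
        req = urllib.request.Request(
            URL,
            data=json.dumps(PAYLOAD).encode(),
            headers={"Content-Type": "application/json", "Accept": "application/json"},
        )
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.status, resp.read().decode()
        except urllib.error.HTTPError as e:
            return e.code, e.read().decode()

    # Two requests in flight at the same time are enough to trigger the error.
    with ThreadPoolExecutor(max_workers=2) as pool:
        for status, body in pool.map(post, range(2)):
            print(status, body)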

The error and stack trace:

Method Decode encountered an error.
Traceback (most recent call last):
  File "/server/text_generation_server/cli.py", line 331, in <module>
    app()
  File "~/.local/lib/python3.11/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/anaconda3/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "~/.local/lib/python3.11/site-packages/typer/core.py", line 778, in main
    return _main(
  File "~/.local/lib/python3.11/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/anaconda3/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/anaconda3/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/anaconda3/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "~/.local/lib/python3.11/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/server/text_generation_server/cli.py", line 89, in serve
    server.serve(
  File "/server/text_generation_server/server.py", line 235, in serve
    asyncio.run(
  File "/anaconda3/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
  File "/anaconda3/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/anaconda3/lib/python3.11/asyncio/base_events.py", line 640, in run_until_complete
    self.run_forever()
  File "/anaconda3/lib/python3.11/asyncio/base_events.py", line 607, in run_forever
    self._run_once()
  File "/anaconda3/lib/python3.11/asyncio/base_events.py", line 1922, in _run_once
    handle._run()
  File "/anaconda3/lib/python3.11/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/anaconda3/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
  File "/server/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/anaconda3/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/anaconda3/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/server/text_generation_server/server.py", line 146, in Decode
    batch = self.model.batch_type.concatenate(batches)
  File "/anaconda3/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
  File "/server/text_generation_server/models/causal_lm.py", line 381, in concatenate
    _, num_heads, padded_sequence_length, head_dim = first_past_kvs[0][1].shape
IndexError: index 1 is out of bounds for dimension 0 with size 1

2024-01-30T12:25:17.836897Z ERROR batch{batch_size=2}:decode:decode{size=2}:decode{size=2}: text_generation_client: router/client/src/lib.rs:33: Server error: index 1 is out of bounds for dimension 0 with size 1
2024-01-30T12:25:17.838231Z ERROR generate{parameters=GenerateParameters { best_of: None, temperature: Some(1.0), repetition_penalty: Some(1.1), top_k: Some(40), top_p: Some(0.5), typical_p: None, do_sample: true, max_new_tokens: Some(500), return_full_text: None, stop: ["endmodule"], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None }}:generate:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:682: Request failed during generation: Server error: index 1 is out of bounds for dimension 0 with size 1
2024-01-30T12:25:17.838316Z ERROR generate{parameters=GenerateParameters { best_of: None, temperature: Some(1.0), repetition_penalty: Some(1.1), top_k: Some(40), top_p: Some(0.5), typical_p: None, do_sample: true, max_new_tokens: Some(500), return_full_text: None, stop: ["endmodule"], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None }}:generate:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:682: Request failed during generation: Server error: index 1 is out of bounds for dimension 0 with size 1
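One plausible reading of the failing line, not a confirmed diagnosis: the concatenate code in causal_lm.py seems to expect the standard transformers cache layout, where past_key_values[layer] is a (key, value) pair of tensors shaped [batch, num_heads, seq_len, head_dim]. GPT-BigCode models such as octocoder use multi-query attention and, at least in some transformers versions, return a single fused tensor per layer, so indexing [0][1] hits the batch dimension (size 1 for each of the two single-request batches being merged) and raises exactly this IndexError. A minimal sketch of the two layouts, with made-up shapes:

    import torch

    # Standard layout: one (key, value) pair per layer,
    # each tensor shaped [batch, num_heads, seq_len, head_dim].
    standard_layer = (torch.zeros(1, 16, 8, 64), torch.zeros(1, 16, 8, 64))
    _, num_heads, seq_len, head_dim = standard_layer[1].shape  # unpacks fine

    # Fused multi-query layout: a single tensor per layer,
    # shaped [batch, seq_len, 2 * head_dim] (key and value concatenated).
    fused_layer = torch.zeros(1, 8, 2 * 64)
    try:
        fused_layer[1]  # [1] now indexes the batch dimension, which has size 1
    except IndexError as err:
        print(err)  # index 1 is out of bounds for dimension 0 with size 1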

Expected behavior

One successful response for each parallel inference request. The expected response:

% INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(1.0), repetition_penalty: Some(1.1), top_k: Some(40), top_p: Some(0.5), typical_p: None, do_sample: true, max_new_tokens: Some(500), return_full_text: None, stop: ["endmodule"], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None } total_time="113.751839ms" validation_time="562.887µs" queue_time="158.385µs" inference_time="113.030915ms" time_per_token="22.606183ms" seed="Some(14147947438135544300)"}: text_generation_router::server: router/src/server.rs:298: Success

% INFO generate{parameters=GenerateParameters { best_of: None, temperature: Some(1.0), repetition_penalty: Some(1.1), top_k: Some(40), top_p: Some(0.5), typical_p: None, do_sample: true, max_new_tokens: Some(500), return_full_text: None, stop: ["endmodule"], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None } total_time="113.751839ms" validation_time="562.887µs" queue_time="158.385µs" inference_time="113.030915ms" time_per_token="22.606183ms" seed="Some(14147947438135544300)"}: text_generation_router::server: router/src/server.rs:298: Success

borisrevzin avatar Jan 30 '24 13:01 borisrevzin

Same error encountered when using the starchat-beta model.

nullxjx avatar Feb 29 '24 08:02 nullxjx

Seeing this as well, with bigcode/starcoder, launched with:

text-generation-launcher --model-id "bigcode/starcoder" --quantize "bitsandbytes" -p 21042 --max-input-length 4095 --max-total-tokens 4096

Local Install.

2024-03-06T08:21:57.071276Z ERROR compat_generate{default_return_full_text=true compute_type=Extension(ComputeType("1-nvidia-a100-sxm4-80gb"))}:generate{parameters=GenerateParameters { best_of: Some(2), temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: Some(0.9), typical_p: None, do_sample: true, max_new_tokens: Some(128), return_full_text: Some(false), stop: ["\n\n", "#", "]"], truncate: None, watermark: false, details: true, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:generate_best_of{best_of=2}:generate:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:705: Request failed during generation: Server error: index 1 is out of bounds for dimension 0 with size 1
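For what it's worth, the failing request in my case came in through best_of=2, which expands into two sequences on the server side. A minimal sketch of the request I send, assuming the port from my launcher command above and mirroring the parameters in the log (the prompt string is just a placeholder):

    import json
    import urllib.request
    import urllib.error

    URL = "http://localhost:21042/generate"  # -p 21042 from the launcher invocation
    payload = {
        "inputs": "# placeholder prompt",
        "parameters": {
            "best_of": 2,          # two sequences per request
            "do_sample": True,     # sampling, as in the logged parameters
            "top_p": 0.9,
            "max_new_tokens": 128,
            "stop": ["\n\n", "#", "]"],
            "return_full_text": False,
            "details": True,
        },
    }

    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            print(resp.status, resp.read().decode())
    except urllib.error.HTTPError as e:
        print(e.code, e.read().decode())  # 424 with the generation error above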

kingb12 avatar Mar 06 '24 08:03 kingb12

Cannot reproduce on our end.

Can you reproduce with the Docker image? Environment and dependencies can impact what's happening.

Also, are you all running on main?

Narsil avatar Mar 18 '24 15:03 Narsil

Reproduced with the Docker image:

borisr@host:~ % docker run -it --entrypoint /bin/bash ghcr.io/huggingface/text-generation-inference:1.4
root@1ff8906676d2:/usr/src# python ./server/text_generation_server/cli.py serve --no-sharded bigcode/octocoder &
root@1ff8906676d2:/usr/src# text-generation-router --port=8284 --tokenizer-name=bigcode/octocoder &
root@1ff8906676d2:/usr/src# curl -i -H "Accept: application/json" -H "Content-Type: application/json" -X POST -d '{"inputs":"<fim_prefix><fim_suffix><fim_middle>","parameters":{"stop":["endmodule"],"temperature":1.0,"top_p":0.5,"top_k":40,"do_sample":true,"num_beams":8,"num_beam_groups":4,"repetition_penalty":1.1,"max_new_tokens":500,"min_new_tokens":3,"num_return_sequences":3}}' http://localhost:8284/generate &
[3] 2510
root@1ff8906676d2:/usr/src# curl -i -H "Accept: application/json" -H "Content-Type: application/json" -X POST -d '{"inputs":"<fim_prefix><fim_suffix><fim_middle>","parameters":{"stop":["endmodule"],"temperature":1.0,"top_p":0.5,"top_k":40,"do_sample":true,"num_beams":8,"num_beam_groups":4,"repetition_penalty":1.1,"max_new_tokens":500,"min_new_tokens":3,"num_return_sequences":3}}' http://localhost:8284/generate &
[4] 2512
root@1ff8906676d2:/usr/src# curl -i -H "Accept: application/json" -H "Content-Type: application/json" -X POST -d '{"inputs":"<fim_prefix><fim_suffix><fim_middle>","parameters":{"stop":["endmodule"],"temperature":1.0,"top_p":0.5,"top_k":40,"do_sample":true,"num_beams":8,"num_beam_groups":4,"repetition_penalty":1.1,"max_new_tokens":500,"min_new_tokens":3,"num_return_sequences":3}}' http://localhost:8284/generate &
[5] 2514
root@1ff8906676d2:/usr/src# curl -i -H "Accept: application/json" -H "Content-Type: application/json" -X POST -d '{"inputs":"<fim_prefix><fim_suffix><fim_middle>","parameters":{"stop":["endmodule"],"temperature":1.0,"top_p":0.5,"top_k":40,"do_sample":true,"num_beams":8,"num_beam_groups":4,"repetition_penalty":1.1,"max_new_tokens":500,"min_new_tokens":3,"num_return_sequences":3}}' http://localhost:8284/generate &

The stack trace:

Traceback (most recent call last):
  File "/usr/src/./server/text_generation_server/cli.py", line 331, in <module>
    app()
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/usr/src/./server/text_generation_server/cli.py", line 89, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 235, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 146, in Decode
    batch = self.model.batch_type.concatenate(batches)
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/causal_lm.py", line 381, in concatenate
    _, num_heads, padded_sequence_length, head_dim = first_past_kvs[0][1].shape
IndexError: index 1 is out of bounds for dimension 0 with size 1

2024-03-18T22:23:05.247864Z ERROR batch{batch_size=4}:decode:decode{size=4}:decode{size=4}: text_generation_client: router/client/src/lib.rs:33: Server error: index 1 is out of bounds for dimension 0 with size 1
2024-03-18T22:23:05.249042Z ERROR generate{parameters=GenerateParameters { best_of: None, temperature: Some(1.0), repetition_penalty: Some(1.1), top_k: Some(40), top_p: Some(0.5), typical_p: None, do_sample: true, max_new_tokens: Some(500), return_full_text: None, stop: ["endmodule"], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None }}:generate:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:682: Request failed during generation: Server error: index 1 is out of bounds for dimension 0 with size 1
2024-03-18T22:23:05.249118Z ERROR generate{parameters=GenerateParameters { best_of: None, temperature: Some(1.0), repetition_penalty: Some(1.1), top_k: Some(40), top_p: Some(0.5), typical_p: None, do_sample: true, max_new_tokens: Some(500), return_full_text: None, stop: ["endmodule"], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None }}:generate:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:682: Request failed during generation: Server error: index 1 is out of bounds for dimension 0 with size 1
2024-03-18T22:23:05.249151Z ERROR generate{parameters=GenerateParameters { best_of: None, temperature: Some(1.0), repetition_penalty: Some(1.1), top_k: Some(40), top_p: Some(0.5), typical_p: None, do_sample: true, max_new_tokens: Some(500), return_full_text: None, stop: ["endmodule"], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None }}:generate:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:682: Request failed during generation: Server error: index 1 is out of bounds for dimension 0 with size 1
2024-03-18T22:23:05.249188Z ERROR generate{parameters=GenerateParameters { best_of: None, temperature: Some(1.0), repetition_penalty: Some(1.1), top_k: Some(40), top_p: Some(0.5), typical_p: None, do_sample: true, max_new_tokens: Some(500), return_full_text: None, stop: ["endmodule"], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None }}:generate:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:682: Request failed during generation: Server error: index 1 is out of bounds for dimension 0 with size 1

HTTP/1.1 424 Failed Dependency
HTTP/1.1 424 Failed Dependency
content-type: application/json
content-length: 138
content-type: application/json
HTTP/1.1 424 Failed Dependency
HTTP/1.1 424 Failed Dependency
access-control-allow-origin: *
content-type: application/json
content-length: 138
content-type: application/json
vary: origin
content-length: 138
content-length: 138
access-control-allow-origin: *
vary: origin
vary: access-control-request-method
access-control-allow-origin: *
vary: access-control-request-method
access-control-allow-origin: *
vary: origin
vary: access-control-request-method
vary: access-control-request-headers
vary: origin
vary: access-control-request-headers
vary: access-control-request-method
vary: access-control-request-headers
date: Mon, 18 Mar 2024 22:23:05 GMT

{"error":"Request failed during generation: Server error: index 1 is out of bounds for dimension 0 with size 1","error_type":"generation"}date: Mon, 18 Mar 2024 22:23:05 GMT

borisrevzin avatar Mar 18 '24 22:03 borisrevzin

I still cannot reproduce.

Can you try upgrading to the latest version, 1.4.5? Also, the error occurs in CausalLM, which is not supposed to happen: this model should be using the flash attention kernels. Something is preventing flash attention from working, which is quite odd on an H100.
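A quick way to check is to try importing the kernels with the same Python environment that runs the server; a rough sketch (the only assumption is that the flash-attn package is what is failing to load):

    import torch

    print("torch:", torch.__version__, "CUDA available:", torch.cuda.is_available())
    try:
        import flash_attn
        print("flash_attn:", getattr(flash_attn, "__version__", "unknown"))
    except Exception as exc:  # ImportError or a CUDA/driver load error
        print("flash_attn import failed:", exc)

If the import fails, the server falls back to the non-flash path, which would explain ending up in causal_lm.py.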

Can you also include the actual logs from startup?

As a side note, why are you launching every process independently instead of just using the regular CLI?

Narsil avatar Apr 05 '24 07:04 Narsil

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar May 06 '24 01:05 github-actions[bot]