text-generation-inference
Server error: transport error
System Info
We are deploying meta-llama/Meta-Llama-3.1-70B-Instruct with FP8 quantization. Everything works perfectly for hours, until the server crashes with the following error:
2024-10-01T07:43:22.055987Z ERROR batch{batch_size=1}:prefill:prefill{id=290 size=1}:prefill{id=290 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: transport error
2024-10-01T07:43:22.079235Z ERROR batch{batch_size=1}:prefill:prefill{id=290 size=1}:prefill{id=290 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: transport error
2024-10-01T07:43:22.079640Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(290)}:clear_cache{batch_id=Some(290)}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.079692Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(290)}:clear_cache{batch_id=Some(290)}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.079731Z ERROR completions:generate:generate_stream:schedule:infer:send_error: text_generation_router_v3::backend: backends/v3/src/backend.rs:488: Request failed during generation: Server error: transport error
2024-10-01T07:43:22.091302Z INFO text_generation_router_v3::radix: backends/v3/src/radix.rs:108: Prefix 0 - Suffix 7802
2024-10-01T07:43:22.091679Z ERROR batch{batch_size=1}:prefill:prefill{id=291 size=1}:prefill{id=291 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.091729Z ERROR batch{batch_size=1}:prefill:prefill{id=291 size=1}:prefill{id=291 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.091863Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(291)}:clear_cache{batch_id=Some(291)}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.091975Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(291)}:clear_cache{batch_id=Some(291)}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.092026Z ERROR completions:generate:generate_stream:schedule:infer:send_error: text_generation_router_v3::backend: backends/v3/src/backend.rs:488: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.099645Z INFO text_generation_router_v3::radix: backends/v3/src/radix.rs:108: Prefix 0 - Suffix 7802
2024-10-01T07:43:22.100104Z ERROR batch{batch_size=1}:prefill:prefill{id=292 size=1}:prefill{id=292 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.100184Z ERROR batch{batch_size=1}:prefill:prefill{id=292 size=1}:prefill{id=292 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.100265Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(292)}:clear_cache{batch_id=Some(292)}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.100390Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(292)}:clear_cache{batch_id=Some(292)}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.100446Z ERROR completions:generate:generate_stream:schedule:infer:send_error: text_generation_router_v3::backend: backends/v3/src/backend.rs:488: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.107356Z INFO text_generation_router_v3::radix: backends/v3/src/radix.rs:108: Prefix 0 - Suffix 7802
2024-10-01T07:43:22.107645Z ERROR batch{batch_size=1}:prefill:prefill{id=293 size=1}:prefill{id=293 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.107707Z ERROR batch{batch_size=1}:prefill:prefill{id=293 size=1}:prefill{id=293 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.107788Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(293)}:clear_cache{batch_id=Some(293)}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.107826Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(293)}:clear_cache{batch_id=Some(293)}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.107851Z ERROR completions:generate:generate_stream:schedule:infer:send_error: text_generation_router_v3::backend: backends/v3/src/backend.rs:488: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.342814Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2024-09-30 10:33:54.667 | INFO | text_generation_server.utils.import_utils:torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
  @custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
  @custom_bwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
  @custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
  @custom_bwd
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1727692497.984239 79 config.cc:230] gRPC experiments enabled: call_status_override_on_cancellation, event_engine_dns, event_engine_listener, http2_stats_fix, monitoring_experiment, pick_first_new, trace_record_callops, work_serializer_clears_time_cache
[rank0]:[E1001 07:43:21.006829576 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1720538435607/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d56fc2d2f86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7d56fc281d10 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7d56fc3aef08 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7d56ac5eabc6 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7d56ac5efde0 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7d56ac5f6a9a in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7d56ac5f8edc in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7:
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1720538435607/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d56fc2d2f86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7d56fc281d10 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7d56fc3aef08 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7d56ac5eabc6 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7d56ac5efde0 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7d56ac5f6a9a in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7d56ac5f8edc in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7:
Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1720538435607/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d56fc2d2f86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1:
The error happens when using Docker, with the following command:
docker run --restart always --gpus '"device=0,1"' -d -e HUGGING_FACE_HUB_TOKEN="" --shm-size 10g -p 10000:80 -v /shared_models:/huggingface/cache ghcr.io/huggingface/text-generation-inference:2.3.0 --model-id meta-llama/Meta-Llama-3.1-70B-Instruct --max-input-length 7800 --max-total-tokens 8000 --num-shard 2 --huggingface-hub-cache /huggingface/cache/hub --quantize fp8
We are using 2x NVIDIA H100 96GB PCIe GPUs.
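Since the shard log itself suggests passing CUDA_LAUNCH_BLOCKING=1, a debugging variant of the same command can be run with synchronous CUDA launches and NCCL debug logging enabled. The two extra -e flags below are standard PyTorch/NCCL environment variables rather than TGI options, and are only a suggestion for narrowing down the failing kernel (they will reduce throughput while enabled):
docker run --restart always --gpus '"device=0,1"' -d -e HUGGING_FACE_HUB_TOKEN="" -e CUDA_LAUNCH_BLOCKING=1 -e NCCL_DEBUG=INFO --shm-size 10g -p 10000:80 -v /shared_models:/huggingface/cache ghcr.io/huggingface/text-generation-inference:2.3.0 --model-id meta-llama/Meta-Llama-3.1-70B-Instruct --max-input-length 7800 --max-total-tokens 8000 --num-shard 2 --huggingface-hub-cache /huggingface/cache/hub --quantize fp8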
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
- Run the command:
docker run --restart always --gpus '"device=0,1"' -d -e HUGGING_FACE_HUB_TOKEN="" --shm-size 10g -p 10000:80 -v /shared_models:/huggingface/cache ghcr.io/huggingface/text-generation-inference:2.3.0 --model-id meta-llama/Meta-Llama-3.1-70B-Instruct --max-input-length 7800 --max-total-tokens 8000 --num-shard 2 --huggingface-hub-cache /huggingface/cache/hub --quantize fp8
- Wait for hours of usage until the server crashes (a log-capture sketch follows below).
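Because the crash only shows up after hours of traffic, a minimal watcher like the one below can capture the container logs the moment the router stops answering. This is a sketch, assuming the container is started with --name tgi (not part of the command above) and the port mapping 10000:80; /health is the router's standard health-check route:
while sleep 60; do
  if ! curl -sf http://localhost:10000/health > /dev/null; then
    # health check failed: save the tail of the container log for the bug report
    docker logs --tail 500 tgi > tgi-crash-$(date +%s).log 2>&1
    break
  fi
done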
Expected behavior
The server should keep serving requests instead of crashing.