text-generation-inference
Server error: transport error
System Info
We are deploying meta-llama/Meta-Llama-3.1-70B-Instruct with FP8 quantization. Everything works perfectly for hours, until the server crashes with the following error:
2024-10-01T07:43:22.055987Z ERROR batch{batch_size=1}:prefill:prefill{id=290 size=1}:prefill{id=290 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: transport error
2024-10-01T07:43:22.079235Z ERROR batch{batch_size=1}:prefill:prefill{id=290 size=1}:prefill{id=290 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: transport error
2024-10-01T07:43:22.079640Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(290)}:clear_cache{batch_id=Some(290)}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.079692Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(290)}:clear_cache{batch_id=Some(290)}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.079731Z ERROR completions:generate:generate_stream:schedule:infer:send_error: text_generation_router_v3::backend: backends/v3/src/backend.rs:488: Request failed during generation: Server error: transport error
2024-10-01T07:43:22.091302Z INFO text_generation_router_v3::radix: backends/v3/src/radix.rs:108: Prefix 0 - Suffix 7802
2024-10-01T07:43:22.091679Z ERROR batch{batch_size=1}:prefill:prefill{id=291 size=1}:prefill{id=291 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.091729Z ERROR batch{batch_size=1}:prefill:prefill{id=291 size=1}:prefill{id=291 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.091863Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(291)}:clear_cache{batch_id=Some(291)}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.091975Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(291)}:clear_cache{batch_id=Some(291)}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.092026Z ERROR completions:generate:generate_stream:schedule:infer:send_error: text_generation_router_v3::backend: backends/v3/src/backend.rs:488: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.099645Z INFO text_generation_router_v3::radix: backends/v3/src/radix.rs:108: Prefix 0 - Suffix 7802
2024-10-01T07:43:22.100104Z ERROR batch{batch_size=1}:prefill:prefill{id=292 size=1}:prefill{id=292 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.100184Z ERROR batch{batch_size=1}:prefill:prefill{id=292 size=1}:prefill{id=292 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.100265Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(292)}:clear_cache{batch_id=Some(292)}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.100390Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(292)}:clear_cache{batch_id=Some(292)}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.100446Z ERROR completions:generate:generate_stream:schedule:infer:send_error: text_generation_router_v3::backend: backends/v3/src/backend.rs:488: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.107356Z INFO text_generation_router_v3::radix: backends/v3/src/radix.rs:108: Prefix 0 - Suffix 7802
2024-10-01T07:43:22.107645Z ERROR batch{batch_size=1}:prefill:prefill{id=293 size=1}:prefill{id=293 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.107707Z ERROR batch{batch_size=1}:prefill:prefill{id=293 size=1}:prefill{id=293 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.107788Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(293)}:clear_cache{batch_id=Some(293)}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.107826Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(293)}:clear_cache{batch_id=Some(293)}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.107851Z ERROR completions:generate:generate_stream:schedule:infer:send_error: text_generation_router_v3::backend: backends/v3/src/backend.rs:488: Request failed during generation: Server error: error trying to connect: Connection refused (os error 111)
2024-10-01T07:43:22.342814Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2024-09-30 10:33:54.667 | INFO | text_generation_server.utils.import_utils:torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
  @custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
  @custom_bwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
  @custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
  @custom_bwd
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1727692497.984239 79 config.cc:230] gRPC experiments enabled: call_status_override_on_cancellation, event_engine_dns, event_engine_listener, http2_stats_fix, monitoring_experiment, pick_first_new, trace_record_callops, work_serializer_clears_time_cache
[rank0]:[E1001 07:43:21.006829576 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1720538435607/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d56fc2d2f86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7d56fc281d10 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7d56fc3aef08 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7d56ac5eabc6 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7d56ac5efde0 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7d56ac5f6a9a in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7d56ac5f8edc in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7:
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1720538435607/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d56fc2d2f86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7d56fc281d10 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7d56fc3aef08 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7d56ac5eabc6 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7d56ac5efde0 in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7d56ac5f6a9a in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7d56ac5f8edc in /opt/conda/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7:
Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1720538435607/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d56fc2d2f86 in /opt/conda/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1:
The error happens when using Docker, with the following command:
docker run --restart always --gpus '"device=0,1"' -d -e HUGGING_FACE_HUB_TOKEN="" --shm-size 10g -p 10000:80 -v /shared_models:/huggingface/cache ghcr.io/huggingface/text-generation-inference:2.3.0 --model-id meta-llama/Meta-Llama-3.1-70B-Instruct --max-input-length 7800 --max-total-tokens 8000 --num-shard 2 --huggingface-hub-cache /huggingface/cache/hub --quantize fp8
We are using 2x NVIDIA H100 96GB PCIe GPUs.
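Since the shard log itself suggests passing CUDA_LAUNCH_BLOCKING=1, a debugging variant of the same command can be run with synchronous CUDA launches and NCCL debug logging enabled. The two extra -e flags below are standard PyTorch/NCCL environment variables rather than TGI options, and are only a suggestion for narrowing down the failing kernel (they will reduce throughput while enabled):
docker run --restart always --gpus '"device=0,1"' -d -e HUGGING_FACE_HUB_TOKEN="" -e CUDA_LAUNCH_BLOCKING=1 -e NCCL_DEBUG=INFO --shm-size 10g -p 10000:80 -v /shared_models:/huggingface/cache ghcr.io/huggingface/text-generation-inference:2.3.0 --model-id meta-llama/Meta-Llama-3.1-70B-Instruct --max-input-length 7800 --max-total-tokens 8000 --num-shard 2 --huggingface-hub-cache /huggingface/cache/hub --quantize fp8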
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
- Run the command:
docker run --restart always --gpus '"device=0,1"' -d -e HUGGING_FACE_HUB_TOKEN="" --shm-size 10g -p 10000:80 -v /shared_models:/huggingface/cache ghcr.io/huggingface/text-generation-inference:2.3.0 --model-id meta-llama/Meta-Llama-3.1-70B-Instruct --max-input-length 7800 --max-total-tokens 8000 --num-shard 2 --huggingface-hub-cache /huggingface/cache/hub --quantize fp8
- Wait for hours of usage until the server crashes (a log-capture sketch follows below).
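Because the crash only shows up after hours of traffic, a minimal watcher like the one below can capture the container logs the moment the router stops answering. This is a sketch, assuming the container is started with --name tgi (not part of the command above) and the port mapping 10000:80; /health is the router's standard health-check route:
while sleep 60; do
  if ! curl -sf http://localhost:10000/health > /dev/null; then
    # health check failed: save the tail of the container log for the bug report
    docker logs --tail 500 tgi > tgi-crash-$(date +%s).log 2>&1
    break
  fi
done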
Expected behavior
The server should keep serving requests instead of crashing.