tensorrtllm_backend

Crashes for long context requests

Open · Pernekhan opened this issue 1 year ago • 17 comments

trtllm crashes when I send long-context requests that are within the max-input-length limit.

I believe it happens when the total pending requests reach the max-num-tokens limit. But why isn't it queuing requests instead of crashing?

Here is the crash log:

terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: elts_per_rank % elts_per_thread == 0 (/app/tensorrt_llm/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu:324)
1       0x7fe15c26354a tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7fe15c265741 /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x77a741) [0x7fe15c265741]
3       0x7fe15c3b1955 void tensorrt_llm::kernels::invokeOneOrTwoShotAllReduceKernel<__half>(tensorrt_llm::kernels::AllReduceParams&, tensorrt_llm::kernels::AllReduceStrategyType, CUstream_st*) + 117
4       0x7fe284521b8d tensorrt_llm::plugins::AllreducePlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 973
5       0x7fe117705ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7fe117705ba9]
6       0x7fe1176db6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7fe1176db6af]
7       0x7fe1176dd320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7fe1176dd320]
8       0x7fe15e147024 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 52
9       0x7fe15e14ad61 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(std::map<unsigned long, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > >&) + 1025
10      0x7fe15e14e598 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 3736
11      0x7fe15e11d2f4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 36
12      0x7fe15e12452f tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 287
13      0x7fe4a944f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fe4a944f253]
14      0x7fe4a91dfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fe4a91dfac3]
15      0x7fe4a9271660 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126660) [0x7fe4a9271660]
  what():  [TensorRT-LLM][ERROR] Assertion failed: elts_per_rank % elts_per_thread == 0 (/app/tensorrt_llm/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu:324)
1       0x7fb8a826354a tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7fb8a8265741 /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x77a741) [0x7fb8a8265741]
3       0x7fb8a83b1955 void tensorrt_llm::kernels::invokeOneOrTwoShotAllReduceKernel<__half>(tensorrt_llm::kernels::AllReduceParams&, tensorrt_llm::kernels::AllReduceStrategyType, CUstream_st*) + 117
4       0x7fb9e80dcb8d tensorrt_llm::plugins::AllreducePlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 973
5       0x7fb863705ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7fb863705ba9]
6       0x7fb8636db6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7fb8636db6af]
7       0x7fb8636dd320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7fb8636dd320]
8       0x7fb8aa147024 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 52
9       0x7fb8aa14ad61 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(std::map<unsigned long, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > >&) + 1025
10      0x7fb8aa14e598 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 3736
11      0x7fb8aa11d2f4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 36
12      0x7fb8aa12452f tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 287
13      0x7fbbf424f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fbbf424f253]
14      0x7fbbf3fdfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fbbf3fdfac3]
15      0x7fbbf4071660 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126660) [0x7fbbf4071660]
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: elts_per_rank % elts_per_thread == 0 (/app/tensorrt_llm/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu:324)
1       0x7f1e7426354a tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7f1e74265741 /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x77a741) [0x7f1e74265741]
3       0x7f1e743b1955 void tensorrt_llm::kernels::invokeOneOrTwoShotAllReduceKernel<__half>(tensorrt_llm::kernels::AllReduceParams&, tensorrt_llm::kernels::AllReduceStrategyType, CUstream_st*) + 117
4       0x7f1fb0280b8d tensorrt_llm::plugins::AllreducePlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 973
5       0x7f1e2f705ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7f1e2f705ba9]
6       0x7f1e2f6db6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7f1e2f6db6af]
7       0x7f1e2f6dd320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7f1e2f6dd320]
8       0x7f1e76147024 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 52
9       0x7f1e7614ad61 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(std::map<unsigned long, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > >&) + 1025
10      0x7f1e7614e598 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 3736
11      0x7f1e7611d2f4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 36
12      0x7f1e7612452f tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 287
13      0x7f21c024f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f21c024f253]
14      0x7f21bffdfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f21bffdfac3]
15      0x7f21c0071660 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126660) [0x7f21c0071660]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[trt-mixtral-chat-0:3614237] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
Signal (15) received.
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: elts_per_rank % elts_per_thread == 0 (/app/tensorrt_llm/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu:324)
1       0x7efc6426354a tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7efc64265741 /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x77a741) [0x7efc64265741]
3       0x7efc643b1955 void tensorrt_llm::kernels::invokeOneOrTwoShotAllReduceKernel<__half>(tensorrt_llm::kernels::AllReduceParams&, tensorrt_llm::kernels::AllReduceStrategyType, CUstream_st*) + 117
4       0x7efdb045fb8d tensorrt_llm::plugins::AllreducePlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 973
5       0x7efc1f705ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7efc1f705ba9]
6       0x7efc1f6db6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7efc1f6db6af]
7       0x7efc1f6dd320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7efc1f6dd320]
8       0x7efc66147024 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 52
9       0x7efc6614ad61 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(std::map<unsigned long, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > >&) + 1025
10      0x7efc6614e598 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 3736
11      0x7efc6611d2f4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 36
12      0x7efc6612452f tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 287
13      0x7effc0e4f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7effc0e4f253]
14      0x7effc0bdfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7effc0bdfac3]
15      0x7effc0c71660 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126660) [0x7effc0c71660]
Signal (6) received.
[trt-mixtral-chat-0:3614237] 2 more processes have sent help message help-mpi-api.txt / mpi-abort
[trt-mixtral-chat-0:3614237] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Traceback (most recent call last):
  File "/app/scripts/launch_triton_server.py", line 89, in run_cmd
    subprocess.run(cmd, check=True)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['mpirun', '--allow-run-as-root', '-n', '1', '/opt/tritonserver/bin/tritonserver', '--model-repository=/data/tgi-data/trt/mixtral-fp16-tp4-triton', '--log-verbose=3', '--log-file=triton_log.txt', '--grpc-port=8001', '--http-port=80', '--metrics-port=8002', '--disable-auto-complete-config', '--backend-config=python,shm-region-prefix-name=prefix0_', ':', '-n', '1', '/opt/tritonserver/bin/tritonserver', '--model-repository=/data/tgi-data/trt/mixtral-fp16-tp4-triton', '--model-control-mode=explicit', '--load-model=tensorrt_llm', '--grpc-port=8001', '--http-port=80', '--metrics-port=8002', '--disable-auto-complete-config', '--backend-config=python,shm-region-prefix-name=prefix1_', ':', '-n', '1', '/opt/tritonserver/bin/tritonserver', '--model-repository=/data/tgi-data/trt/mixtral-fp16-tp4-triton', '--model-control-mode=explicit', '--load-model=tensorrt_llm', '--grpc-port=8001', '--http-port=80', '--metrics-port=8002', '--disable-auto-complete-config', '--backend-config=python,shm-region-prefix-name=prefix2_', ':', '-n', '1', '/opt/tritonserver/bin/tritonserver', '--model-repository=/data/tgi-data/trt/mixtral-fp16-tp4-triton', '--model-control-mode=explicit', '--load-model=tensorrt_llm', '--grpc-port=8001', '--http-port=80', '--metrics-port=8002', '--disable-auto-complete-config', '--backend-config=python,shm-region-prefix-name=prefix3_', ':']' returned non-zero exit status 1.
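
For context, the assertion in the log comes from TensorRT-LLM's custom all-reduce kernel, which expects each rank's slice of the tensor to split evenly into per-thread vectorized chunks. Below is a rough Python sketch of that kind of check; the element size and vector width are illustrative assumptions, not values read from the kernel.

# Illustrative sketch of the failing divisibility check; the fp16 element
# size and 16-byte vector width here are assumptions, not kernel constants.
def check_custom_all_reduce(num_elements: int, tp_size: int,
                            dtype_bytes: int = 2, vector_bytes: int = 16) -> None:
    elts_per_rank = num_elements // tp_size        # slice reduced by each rank
    elts_per_thread = vector_bytes // dtype_bytes  # elements per vectorized load
    if elts_per_rank % elts_per_thread != 0:
        raise RuntimeError(
            f"Assertion failed: elts_per_rank ({elts_per_rank}) % "
            f"elts_per_thread ({elts_per_thread}) == 0"
        )

# A per-rank slice that is not a multiple of the vector width would trip the
# check; a typical well-aligned case passes.
check_custom_all_reduce(num_elements=4096 * 128, tp_size=4)  # passes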

cc: @kaiyux

Pernekhan avatar Mar 21 '24 23:03 Pernekhan

Hey! You're right, it is expected to queue the requests. Can you share the engine build command please? Also, the test script or command if possible.

schetlur-nv avatar Mar 27 '24 17:03 schetlur-nv

Here is the engine build command:

trtllm-build --checkpoint_dir /data/tgi-data/trtllm/mixtral-8x7b-tp-4-converted/ --remove_input_padding enable --gpt_attention_plugin float16 --context_fmha enable --gemm_plugin float16 --output_dir /data/tgi-data/trtllm/mixtral-fp16-tp4-engine --paged_kv_cache enable --max_batch_size 64 --max_input_len 32768 --max_output_len 4096 --workers 4 --max_num_tokens 327680

This is just a simple script we used to make it crash:

echo; time curl -Z --parallel-max 64 http://localhost:8000/v2/models/ensemble/generate?[1-64] -d @8k-context-req.txt --output -; echo
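
As a sanity check on the numbers (the 8K context length is inferred from the request file name, so treat it as an assumption): 64 concurrent 8K-token contexts add up to well past the engine's max_num_tokens of 327680, so the scheduler has to queue some of them rather than run them all at once.

# Back-of-the-envelope token budget for this reproduction (context length assumed).
max_num_tokens = 327680                 # from the trtllm-build command above
context_len = 8 * 1024                  # roughly 8K tokens per request
parallel_requests = 64

total_context_tokens = context_len * parallel_requests
print(total_context_tokens)                    # 524288
print(total_context_tokens > max_num_tokens)   # True, so some requests must wait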

Pernekhan avatar Mar 28 '24 20:03 Pernekhan

Any updates on this, @schetlur-nv?

Pernekhan avatar Apr 02 '24 18:04 Pernekhan

I will take a look at this.

thorjohnsen avatar Apr 04 '24 15:04 thorjohnsen

I have encountered a similar problem. My backend server crashes when the request concurrency is high. I posted the scripts I used in this issue:

https://github.com/triton-inference-server/tensorrtllm_backend/issues/392

silverriver avatar Apr 07 '24 06:04 silverriver

@Pernekhan Can you post the script that you use to make it crash? The link you provided is local to your machine.

thorjohnsen avatar Apr 08 '24 16:04 thorjohnsen

@thorjohnsen here is the script and the request file attached.

echo; time curl -Z --parallel-max 64 http://localhost:8000/v2/models/ensemble/generate?[1-64] -d @8k-context-req.txt --output -; echo

Here is the file: 8k-context-req.txt. You can also try any of your own 8k-context requests.

Pernekhan avatar Apr 08 '24 17:04 Pernekhan

Thank you @Pernekhan. Can you provide models/ensemble/config.pbtxt? Also, I am not too familiar with curl; is models/ensemble/generate a script? If so, please provide it.

thorjohnsen avatar Apr 08 '24 17:04 thorjohnsen

I used the configs from all_models/inflight_batcher_llm with batch_size 64.

Here is a Python script that does what the curl command is trying to do.

import requests
import concurrent.futures

# Define the URL
url = "http://localhost:8000/v2/models/ensemble/generate"

# Define the payload data file
payload_data_file = "8k-context-req.txt"

# Define the number of parallel requests
num_requests = 64

# Define a function to make the request
def make_request(url, data):
    response = requests.post(url, data=data)
    return response.text

# Load the payload data
with open(payload_data_file, 'rb') as file:
    data = file.read()

# Function to make parallel requests
def make_parallel_requests(url, data, num_requests):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Submit the requests
        futures = [executor.submit(make_request, url, data) for _ in range(num_requests)]
        # Wait for all requests to complete
        for future in concurrent.futures.as_completed(futures):
            try:
                response = future.result()
                print(response)
            except Exception as e:
                print(f"An error occurred: {e}")

# Make parallel requests
make_parallel_requests(url, data, num_requests)

Pernekhan avatar Apr 08 '24 18:04 Pernekhan

Hi @thorjohnsen, were you able to reproduce the issue?

Pernekhan avatar Apr 11 '24 18:04 Pernekhan

I am sorry, I was out of the office for a few days. I will resume work on this issue now.

thorjohnsen avatar Apr 15 '24 12:04 thorjohnsen

I can confirm that I am able to reproduce the issue. Now to find the cause.

thorjohnsen avatar Apr 19 '24 06:04 thorjohnsen

I agree with @Pernekhan that the crash likely happens when the total pending requests reach the max-num-tokens limit. The server runs fine as long as the number of parallel requests is low enough not to exceed the limit.
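
To make the expected behavior concrete, here is a minimal sketch of a token-budget scheduling policy; it is purely illustrative and not the actual GptManager scheduler.

# Sketch of a token-budget scheduling policy (illustrative only, not the
# real GptManager implementation).
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    request_id: int
    num_tokens: int  # context tokens still to be processed

def schedule_step(pending: deque, max_num_tokens: int) -> list:
    """Pick requests for the next batch without exceeding the token budget."""
    batch, budget = [], max_num_tokens
    while pending and pending[0].num_tokens <= budget:
        req = pending.popleft()
        budget -= req.num_tokens
        batch.append(req)
    # Requests left in `pending` wait for a later step instead of being
    # pushed into the engine past its limit.
    return batch

pending = deque(Request(i, 8192) for i in range(64))
batch = schedule_step(pending, max_num_tokens=327680)
print(len(batch), len(pending))  # 40 scheduled, 24 still queued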

thorjohnsen avatar Apr 19 '24 07:04 thorjohnsen

I don't see a crash with LLama-v2-7b, so this issue might only affect MoE models.

thorjohnsen avatar Apr 19 '24 07:04 thorjohnsen

A similar issue was reported internally by somebody at NVIDIA, and a fix is on the way. Daniel Stokes from our side will revisit this issue once that fix has been merged.

thorjohnsen avatar Apr 22 '24 04:04 thorjohnsen

Any updates on this?

We are using v0.10.0 with the default BLS config from: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/inflight_batcher_llm

Same issue with 8K-context requests.

8x A100 80G, model: Llama 2 13B.

It is much more stable with KV cache reuse disabled, but significantly slower.
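
For anyone trying to reproduce or work around this: KV cache reuse is controlled by the enable_kv_cache_reuse parameter in the tensorrt_llm model's config.pbtxt. The snippet below follows the all_models/inflight_batcher_llm template; the surrounding fields in your config may differ.

parameters: {
  key: "enable_kv_cache_reuse"
  value: {
    string_value: "false"
  }
}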

ekarmazin avatar Jul 03 '24 17:07 ekarmazin

This is the error we are getting with an 8K request with KV cache reuse enabled:

[TensorRT-LLM][ERROR] Encountered an error in forwardSync function: [TensorRT-LLM][ERROR] Assertion failed: blockedTokens.size() <= blockIds.size() (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:572)
1       0x7f84f02692b5 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2       0x7f83fa32a9a0 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x6b89a0) [0x7f83fa32a9a0]
3       0x7f83fc27f0ee tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::releaseBlocks(tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 574
4       0x7f83fc27f678 tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::removeSequence(int, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 296
5       0x7f83fc2abfa6 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::terminateRequest(std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&, bool) + 502
6       0x7f83fc2ae3cc tensorrt_llm::batch_manager::TrtGptModelInflightBatching::decoderSync(tensorrt_llm::batch_manager::ScheduledRequests const&, std::unique_ptr<tensorrt_llm::runtime::decoder_batch::Token const, std::default_delete<tensorrt_llm::runtime::decoder_batch::Token const> > const&) + 1724

ekarmazin avatar Jul 03 '24 18:07 ekarmazin