tensorrtllm_backend
Crashes for long context requests
trtllm crashes when I send long-context requests that are within the max-input-length limit.
I believe it happens when the total pending requests reach the max-num-tokens limit. But why isn't it queuing the requests instead of crashing?
Here is the crash log:
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] Assertion failed: elts_per_rank % elts_per_thread == 0 (/app/tensorrt_llm/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu:324)
1 0x7fe15c26354a tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2 0x7fe15c265741 /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x77a741) [0x7fe15c265741]
3 0x7fe15c3b1955 void tensorrt_llm::kernels::invokeOneOrTwoShotAllReduceKernel<__half>(tensorrt_llm::kernels::AllReduceParams&, tensorrt_llm::kernels::AllReduceStrategyType, CUstream_st*) + 117
4 0x7fe284521b8d tensorrt_llm::plugins::AllreducePlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 973
5 0x7fe117705ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7fe117705ba9]
6 0x7fe1176db6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7fe1176db6af]
7 0x7fe1176dd320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7fe1176dd320]
8 0x7fe15e147024 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 52
9 0x7fe15e14ad61 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(std::map<unsigned long, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > >&) + 1025
10 0x7fe15e14e598 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 3736
11 0x7fe15e11d2f4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 36
12 0x7fe15e12452f tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 287
13 0x7fe4a944f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fe4a944f253]
14 0x7fe4a91dfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fe4a91dfac3]
15 0x7fe4a9271660 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126660) [0x7fe4a9271660]
what(): [TensorRT-LLM][ERROR] Assertion failed: elts_per_rank % elts_per_thread == 0 (/app/tensorrt_llm/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu:324)
1 0x7fb8a826354a tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2 0x7fb8a8265741 /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x77a741) [0x7fb8a8265741]
3 0x7fb8a83b1955 void tensorrt_llm::kernels::invokeOneOrTwoShotAllReduceKernel<__half>(tensorrt_llm::kernels::AllReduceParams&, tensorrt_llm::kernels::AllReduceStrategyType, CUstream_st*) + 117
4 0x7fb9e80dcb8d tensorrt_llm::plugins::AllreducePlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 973
5 0x7fb863705ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7fb863705ba9]
6 0x7fb8636db6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7fb8636db6af]
7 0x7fb8636dd320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7fb8636dd320]
8 0x7fb8aa147024 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 52
9 0x7fb8aa14ad61 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(std::map<unsigned long, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > >&) + 1025
10 0x7fb8aa14e598 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 3736
11 0x7fb8aa11d2f4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 36
12 0x7fb8aa12452f tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 287
13 0x7fbbf424f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fbbf424f253]
14 0x7fbbf3fdfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fbbf3fdfac3]
15 0x7fbbf4071660 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126660) [0x7fbbf4071660]
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] Assertion failed: elts_per_rank % elts_per_thread == 0 (/app/tensorrt_llm/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu:324)
1 0x7f1e7426354a tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2 0x7f1e74265741 /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x77a741) [0x7f1e74265741]
3 0x7f1e743b1955 void tensorrt_llm::kernels::invokeOneOrTwoShotAllReduceKernel<__half>(tensorrt_llm::kernels::AllReduceParams&, tensorrt_llm::kernels::AllReduceStrategyType, CUstream_st*) + 117
4 0x7f1fb0280b8d tensorrt_llm::plugins::AllreducePlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 973
5 0x7f1e2f705ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7f1e2f705ba9]
6 0x7f1e2f6db6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7f1e2f6db6af]
7 0x7f1e2f6dd320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7f1e2f6dd320]
8 0x7f1e76147024 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 52
9 0x7f1e7614ad61 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(std::map<unsigned long, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > >&) + 1025
10 0x7f1e7614e598 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 3736
11 0x7f1e7611d2f4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 36
12 0x7f1e7612452f tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 287
13 0x7f21c024f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f21c024f253]
14 0x7f21bffdfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f21bffdfac3]
15 0x7f21c0071660 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126660) [0x7f21c0071660]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[trt-mixtral-chat-0:3614237] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
Signal (15) received.
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] Assertion failed: elts_per_rank % elts_per_thread == 0 (/app/tensorrt_llm/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu:324)
1 0x7efc6426354a tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2 0x7efc64265741 /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x77a741) [0x7efc64265741]
3 0x7efc643b1955 void tensorrt_llm::kernels::invokeOneOrTwoShotAllReduceKernel<__half>(tensorrt_llm::kernels::AllReduceParams&, tensorrt_llm::kernels::AllReduceStrategyType, CUstream_st*) + 117
4 0x7efdb045fb8d tensorrt_llm::plugins::AllreducePlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 973
5 0x7efc1f705ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7efc1f705ba9]
6 0x7efc1f6db6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7efc1f6db6af]
7 0x7efc1f6dd320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7efc1f6dd320]
8 0x7efc66147024 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 52
9 0x7efc6614ad61 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(std::map<unsigned long, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > >&) + 1025
10 0x7efc6614e598 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 3736
11 0x7efc6611d2f4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 36
12 0x7efc6612452f tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 287
13 0x7effc0e4f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7effc0e4f253]
14 0x7effc0bdfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7effc0bdfac3]
15 0x7effc0c71660 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126660) [0x7effc0c71660]
Signal (6) received.
[trt-mixtral-chat-0:3614237] 2 more processes have sent help message help-mpi-api.txt / mpi-abort
[trt-mixtral-chat-0:3614237] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Traceback (most recent call last):
File "/app/scripts/launch_triton_server.py", line 89, in run_cmd
subprocess.run(cmd, check=True)
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['mpirun', '--allow-run-as-root', '-n', '1', '/opt/tritonserver/bin/tritonserver', '--model-repository=/data/tgi-data/trt/mixtral-fp16-tp4-triton', '--log-verbose=3', '--log-file=triton_log.txt', '--grpc-port=8001', '--http-port=80', '--metrics-port=8002', '--disable-auto-complete-config', '--backend-config=python,shm-region-prefix-name=prefix0_', ':', '-n', '1', '/opt/tritonserver/bin/tritonserver', '--model-repository=/data/tgi-data/trt/mixtral-fp16-tp4-triton', '--model-control-mode=explicit', '--load-model=tensorrt_llm', '--grpc-port=8001', '--http-port=80', '--metrics-port=8002', '--disable-auto-complete-config', '--backend-config=python,shm-region-prefix-name=prefix1_', ':', '-n', '1', '/opt/tritonserver/bin/tritonserver', '--model-repository=/data/tgi-data/trt/mixtral-fp16-tp4-triton', '--model-control-mode=explicit', '--load-model=tensorrt_llm', '--grpc-port=8001', '--http-port=80', '--metrics-port=8002', '--disable-auto-complete-config', '--backend-config=python,shm-region-prefix-name=prefix2_', ':', '-n', '1', '/opt/tritonserver/bin/tritonserver', '--model-repository=/data/tgi-data/trt/mixtral-fp16-tp4-triton', '--model-control-mode=explicit', '--load-model=tensorrt_llm', '--grpc-port=8001', '--http-port=80', '--metrics-port=8002', '--disable-auto-complete-config', '--backend-config=python,shm-region-prefix-name=prefix3_', ':']' returned non-zero exit status 1.
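For reference, the assertion that fires is a divisibility check inside the custom all-reduce kernel: the number of elements each tensor-parallel rank reduces must be a multiple of the per-thread vector width. Below is a rough Python sketch of what that check amounts to; the hidden size and per-thread element count are illustrative assumptions, not values read from this engine.

# Rough illustration of the check behind
# "Assertion failed: elts_per_rank % elts_per_thread == 0".
# All concrete numbers here are assumptions for illustration only.
tp_size = 4               # tensor parallelism used for this engine
hidden_size = 4096        # assumed hidden dimension
elts_per_thread = 8       # assumed vector width (e.g. 8 fp16 elements per thread)

def all_reduce_check(batch_tokens):
    total_elts = batch_tokens * hidden_size   # elements in the tensor being reduced
    elts_per_rank = total_elts // tp_size     # slice handled by each rank
    if elts_per_rank % elts_per_thread != 0:
        raise AssertionError(
            f"elts_per_rank={elts_per_rank} is not divisible by "
            f"elts_per_thread={elts_per_thread}")

all_reduce_check(batch_tokens=8192)   # passes or fails depending on the real kernel parameters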
cc: @kaiyux
Hey! You're right, it is expected to queue the requests. Can you share the engine build command please? Also, the test script or command if possible.
Here is the engine build command:
trtllm-build --checkpoint_dir /data/tgi-data/trtllm/mixtral-8x7b-tp-4-converted/ --remove_input_padding enable --gpt_attention_plugin float16 --context_fmha enable --gemm_plugin float16 --output_dir /data/tgi-data/trtllm/mixtral-fp16-tp4-engine --paged_kv_cache enable --max_batch_size 64 --max_input_len 32768 --max_output_len 4096 --workers 4 --max_num_tokens 327680
This is a simple script we used to make it crash:
echo; time curl -Z --parallel-max 64 http://localhost:8000/v2/models/ensemble/generate?[1-64] -d @8k-context-req.txt --output -; echo
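For what it's worth, a quick back-of-the-envelope on the token budget suggests this test does exceed max_num_tokens. The ~8K context per request is an assumption based on the 8k-context-req.txt file, not a measured token count:

# Token-budget check using the numbers from the trtllm-build command above.
# The ~8K context length per request is an assumption based on 8k-context-req.txt.
max_num_tokens = 327_680          # --max_num_tokens
context_len = 8_192               # assumed tokens per request
parallel_requests = 64            # curl --parallel-max 64

pending_context_tokens = parallel_requests * context_len
print(pending_context_tokens)                    # 524288
print(pending_context_tokens > max_num_tokens)   # True: the 64 requests exceed the budget
print(max_num_tokens // context_len)             # 40 such requests fit within max_num_tokens

The scheduler is still expected to queue the excess rather than crash, so this only suggests when the crash is triggered, not why.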
Any updates on this, @schetlur-nv?
I will take a look at this.
I have encountered a similar problem. My backend server crashes when the request concurrency is high. I posted the scripts I used in this issue:
https://github.com/triton-inference-server/tensorrtllm_backend/issues/392
@Pernekhan Can you post the script that you use to make it crash? The link you provided is local to your machine.
@thorjohnsen here is the script and the request file attached.
echo; time curl -Z --parallel-max 64 http://localhost:8000/v2/models/ensemble/generate?[1-64] -d @8k-context-req.txt --output -; echo
Here is the file 8k-context-req.txt. You can also try any of your own 8k context requests.
8k-context-req.txt
Thank you @Pernekhan. Can you provide models/ensemble/config.pbtxt? Also, I am not too familiar with curl; is models/ensemble/generate a script? If so, please provide it.
I used the configs from all_models/inflight_batcher_llm with batch_size 64.
Here is the script that does what curl is trying to do.
import requests
import concurrent.futures

# Define the URL
url = "http://localhost:8000/v2/models/ensemble/generate"

# Define the payload data file
payload_data_file = "8k-context-req.txt"

# Define the number of parallel requests
num_requests = 64

# Define a function to make the request
def make_request(url, data):
    response = requests.post(url, data=data)
    return response.text

# Load the payload data
with open(payload_data_file, 'rb') as file:
    data = file.read()

# Function to make parallel requests
def make_parallel_requests(url, data, num_requests):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Submit the requests
        futures = [executor.submit(make_request, url, data) for _ in range(num_requests)]
        # Wait for all requests to complete
        for future in concurrent.futures.as_completed(futures):
            try:
                response = future.result()
                print(response)
            except Exception as e:
                print(f"An error occurred: {e}")

# Make parallel requests
make_parallel_requests(url, data, num_requests)
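If the max-num-tokens theory above holds, one quick diagnostic (not a fix; the scheduler should be queuing the excess on its own) is to cap the concurrency so the pending context stays under the budget:

# Diagnostic only: keep the aggregate pending context under max_num_tokens (327,680),
# assuming ~8K context tokens per request as in 8k-context-req.txt.
capped_requests = 327_680 // 8_192   # 40 instead of 64
make_parallel_requests(url, data, capped_requests)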
Hi @thorjohnsen, were you able to reproduce the issue?
I am sorry, I was OOTO for a few days. I will resume work on this issue now.
I can confirm that I am able to reproduce the issue. Now to find the cause.
I agree with @Pernekhan that the crash likely happens when the total pending requests reach the max-num-tokens limit. The server runs fine as long as the number of parallel requests is low enough not to exceed the limit.
I don't see a crash with LLama-v2-7b, so this issue might only affect MoE models.
A similar issue was reported internally by somebody at NVIDIA, and a fix is on the way. Daniel Stokes from our side will revisit this issue once that fix has been merged.
Any updates on this?
We are using v0.10.0 with the default BLS config from: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/inflight_batcher_llm
Same issue with 8K context requests.
8xA100 80G, model is: llama2 13B
It is much more stable with KV Cache reuse disabled, but significantly slower.
This is the error we get with an 8K request when KV Cache reuse is enabled:
[TensorRT-LLM][ERROR] Encountered an error in forwardSync function: [TensorRT-LLM][ERROR] Assertion failed: blockedTokens.size() <= blockIds.size() (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:572)
1 0x7f84f02692b5 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2 0x7f83fa32a9a0 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x6b89a0) [0x7f83fa32a9a0]
3 0x7f83fc27f0ee tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::releaseBlocks(tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 574
4 0x7f83fc27f678 tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::removeSequence(int, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 296
5 0x7f83fc2abfa6 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::terminateRequest(std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&, bool) + 502
6 0x7f83fc2ae3cc tensorrt_llm::batch_manager::TrtGptModelInflightBatching::decoderSync(tensorrt_llm::batch_manager::ScheduledRequests const&, std::unique_ptr<tensorrt_llm::runtime::decoder_batch::Token const, std::default_delete<tensorrt_llm::runtime::decoder_batch::Token const> > const&) + 1724