tensorrtllm_backend

Crashes for long context requests

Open · Pernekhan opened this issue 1 year ago • 17 comments

trtllm crashes when I send long-context requests that are within the max-input-length limit.

I believe it happens when the total pending requests reach the max-num-tokens limit. But why isn't it queuing requests instead of crashing?

Here is the crash log:

terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: elts_per_rank % elts_per_thread == 0 (/app/tensorrt_llm/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu:324)
1       0x7fe15c26354a tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7fe15c265741 /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x77a741) [0x7fe15c265741]
3       0x7fe15c3b1955 void tensorrt_llm::kernels::invokeOneOrTwoShotAllReduceKernel<__half>(tensorrt_llm::kernels::AllReduceParams&, tensorrt_llm::kernels::AllReduceStrategyType, CUstream_st*) + 117
4       0x7fe284521b8d tensorrt_llm::plugins::AllreducePlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 973
5       0x7fe117705ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7fe117705ba9]
6       0x7fe1176db6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7fe1176db6af]
7       0x7fe1176dd320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7fe1176dd320]
8       0x7fe15e147024 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 52
9       0x7fe15e14ad61 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(std::map<unsigned long, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > >&) + 1025
10      0x7fe15e14e598 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 3736
11      0x7fe15e11d2f4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 36
12      0x7fe15e12452f tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 287
13      0x7fe4a944f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fe4a944f253]
14      0x7fe4a91dfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fe4a91dfac3]
15      0x7fe4a9271660 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126660) [0x7fe4a9271660]
  what():  [TensorRT-LLM][ERROR] Assertion failed: elts_per_rank % elts_per_thread == 0 (/app/tensorrt_llm/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu:324)
1       0x7fb8a826354a tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7fb8a8265741 /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x77a741) [0x7fb8a8265741]
3       0x7fb8a83b1955 void tensorrt_llm::kernels::invokeOneOrTwoShotAllReduceKernel<__half>(tensorrt_llm::kernels::AllReduceParams&, tensorrt_llm::kernels::AllReduceStrategyType, CUstream_st*) + 117
4       0x7fb9e80dcb8d tensorrt_llm::plugins::AllreducePlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 973
5       0x7fb863705ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7fb863705ba9]
6       0x7fb8636db6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7fb8636db6af]
7       0x7fb8636dd320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7fb8636dd320]
8       0x7fb8aa147024 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 52
9       0x7fb8aa14ad61 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(std::map<unsigned long, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > >&) + 1025
10      0x7fb8aa14e598 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 3736
11      0x7fb8aa11d2f4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 36
12      0x7fb8aa12452f tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 287
13      0x7fbbf424f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fbbf424f253]
14      0x7fbbf3fdfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fbbf3fdfac3]
15      0x7fbbf4071660 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126660) [0x7fbbf4071660]
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: elts_per_rank % elts_per_thread == 0 (/app/tensorrt_llm/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu:324)
1       0x7f1e7426354a tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7f1e74265741 /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x77a741) [0x7f1e74265741]
3       0x7f1e743b1955 void tensorrt_llm::kernels::invokeOneOrTwoShotAllReduceKernel<__half>(tensorrt_llm::kernels::AllReduceParams&, tensorrt_llm::kernels::AllReduceStrategyType, CUstream_st*) + 117
4       0x7f1fb0280b8d tensorrt_llm::plugins::AllreducePlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 973
5       0x7f1e2f705ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7f1e2f705ba9]
6       0x7f1e2f6db6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7f1e2f6db6af]
7       0x7f1e2f6dd320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7f1e2f6dd320]
8       0x7f1e76147024 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 52
9       0x7f1e7614ad61 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(std::map<unsigned long, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > >&) + 1025
10      0x7f1e7614e598 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 3736
11      0x7f1e7611d2f4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 36
12      0x7f1e7612452f tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 287
13      0x7f21c024f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f21c024f253]
14      0x7f21bffdfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f21bffdfac3]
15      0x7f21c0071660 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126660) [0x7f21c0071660]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[trt-mixtral-chat-0:3614237] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
Signal (15) received.
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: elts_per_rank % elts_per_thread == 0 (/app/tensorrt_llm/cpp/tensorrt_llm/kernels/customAllReduceKernels.cu:324)
1       0x7efc6426354a tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7efc64265741 /app/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0x77a741) [0x7efc64265741]
3       0x7efc643b1955 void tensorrt_llm::kernels::invokeOneOrTwoShotAllReduceKernel<__half>(tensorrt_llm::kernels::AllReduceParams&, tensorrt_llm::kernels::AllReduceStrategyType, CUstream_st*) + 117
4       0x7efdb045fb8d tensorrt_llm::plugins::AllreducePlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 973
5       0x7efc1f705ba9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10cdba9) [0x7efc1f705ba9]
6       0x7efc1f6db6af /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a36af) [0x7efc1f6db6af]
7       0x7efc1f6dd320 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10a5320) [0x7efc1f6dd320]
8       0x7efc66147024 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeContext(int) + 52
9       0x7efc6614ad61 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeBatch(std::map<unsigned long, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > >&) + 1025
10      0x7efc6614e598 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&) + 3736
11      0x7efc6611d2f4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less<unsigned long>, std::allocator<unsigned long> >&) + 36
12      0x7efc6612452f tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 287
13      0x7effc0e4f253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7effc0e4f253]
14      0x7effc0bdfac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7effc0bdfac3]
15      0x7effc0c71660 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126660) [0x7effc0c71660]
Signal (6) received.
[trt-mixtral-chat-0:3614237] 2 more processes have sent help message help-mpi-api.txt / mpi-abort
[trt-mixtral-chat-0:3614237] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Traceback (most recent call last):
  File "/app/scripts/launch_triton_server.py", line 89, in run_cmd
    subprocess.run(cmd, check=True)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['mpirun', '--allow-run-as-root', '-n', '1', '/opt/tritonserver/bin/tritonserver', '--model-repository=/data/tgi-data/trt/mixtral-fp16-tp4-triton', '--log-verbose=3', '--log-file=triton_log.txt', '--grpc-port=8001', '--http-port=80', '--metrics-port=8002', '--disable-auto-complete-config', '--backend-config=python,shm-region-prefix-name=prefix0_', ':', '-n', '1', '/opt/tritonserver/bin/tritonserver', '--model-repository=/data/tgi-data/trt/mixtral-fp16-tp4-triton', '--model-control-mode=explicit', '--load-model=tensorrt_llm', '--grpc-port=8001', '--http-port=80', '--metrics-port=8002', '--disable-auto-complete-config', '--backend-config=python,shm-region-prefix-name=prefix1_', ':', '-n', '1', '/opt/tritonserver/bin/tritonserver', '--model-repository=/data/tgi-data/trt/mixtral-fp16-tp4-triton', '--model-control-mode=explicit', '--load-model=tensorrt_llm', '--grpc-port=8001', '--http-port=80', '--metrics-port=8002', '--disable-auto-complete-config', '--backend-config=python,shm-region-prefix-name=prefix2_', ':', '-n', '1', '/opt/tritonserver/bin/tritonserver', '--model-repository=/data/tgi-data/trt/mixtral-fp16-tp4-triton', '--model-control-mode=explicit', '--load-model=tensorrt_llm', '--grpc-port=8001', '--http-port=80', '--metrics-port=8002', '--disable-auto-complete-config', '--backend-config=python,shm-region-prefix-name=prefix3_', ':']' returned non-zero exit status 1.
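
For context, the assertion in the log comes from TensorRT-LLM's custom all-reduce kernel, which expects each rank's slice of the tensor to split evenly into per-thread vectorized chunks. Below is a rough Python sketch of that kind of check; the element size and vector width are illustrative assumptions, not values read from the kernel.

# Illustrative sketch of the failing divisibility check; the fp16 element
# size and 16-byte vector width here are assumptions, not kernel constants.
def check_custom_all_reduce(num_elements: int, tp_size: int,
                            dtype_bytes: int = 2, vector_bytes: int = 16) -> None:
    elts_per_rank = num_elements // tp_size        # slice reduced by each rank
    elts_per_thread = vector_bytes // dtype_bytes  # elements per vectorized load
    if elts_per_rank % elts_per_thread != 0:
        raise RuntimeError(
            f"Assertion failed: elts_per_rank ({elts_per_rank}) % "
            f"elts_per_thread ({elts_per_thread}) == 0"
        )

# A per-rank slice that is not a multiple of the vector width would trip the
# check; a typical well-aligned case passes.
check_custom_all_reduce(num_elements=4096 * 128, tp_size=4)  # passes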

cc: @kaiyux

Pernekhan avatar Mar 21 '24 23:03 Pernekhan

Hey! You're right, it is expected to queue the requests. Can you share the engine build command please? Also, the test script or command if possible.

schetlur-nv avatar Mar 27 '24 17:03 schetlur-nv

Here is the engine build command:

trtllm-build --checkpoint_dir /data/tgi-data/trtllm/mixtral-8x7b-tp-4-converted/ --remove_input_padding enable --gpt_attention_plugin float16 --context_fmha enable --gemm_plugin float16 --output_dir /data/tgi-data/trtllm/mixtral-fp16-tp4-engine --paged_kv_cache enable --max_batch_size 64 --max_input_len 32768 --max_output_len 4096 --workers 4 --max_num_tokens 327680

This is just a simple script we used to make it crash:

echo; time curl -Z --parallel-max 64 http://localhost:8000/v2/models/ensemble/generate?[1-64] -d @8k-context-req.txt --output -; echo
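
As a sanity check on the numbers (the 8K context length is inferred from the request file name, so treat it as an assumption): 64 concurrent 8K-token contexts add up to well past the engine's max_num_tokens of 327680, so the scheduler has to queue some of them rather than run them all at once.

# Back-of-the-envelope token budget for this reproduction (context length assumed).
max_num_tokens = 327680                 # from the trtllm-build command above
context_len = 8 * 1024                  # roughly 8K tokens per request
parallel_requests = 64

total_context_tokens = context_len * parallel_requests
print(total_context_tokens)                    # 524288
print(total_context_tokens > max_num_tokens)   # True, so some requests must wait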

Pernekhan avatar Mar 28 '24 20:03 Pernekhan

Any updates on this, @schetlur-nv?

Pernekhan avatar Apr 02 '24 18:04 Pernekhan

I will take a look at this.

thorjohnsen avatar Apr 04 '24 15:04 thorjohnsen

I have encountered a similar problem. My backend server crashes when the request concurrency is high. I posted the scripts I used in this issue:

https://github.com/triton-inference-server/tensorrtllm_backend/issues/392

silverriver avatar Apr 07 '24 06:04 silverriver

@Pernekhan Can you post the script that you use to make it crash? The link you provided is local to your machine.

thorjohnsen avatar Apr 08 '24 16:04 thorjohnsen

@thorjohnsen here is the script and the request file attached.

echo; time curl -Z --parallel-max 64 http://localhost:8000/v2/models/ensemble/generate?[1-64] -d @8k-context-req.txt --output -; echo

Here is the file: 8k-context-req.txt. You can also try any of your own 8k-context requests.

Pernekhan avatar Apr 08 '24 17:04 Pernekhan

Thank you @Pernekhan. Can you provide models/ensemble/config.pbtxt? Also, I am not too familiar with curl; is models/ensemble/generate a script? If so, please provide it.

thorjohnsen avatar Apr 08 '24 17:04 thorjohnsen

I used the configs from all_models/inflight_batcher_llm with batch_size 64.

Here is a Python script that does what the curl command is trying to do.

import requests
import concurrent.futures

# Define the URL
url = "http://localhost:8000/v2/models/ensemble/generate"

# Define the payload data file
payload_data_file = "8k-context-req.txt"

# Define the number of parallel requests
num_requests = 64

# Define a function to make the request
def make_request(url, data):
    response = requests.post(url, data=data)
    return response.text

# Load the payload data
with open(payload_data_file, 'rb') as file:
    data = file.read()

# Function to make parallel requests
def make_parallel_requests(url, data, num_requests):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Submit the requests
        futures = [executor.submit(make_request, url, data) for _ in range(num_requests)]
        # Wait for all requests to complete
        for future in concurrent.futures.as_completed(futures):
            try:
                response = future.result()
                print(response)
            except Exception as e:
                print(f"An error occurred: {e}")

# Make parallel requests
make_parallel_requests(url, data, num_requests)

Pernekhan avatar Apr 08 '24 18:04 Pernekhan

Hi @thorjohnsen, were you able to reproduce the issue?

Pernekhan avatar Apr 11 '24 18:04 Pernekhan

I am sorry, I was out of the office for a few days. I will resume work on this issue now.

thorjohnsen avatar Apr 15 '24 12:04 thorjohnsen

I can confirm that I am able to reproduce the issue. Now to find the cause.

thorjohnsen avatar Apr 19 '24 06:04 thorjohnsen

I agree with @Pernekhan that the crash likely happens when the total pending requests reach the max-num-tokens limit. The server runs fine as long as the number of parallel requests is low enough not to exceed the limit.
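
To make the expected behavior concrete, here is a minimal sketch of a token-budget scheduling policy; it is purely illustrative and not the actual GptManager scheduler.

# Sketch of a token-budget scheduling policy (illustrative only, not the
# real GptManager implementation).
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    request_id: int
    num_tokens: int  # context tokens still to be processed

def schedule_step(pending: deque, max_num_tokens: int) -> list:
    """Pick requests for the next batch without exceeding the token budget."""
    batch, budget = [], max_num_tokens
    while pending and pending[0].num_tokens <= budget:
        req = pending.popleft()
        budget -= req.num_tokens
        batch.append(req)
    # Requests left in `pending` wait for a later step instead of being
    # pushed into the engine past its limit.
    return batch

pending = deque(Request(i, 8192) for i in range(64))
batch = schedule_step(pending, max_num_tokens=327680)
print(len(batch), len(pending))  # 40 scheduled, 24 still queued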

thorjohnsen avatar Apr 19 '24 07:04 thorjohnsen

I don't see a crash with LLama-v2-7b, so this issue might only affect MoE models.

thorjohnsen avatar Apr 19 '24 07:04 thorjohnsen

A similar issue was reported internally by somebody at NVIDIA, and a fix is on the way. Daniel Stokes from our side will revisit this issue once that fix has been merged.

thorjohnsen avatar Apr 22 '24 04:04 thorjohnsen

Any updates on this?

We are using v0.10.0 with the default BLS config from: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/inflight_batcher_llm

Same issue with 8K-context requests.

8x A100 80G, model: Llama 2 13B.

It is much more stable with KV cache reuse disabled, but significantly slower.
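
For anyone trying to reproduce or work around this: KV cache reuse is controlled by the enable_kv_cache_reuse parameter in the tensorrt_llm model's config.pbtxt. The snippet below follows the all_models/inflight_batcher_llm template; the surrounding fields in your config may differ.

parameters: {
  key: "enable_kv_cache_reuse"
  value: {
    string_value: "false"
  }
}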

ekarmazin avatar Jul 03 '24 17:07 ekarmazin

This is the error we are getting with an 8K request with KV cache reuse enabled:

[TensorRT-LLM][ERROR] Encountered an error in forwardSync function: [TensorRT-LLM][ERROR] Assertion failed: blockedTokens.size() <= blockIds.size() (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:572)
1       0x7f84f02692b5 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2       0x7f83fa32a9a0 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x6b89a0) [0x7f83fa32a9a0]
3       0x7f83fc27f0ee tensorrt_llm::batch_manager::kv_cache_manager::BlockManager::releaseBlocks(tensorrt_llm::batch_manager::kv_cache_manager::GenerationRequest&, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 574
4       0x7f83fc27f678 tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager::removeSequence(int, std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 296
5       0x7f83fc2abfa6 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::terminateRequest(std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&, bool) + 502
6       0x7f83fc2ae3cc tensorrt_llm::batch_manager::TrtGptModelInflightBatching::decoderSync(tensorrt_llm::batch_manager::ScheduledRequests const&, std::unique_ptr<tensorrt_llm::runtime::decoder_batch::Token const, std::default_delete<tensorrt_llm::runtime::decoder_batch::Token const> > const&) + 1724

ekarmazin avatar Jul 03 '24 18:07 ekarmazin