
Getting runtime OOM error for generation logits and it never goes back to normal

tlyhenry opened this issue 9 months ago · 8 comments

Hey, I am using TensorRT-LLM version 0.15. My model is a Llama 3.2 3B model running on an A100, and I am using generation logits for my use case. I found that when I increase my traffic to 5+ for about 30 minutes, I get the error below:

    [ERROR] CUDA runtime error in ::cudaMallocAsync(ptr, n, mMemPool->getPool(), mCudaStream->get()): out of memory (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:125)
    1 0x7f45c6d96865 void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 149
    2 0x7f45c8ab8d23 tensorrt_llm::runtime::BufferManager::gpu(nvinfer1::Dims64, nvinfer1::DataType) const + 515
    3 0x7f45c8f82881 tensorrt_llm::batch_manager::RuntimeBuffers::reshape(tensorrt_llm::runtime::TllmRuntime const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&) + 2033
    4 0x7f45c8f84fc7 tensorrt_llm::batch_manager::RuntimeBuffers::prepareStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int, int, tensorrt_llm::batch_manager::DecoderBuffers&, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager*, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager*, tensorrt_llm::batch_manager::rnn_state_manager::RnnStateManager*, std::map<unsigned long, std::vector<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig, std::allocator<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::vector<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig, std::allocator<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig> > > > > const&, tensorrt_llm::runtime::TllmRuntime const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&) + 103
    5 0x7f45c8faa457 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::prepareBuffers(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) + 167
    6 0x7f45c8faa9ce tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>,

And Triton does not mark itself as failed; the /healthy endpoint keeps returning healthy.

Also, I found that GPU memory usage increased from 79 GB to 80 GB.
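
A minimal sketch for quantifying that growth per request, assuming `pynvml` (the `nvidia-ml-py` package) is installed and the server sits on GPU index 0; `send_one_request` is a hypothetical placeholder for whatever client call is being made:

```python
# Sketch: sample GPU memory before/after each request to confirm monotonic growth.
# Assumes nvidia-ml-py (pynvml) is installed and the server uses GPU index 0.
import pynvml

def used_mib(handle) -> float:
    """Currently used device memory in MiB."""
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used / (1024 ** 2)

def track_memory(send_one_request, num_requests: int = 50) -> None:
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        baseline = used_mib(handle)
        for i in range(num_requests):
            send_one_request()  # hypothetical client call (HTTP/gRPC to Triton, etc.)
            delta = used_mib(handle) - baseline
            print(f"request {i:03d}: +{delta:.1f} MiB over baseline")
    finally:
        pynvml.nvmlShutdown()
```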

tlyhenry avatar Mar 05 '25 22:03 tlyhenry

hi, @tlyhenry, could you provide details of your trtllm-build command which we can looking through?

dominicshanshan avatar Mar 06 '25 02:03 dominicshanshan

Sure. We are building it by passing the parameters:

    subprocesses.run_command(
        __cuda_visible_devices(num_gpu),
        "trtllm-build",
        "--checkpoint_dir", to_local_dir(checkpoint_dir),
        "--output_dir", tmpdir,
        "--gemm_plugin", dtype,
        f"--max_seq_len {max_seq_len}",
        f"--max_input_len {max_input_len}",
        f"--max_batch_size {max_batch_size}",
        "--use_paged_context_fmha enable" if use_paged_context_fmha else "",
        "--gather_context_logits" if gather_context_logits else "",
        f"--remove_input_padding=enable" if use_remove_input_padding else "",
        f"--paged_kv_cache=enable" if use_paged_kv_cache else "",
        f"--context_fmha=enable" if use_context_fmha else "",
        f"--gather_generation_logits" if gather_generation_logits else "",
    )

In our build we turn on gather_generation_logits, paged_kv_cache, context_fmha, gather_context_logits, and use_paged_context_fmha.
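
For reference, a rough sketch of what that wrapper resolves to as a plain subprocess call when all of those toggles are on; the paths, dtype, and size limits below are illustrative placeholders, not the actual production values:

```python
# Sketch of the resolved trtllm-build invocation with the toggles above enabled.
# Paths, dtype, and size limits are placeholders, not the real production values.
import os
import subprocess

subprocess.run(
    [
        "trtllm-build",
        "--checkpoint_dir", "/models/llama3.2-3b-ckpt",   # placeholder path
        "--output_dir", "/models/llama3.2-3b-engine",     # placeholder path
        "--gemm_plugin", "bfloat16",                      # placeholder dtype
        "--max_seq_len", "4096",                          # placeholder
        "--max_input_len", "3072",                        # placeholder
        "--max_batch_size", "8",                          # placeholder
        "--use_paged_context_fmha", "enable",
        "--remove_input_padding=enable",
        "--paged_kv_cache=enable",
        "--context_fmha=enable",
        "--gather_context_logits",
        "--gather_generation_logits",
    ],
    check=True,
    env={**os.environ, "CUDA_VISIBLE_DEVICES": "0"},  # stand-in for __cuda_visible_devices(num_gpu)
)
```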

tlyhenry avatar Mar 06 '25 17:03 tlyhenry

Another thing I noticed is that GPU memory usage increases after each request.

[Image: GPU memory usage increasing after each request]

tlyhenry avatar Mar 06 '25 18:03 tlyhenry

@tlyhenry The trtllm-build command looks fine. The error log shows 'Out of Memory' (OOM), which typically means either that your max_batch_size value is too large or that the KV cache cannot fit at runtime. You also have to keep your model's parameter count in mind: if it is in the hundreds of billions, combined with your dtype (e.g., FP16/BF16/FP8), you need to monitor how GPU memory usage evolves. In short, I suggest lowering --max_batch_size and using quantization where possible, which reduces the weights' GPU memory footprint and leaves more room for the KV cache; the OOM error should then go away.
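
As a back-of-envelope illustration of that budget for a 3B-parameter FP16 model on an 80 GB A100 (batch size, input length, and the assumption that logits buffers are kept in FP32 are illustrative, not measured values):

```python
# Rough, illustrative memory budget; all numbers are assumptions for illustration.
GIB = 1024 ** 3

params          = 3.2e9                                   # ~3B parameters
bytes_per_param = 2                                       # FP16/BF16 weights
weights_gib     = params * bytes_per_param / GIB          # ~6 GiB of weights

# A context-logits buffer scales with batch size * input length * vocab size.
vocab_size   = 128_256                                    # Llama 3 tokenizer vocabulary
max_batch    = 8                                          # illustrative
max_input    = 3_072                                      # illustrative
logits_bytes = 4                                          # assuming FP32 logits
context_logits_gib = max_batch * max_input * vocab_size * logits_bytes / GIB

print(f"weights:            ~{weights_gib:.1f} GiB")
print(f"context logits buf: ~{context_logits_gib:.1f} GiB "
      f"(batch {max_batch} x input {max_input} x vocab {vocab_size} x fp32)")
# Whatever remains of the 80 GiB after weights, activations, and these buffers
# is what the paged KV cache and CUDA memory pools have to share.
```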

dominicshanshan avatar Mar 07 '25 02:03 dominicshanshan

I think it never releases the memory. I tried making the call multiple times, and memory usage incrementally increases but never goes down. So after a certain number of calls, it hits OOM.

tlyhenry avatar Mar 07 '25 23:03 tlyhenry

@dominicshanshan I think that when we enable context logits, memory usage keeps increasing. I tried turning off context logits and using generation logits only, and I don't see this issue.

But I am concerned it might be a bug in the context logits feature. Can you check on your end? I am using a 3B model on an A100 machine.
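
A minimal repro sketch along these lines, assuming the engine is served through the stock tensorrtllm_backend `ensemble` model and Triton's `/v2/models/ensemble/generate` HTTP endpoint; the field names (`text_input`, `max_tokens`, `return_context_logits`) come from the standard backend templates and may need adjusting for a custom deployment:

```python
# Repro sketch: fire identical requests with/without context logits and compare
# how much device memory grows. Endpoint and field names assume the stock
# tensorrtllm_backend "ensemble" model; adjust for your deployment.
import requests
import pynvml

URL = "http://localhost:8000/v2/models/ensemble/generate"  # assumed Triton endpoint

def run_trial(return_context_logits: bool, n: int = 30) -> None:
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    start = pynvml.nvmlDeviceGetMemoryInfo(handle).used
    for _ in range(n):
        payload = {
            "text_input": "Summarize: TensorRT-LLM inflight batching.",
            "max_tokens": 64,
            "return_context_logits": return_context_logits,
        }
        requests.post(URL, json=payload, timeout=120).raise_for_status()
    grown_mib = (pynvml.nvmlDeviceGetMemoryInfo(handle).used - start) / 2**20
    print(f"context_logits={return_context_logits}: "
          f"memory grew by {grown_mib:.1f} MiB over {n} requests")
    pynvml.nvmlShutdown()

run_trial(return_context_logits=False)
run_trial(return_context_logits=True)
```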

tlyhenry avatar Mar 10 '25 16:03 tlyhenry

@tlyhenry Maybe there is a bug there that has been fixed in a newer release. Could you try release/0.17 and the main branch?

dominicshanshan avatar Mar 11 '25 02:03 dominicshanshan

Issue has not received an update in over 14 days. Adding stale label.

github-actions[bot] avatar Mar 25 '25 03:03 github-actions[bot]

This issue was closed because it has been 14 days without activity since it has been marked as stale.

github-actions[bot] avatar Apr 08 '25 03:04 github-actions[bot]