Getting a CUDA OOM runtime error when generation logits are enabled, and it never recovers
Hey, I am using TensorRT-LLM version 0.15. My model is Llama 3.2 3B and I am running on an A100. My use case requires the generation logits. I found that when I increase my traffic to 5+ for about 30 minutes, I get the following error:
[ERROR] CUDA runtime error in ::cudaMallocAsync(ptr, n, mMemPool->getPool(), mCudaStream->get()): out of memory (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:125)
1 0x7f45c6d96865 void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 149
2 0x7f45c8ab8d23 tensorrt_llm::runtime::BufferManager::gpu(nvinfer1::Dims64, nvinfer1::DataType) const + 515
3 0x7f45c8f82881 tensorrt_llm::batch_manager::RuntimeBuffers::reshape(tensorrt_llm::runtime::TllmRuntime const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&) + 2033
4 0x7f45c8f84fc7 tensorrt_llm::batch_manager::RuntimeBuffers::prepareStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int, int, tensorrt_llm::batch_manager::DecoderBuffers&, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager*, tensorrt_llm::batch_manager::kv_cache_manager::KVCacheManager*, tensorrt_llm::batch_manager::rnn_state_manager::RnnStateManager*, std::map<unsigned long, std::vector<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig, std::allocator<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::vector<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig, std::allocator<tensorrt_llm::runtime::LoraCache::TaskLayerModuleConfig> > > > > const&, tensorrt_llm::runtime::TllmRuntime const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&) + 103
5 0x7f45c8faa457 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::prepareBuffers(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, int) + 167
6 0x7f45c8faa9ce tensorrt_llm::batch_manager::TrtGptModelInflightBatching::executeStep(std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > > const&, std::vector<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>,
And Triton does not mark itself as failed; the /healthy endpoint still returns healthy.
Also, I found that GPU memory usage increased from 79 GB to 80 GB.
Hi @tlyhenry, could you provide the details of your trtllm-build command so that we can look through it?
Sure. We build the engine by passing the following parameters:

subprocesses.run_command(
    __cuda_visible_devices(num_gpu),
    "trtllm-build",
    "--checkpoint_dir", to_local_dir(checkpoint_dir),
    "--output_dir", tmpdir,
    "--gemm_plugin", dtype,
    f"--max_seq_len {max_seq_len}",
    f"--max_input_len {max_input_len}",
    f"--max_batch_size {max_batch_size}",
    "--use_paged_context_fmha enable" if use_paged_context_fmha else "",
    "--gather_context_logits" if gather_context_logits else "",
    "--remove_input_padding=enable" if use_remove_input_padding else "",
    "--paged_kv_cache=enable" if use_paged_kv_cache else "",
    "--context_fmha=enable" if use_context_fmha else "",
    "--gather_generation_logits" if gather_generation_logits else "",
)
In our build we enable gather_generation_logits, paged_kv_cache, context_fmha, gather_context_logits, and use_paged_context_fmha.
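In case it helps to double-check which of those options actually ended up in the built engine, here is a minimal sketch that reads the config.json that trtllm-build writes into the output directory. The engine path is a placeholder and the key layout (e.g. whether the options sit under build_config) can differ between releases, so treat those names as assumptions:

import json
from pathlib import Path

# Placeholder path; point this at the --output_dir used in the build above.
engine_dir = Path("/models/llama-3.2-3b/trt_engine")

# trtllm-build writes a config.json next to the engine files. The key names
# below are an assumption and may vary between TensorRT-LLM releases.
config = json.loads((engine_dir / "config.json").read_text())
build_config = config.get("build_config", config)

for key in (
    "max_batch_size",
    "max_input_len",
    "max_seq_len",
    "gather_context_logits",
    "gather_generation_logits",
):
    print(key, "=", build_config.get(key))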
Another thing I noticed is that GPU memory usage increases after each request.
@tlyhenry The trtllm-build command looks fine. The 'out of memory' (OOM) error in the log typically means either that your max_batch_size value is too large or that the KV cache cannot fit at runtime. You also need to be aware of your model's parameter count: if it is in the hundreds of billions, then combined with your dtype (e.g. FP16/BF16/FP8) you have to monitor how GPU memory usage evolves. In short, I suggest lowering --max_batch_size and using quantization where possible, which reduces the model's GPU memory footprint and leaves more room for the KV cache; the OOM error should then go away.
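To watch GPU memory while traffic is running, here is a minimal monitoring sketch using NVML (it assumes the nvidia-ml-py package is installed and that the engine is served on GPU 0; both are assumptions about your setup):

import time
import pynvml  # provided by the nvidia-ml-py package

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes the model runs on GPU 0
try:
    while True:
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        # A steady climb here across requests suggests buffers are not being
        # released, rather than a single allocation that is simply too large.
        print(f"used={info.used / 2**30:.2f} GiB / total={info.total / 2**30:.2f} GiB")
        time.sleep(5)
finally:
    pynvml.nvmlShutdown()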
I think it never releases the memory. I made the call multiple times and saw memory increase incrementally; it never goes down. So after a certain number of calls, it hits OOM.
@dominicshanshan I think memory usage keeps increasing when we enable the context logits. When I turned off the context logits and used the generation logits only, I did not see this issue.
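For what it's worth, here is a rough repro sketch of that comparison against a Triton deployment. The ensemble model name, the /v2/models/ensemble/generate endpoint, and the text_input / max_tokens / return_context_logits / return_generation_logits field names are assumptions based on a typical tensorrtllm_backend setup, so adjust them to match your deployment:

import subprocess
import requests

# Assumed Triton generate endpoint and ensemble model name.
TRITON_URL = "http://localhost:8000/v2/models/ensemble/generate"

def gpu_used_mib() -> int:
    # Reads used GPU memory via nvidia-smi; assumes a single visible GPU.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
    )
    return int(out.decode().splitlines()[0])

def run(n_requests: int, return_context_logits: bool) -> None:
    for i in range(n_requests):
        payload = {
            "text_input": "Hello, how are you?",            # assumed field name
            "max_tokens": 64,                               # assumed field name
            "return_context_logits": return_context_logits, # assumed field name
            "return_generation_logits": True,               # assumed field name
        }
        resp = requests.post(TRITON_URL, json=payload, timeout=120)
        resp.raise_for_status()
        print(f"request {i}: gpu_used={gpu_used_mib()} MiB")

# Per the report above: with context logits enabled, used memory keeps
# climbing across requests, while with them disabled it stays flat.
run(20, return_context_logits=True)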
But I am concerned it might be a bug in the context logits feature. Can you check on your end? I am using a 3B model on an A100 machine.
@tlyhenry Maybe there is a bug there that has been fixed in a newer release. Could you try the release/0.17 branch and the main branch?
Issue has not received an update in over 14 days. Adding stale label.
This issue was closed because it went 14 days without activity after being marked as stale.