
Results 5 comments of tlyhenry

I had this issue with my qwen2.5 14b model on 0.17. Does it mean I need to upgrade to a newer version?

Sure. We are building it by passing the parameters:

```python
subprocesses.run_command(
    __cuda_visible_devices(num_gpu),
    "trtllm-build",
    "--checkpoint_dir", to_local_dir(checkpoint_dir),
    "--output_dir", tmpdir,
    "--gemm_plugin", dtype,
    f"--max_seq_len {max_seq_len}",
    f"--max_input_len {max_input_len}",
    f"--max_batch_size {max_batch_size}",
    "--use_paged_context_fmha enable" if use_paged_context_fmha else "",
    ...
```

Another thing I noticed is that after each request, our GPU memory usage increases.

I think it never releases the memory. I tried making the call multiple times, and I do see it incrementally increase; it never goes down. So after a certain amount of...

@dominicshanshan I think when we enable the context logits, the memory usage keeps increasing. I tried turning off the context logits and using the generation logits only. I...
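For reference, a minimal sketch of what "generation logits only" could look like at build time, assuming the standard `trtllm-build` gather flags (flag names and defaults may differ between TensorRT-LLM versions, and the checkpoint/output paths below are placeholders):

```shell
# Hypothetical build invocation: request only generation logits and leave
# context-logits gathering disabled. Gathering context logits makes the
# runtime keep a per-request logits buffer over the whole input, which is
# one plausible source of the growing GPU memory described above.
trtllm-build \
    --checkpoint_dir ./qwen2.5-14b-ckpt \
    --output_dir ./engine_dir \
    --gemm_plugin bfloat16 \
    --gather_generation_logits
# Note: no --gather_context_logits here; add it back only if you actually
# consume the context logits in your serving code.
```

This is only an illustration of the workaround described in the comment, not a confirmed fix for the underlying leak.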