QI JUN
Hi @mahmoodn , 10 GB of device memory is not enough to quantize the GPT-J model. Please refer to the answer to a similar issue: https://github.com/NVIDIA/TensorRT-LLM/issues/1932#issuecomment-2227560712
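For a rough sense of why 10 GB is tight, here is a back-of-the-envelope sketch; the ~6B parameter count and fp16 loading before quantization are assumptions, not figures from the thread:

```python
# Rough estimate (assumption: GPT-J's ~6B parameters are loaded in fp16 before quantization).
params = 6.05e9           # approximate GPT-J parameter count (assumed)
bytes_per_param = 2       # fp16
weights_gb = params * bytes_per_param / 1e9
print(f"fp16 weights alone: ~{weights_gb:.1f} GB")  # ~12 GB, already above a 10 GB GPU
```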
Hi @mahmoodn , we do have plans to upload pre-quantized weights to the HF model hub in the future.
/bot reuse-pipeline
@Tracin Could you please have a look? Thanks
@nv-guomingz Could you please take a look? Thanks
Hi @BrechtCorbeel, thanks for your comments. > Since the previous pure Python CapacityScheduler ran into issues with pybind overhead, are there any benchmarks planned to compare the new implementation? In...
Hi @ngockhanh5110 , The original HF LLaMA 8B checkpoint is about 16G, and the `w4a8_awq` quantized checkpoint is about 4G. There is also some intermediate memory consumption during quantization....
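Those checkpoint sizes line up with a simple estimate; the sketch below assumes ~8B fp16 parameters and roughly 4-bit weights after `w4a8_awq`, and ignores quantization scales and other metadata:

```python
# Back-of-the-envelope checkpoint sizes (assumptions: ~8B params, fp16 original, ~4-bit quantized weights).
params = 8.03e9                     # approximate LLaMA 8B parameter count (assumed)
fp16_gb = params * 2 / 1e9          # 2 bytes per fp16 weight
int4_gb = params * 0.5 / 1e9        # ~0.5 bytes per 4-bit weight, excluding scales/zeros
print(f"fp16 checkpoint:  ~{fp16_gb:.1f} GB")  # ~16 GB, matches "about 16G"
print(f"w4a8_awq weights: ~{int4_gb:.1f} GB")  # ~4 GB, matches "about 4G"
```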
@Tracin could you please have a look? Thanks