QI JUN
Hi @mahmoodn , 10 GB of device memory is not enough to quantize the GPT-J model. Please refer to the answer to a similar issue: https://github.com/NVIDIA/TensorRT-LLM/issues/1932#issuecomment-2227560712
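For a rough sense of why 10 GB is tight, here is a back-of-the-envelope sketch; the ~6B parameter count and fp16 loading before quantization are assumptions, not figures from the thread:

```python
# Rough estimate (assumption: GPT-J's ~6B parameters are loaded in fp16 before quantization).
params = 6.05e9           # approximate GPT-J parameter count (assumed)
bytes_per_param = 2       # fp16
weights_gb = params * bytes_per_param / 1e9
print(f"fp16 weights alone: ~{weights_gb:.1f} GB")  # ~12 GB, already above a 10 GB GPU
```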
Hi @mahmoodn , we do have plans to upload pre-quantized weights to the HF model hub in the future.
/bot reuse-pipeline
@Tracin Could you please have a look? Thanks
@nv-guomingz Could you please take a look? Thanks
Hi @BrechtCorbeel, thanks for your comments. > Since the previous pure Python CapacityScheduler ran into issues with pybind overhead, are there any benchmarks planned to compare the new implementation? In...
Hi @ngockhanh5110 , The original HF LLaMA 8B checkpoint is about 16G, and the `w4a8_awq` quantized checkpoint is about 4G. There is also some intermediate memory consumption during quantization....
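Those checkpoint sizes line up with a simple estimate; the sketch below assumes ~8B fp16 parameters and roughly 4-bit weights after `w4a8_awq`, and ignores quantization scales and other metadata:

```python
# Back-of-the-envelope checkpoint sizes (assumptions: ~8B params, fp16 original, ~4-bit quantized weights).
params = 8.03e9                     # approximate LLaMA 8B parameter count (assumed)
fp16_gb = params * 2 / 1e9          # 2 bytes per fp16 weight
int4_gb = params * 0.5 / 1e9        # ~0.5 bytes per 4-bit weight, excluding scales/zeros
print(f"fp16 checkpoint:  ~{fp16_gb:.1f} GB")  # ~16 GB, matches "about 16G"
print(f"w4a8_awq weights: ~{int4_gb:.1f} GB")  # ~4 GB, matches "about 4G"
```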
@Tracin could you please have a look? Thanks