GPTQ-triton

Needs more VRAM than normal GPTQ CUDA version?

DanielWe2 opened this issue Mar 28 '23 · 3 comments

Thanks, I wanted to try your Triton version, but I only have 8 GB of VRAM.

The GPTQ CUDA version works (7B model). Your version (the ppl script) crashes with a CUDA OOM.

Is that to be expected, or can it be solved?

DanielWe2 · Mar 28 '23 19:03

Thank you for the bug report.

The ppl script uses the full 2048 context length, which takes about 8 GB of GPU RAM with both the original CUDA kernel and the Triton kernel. That's probably why you're getting the OOM. You can modify the ppl script to use a shorter context length and then it should work fine. I haven't added a CLI arg to adjust that yet, sorry.
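For illustration, a minimal sketch of that change, assuming the ppl script evaluates the tokenized dataset in non-overlapping fixed-length windows with an HF-style causal LM (the names `perplexity`, `model`, `tokens`, and `seqlen` here are hypothetical, not the script's actual identifiers):

```python
import torch

@torch.no_grad()
def perplexity(model, tokens, seqlen=1024):
    """Perplexity over `tokens` (a 1 x N LongTensor of token ids),
    evaluated in non-overlapping windows of `seqlen` tokens.
    Dropping `seqlen` from 2048 to 1024 roughly halves the
    activation memory of each forward pass."""
    nlls = []
    n_windows = tokens.numel() // seqlen
    for i in range(n_windows):
        window = tokens[:, i * seqlen : (i + 1) * seqlen]
        # HF-style causal LMs shift the labels internally when
        # `labels=` is passed, so `out.loss` is the mean next-token
        # NLL for this window; rescale to a summed NLL.
        out = model(window, labels=window)
        nlls.append(out.loss * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (n_windows * seqlen))
```

The trade-off: perplexity measured over shorter windows is typically a bit higher, since the model gets less context for each prediction, so the numbers aren't directly comparable to 2048-context results.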

fpgaminer · Mar 28 '23 20:03

No problem.

What I don't understand: the GPTQ CUDA version works with the 2048 context length (the benchmarks that output ppl). So does your version use a bit more memory?

DanielWe2 · Mar 28 '23 21:03

If I recall correctly, the benchmarks in the GPTQ-for-LLaMA codebase do some caching and other tricks to lower inference memory a little. That's probably just enough to squeeze under the 8 GB threshold.
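One trick of that sort, sketched here as an assumption rather than the actual GPTQ-for-LLaMA code: stream the transformer layers through the GPU one at a time, so only a single layer's weights are resident alongside the activations.

```python
import torch

@torch.no_grad()
def layerwise_forward(layers, hidden_states):
    """Hypothetical sketch: keep all decoder layers on the CPU and
    move each one to the GPU only for its forward pass, evicting it
    afterwards. Peak GPU memory then holds one layer's weights plus
    the activations. Assumes HF-style layers that return a tuple
    whose first element is the hidden states."""
    hidden_states = hidden_states.cuda()
    for layer in layers:
        layer.cuda()
        hidden_states = layer(hidden_states)[0]
        layer.cpu()  # evict this layer's weights before loading the next
        torch.cuda.empty_cache()
    return hidden_states
```

The cost is wall-clock speed: every layer's weights cross the PCIe bus on each forward pass, which is tolerable for a one-off benchmark but far too slow for interactive inference.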

fpgaminer · Mar 28 '23 21:03