LLMLingua
CUDA out of memory
I have four RTX A5000 GPUs with 24GB of memory each, but when I run the example code:
from llmlingua import PromptCompressor
llm_lingua = PromptCompressor("TheBloke/Llama-2-7b-Chat-GPTQ", model_config={"revision": "main"})
I get the error:
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
It does not seem to be able to run on multiple GPUs.
Hi @deltawi, if you use the GPTQ 7B model, it should need less than 8GB of GPU memory, so it fits on a single A5000.
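For example, a minimal sketch that pins the compressor to the first GPU; the device_map value "cuda:0" is an assumption based on the transformers-style device maps the constructor forwards, so check it against your installed version:

from llmlingua import PromptCompressor

# Load the GPTQ model on a single GPU instead of letting it spill
# across devices; "cuda:0" is an assumed device_map value here.
llm_lingua = PromptCompressor(
    "TheBloke/Llama-2-7b-Chat-GPTQ",
    device_map="cuda:0",
    model_config={"revision": "main"},
)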
Additionally, if you need to spread the model across multiple GPUs, you can use the following call:
llm_lingua = PromptCompressor("TheBloke/Llama-2-7b-Chat-GPTQ", device_map="balanced", model_config={"revision": "main"})
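Once the compressor loads, usage is the same as in the single-GPU case. Here is a minimal sketch; note that the compression-ratio argument is named ratio in older LLMLingua releases and rate in newer ones, so adjust it for your installed version:

from llmlingua import PromptCompressor

llm_lingua = PromptCompressor(
    "TheBloke/Llama-2-7b-Chat-GPTQ",
    device_map="balanced",  # shard the model evenly across all visible GPUs
    model_config={"revision": "main"},
)

# Compress a long prompt; the keyword may be "rate" in newer versions.
result = llm_lingua.compress_prompt(
    "Your long context goes here...",
    instruction="Answer the question based on the context.",
    question="What does the context say?",
    ratio=0.5,
)
print(result["compressed_prompt"])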