LLMLingua
CUDA out of memory
I have four RTX A5000 GPUs with 24GB of memory each, but when I run the example code:
from llmlingua import PromptCompressor
llm_lingua = PromptCompressor("TheBloke/Llama-2-7b-Chat-GPTQ", model_config={"revision": "main"})
I get the error:
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
It does not seem to be able to run on multiple GPUs.
Hi @deltawi, if you use the GPTQ 7B model, it should need less than 8GB of GPU memory, so it fits on a single A5000.
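For example, a minimal sketch that pins the compressor to the first GPU; the device_map value "cuda:0" is an assumption based on the transformers-style device maps the constructor forwards, so check it against your installed version:

from llmlingua import PromptCompressor

# Load the GPTQ model on a single GPU instead of letting it spill
# across devices; "cuda:0" is an assumed device_map value here.
llm_lingua = PromptCompressor(
    "TheBloke/Llama-2-7b-Chat-GPTQ",
    device_map="cuda:0",
    model_config={"revision": "main"},
)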
Additionally, if you need to spread the model across multiple GPUs, you can use the following call:
llm_lingua = PromptCompressor("TheBloke/Llama-2-7b-Chat-GPTQ", device_map="balanced", model_config={"revision": "main"})
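Once the compressor loads, usage is the same as in the single-GPU case. Here is a minimal sketch; note that the compression-ratio argument is named ratio in older LLMLingua releases and rate in newer ones, so adjust it for your installed version:

from llmlingua import PromptCompressor

llm_lingua = PromptCompressor(
    "TheBloke/Llama-2-7b-Chat-GPTQ",
    device_map="balanced",  # shard the model evenly across all visible GPUs
    model_config={"revision": "main"},
)

# Compress a long prompt; the keyword may be "rate" in newer versions.
result = llm_lingua.compress_prompt(
    "Your long context goes here...",
    instruction="Answer the question based on the context.",
    question="What does the context say?",
    ratio=0.5,
)
print(result["compressed_prompt"])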