
Unable to run inference on multiple GPUs

Open · ankitphogat opened this issue 2 years ago · 2 comments

When I try to run inference (using the generate.py file), I am unable to do so on multiple GPUs. The 7B model works fine on one GPU, but the same model doesn't run when I set `devices=4`, which is strange (unless I'm doing something wrong; I just set `devices=4` followed by `fabric.launch()`). Ultimately I want to run bigger models using all GPUs, but I can't even get the smallest one running, even though it works on a single GPU. My rig has 4 NVIDIA A10G GPUs, each with 23028 MiB of memory (as per nvidia-smi). The error I get is the following:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.55 GiB (GPU 2; 22.02 GiB total capacity; 12.55 GiB already allocated; 8.76 GiB free; 12.55 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

The error is repeated four times, I assume once for each GPU.
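For reference, here is a minimal sketch of the change described above (the generic Fabric pattern, not the exact generate.py code). With the default multi-device strategy, each of the four processes loads its own full copy of the model, so per-GPU memory use does not go down:

```python
# Minimal sketch of the change described above (generic Fabric pattern, not the exact generate.py code).
from lightning.fabric import Fabric

fabric = Fabric(devices=4)  # default multi-GPU strategy launches one process per device
fabric.launch()
# Each process then loads the full model onto its own GPU, so the memory needed
# per GPU is the same as in the single-GPU case -- hence the OOM being reported
# once per device.
```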

ankitphogat · May 31 '23 14:05

To run larger models split across devices, the generation script needs support for a technique like FSDP. We'll be implementing this soon.
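For anyone who wants to experiment before that lands, here is a rough, untested sketch of what FSDP sharding via Fabric could look like. Note that `build_model` is a placeholder for constructing the LLaMA module and loading the checkpoint, not a lit-llama function:

```python
# Rough sketch only, not the lit-llama implementation; `build_model` is a placeholder.
from lightning.fabric import Fabric
from lightning.fabric.strategies import FSDPStrategy

strategy = FSDPStrategy()  # shards parameters across devices instead of replicating them
fabric = Fabric(devices=4, strategy=strategy, precision="bf16-true")
fabric.launch()

model = build_model()                # placeholder: construct LLaMA and load the checkpoint
model = fabric.setup_module(model)   # wraps the model in FSDP and shards its parameters
model.eval()
```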

In the meantime, you can use quantization to run larger models: https://github.com/Lightning-AI/lit-llama/blob/main/howto/inference.md#run-lit-llama-on-consumer-devices
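If memory serves, the command from that howto looks roughly like this (verify the exact flag name and values against the linked doc):

```sh
# From the linked howto (flag values may differ across versions):
python generate.py --quantize llm.int8 --prompt "Hello, my name is"
```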

carmocca · Jun 02 '23 19:06

Any update on this?

alexgshaw · Jun 26 '23 23:06