Unable to run inference on multiple GPUs
When I try to run inference (using the `generate.py` file), I am unable to do so on multiple GPUs. I tried the 7B model: it works fine on one GPU, but the same model doesn't run when I set `devices=4`, which is strange (unless I'm doing something wrong; I just set `devices=4` followed by `fabric.launch()`). Ultimately I want to run bigger models using all GPUs, but I can't even get the smallest one running this way (it runs fine on a single GPU). My rig has 4 NVIDIA A10G GPUs, each with 23028 MiB of memory (as per `nvidia-smi`). The error I get is the following:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.55 GiB (GPU 2; 22.02 GiB total capacity; 12.55 GiB already allocated; 8.76 GiB free; 12.55 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

The error is repeated 4 times, I assume once for each GPU.
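This is expected with the current setup. A minimal sketch of the launch pattern you describe (the dummy `nn.Sequential` here just stands in for the checkpoint loading, it's not the actual script):

```python
import torch
import lightning as L

# Sketch of the multi-GPU launch described above (assumed, not the exact script).
fabric = L.Fabric(devices=4)
fabric.launch()

# With Fabric's default multi-device strategy (DDP), all 4 processes execute this
# code and each one materializes a FULL copy of the weights on its own GPU.
# devices=4 therefore multiplies memory use instead of splitting the model across
# GPUs, which matches the identical OOM appearing once per device.
model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(16)])
model = fabric.setup_module(model)
```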
To run larger models split across devices, the generation script needs support for a technique like FSDP (fully sharded data parallel). We'll be implementing this soon.
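For illustration, a rough sketch of what that could look like with Fabric's `FSDPStrategy` (this is not the actual lit-llama integration; a real version would also need an auto-wrap policy and sharded checkpoint loading):

```python
import torch
import lightning as L
from lightning.fabric.strategies import FSDPStrategy

# Sketch only: FSDP shards parameters across the 4 GPUs instead of keeping a
# full replica on each, so models larger than one GPU's memory become feasible.
fabric = L.Fabric(devices=4, strategy=FSDPStrategy())
fabric.launch()

# Dummy stand-in for the LLaMA model; under FSDP each rank holds only its shard.
model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(16)])
model = fabric.setup_module(model)
```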
In the meantime, you can use quantization to run larger models: https://github.com/Lightning-AI/lit-llama/blob/main/howto/inference.md#run-lit-llama-on-consumer-devices
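The invocation is roughly along these lines (double-check the flag names against the howto for your checkout):

```sh
# int8 quantization roughly halves the weight memory, so the 7B model should
# fit comfortably on a single A10G.
python generate.py --quantize llm.int8 --prompt "Hello, my name is"
```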
Any update on this?