
Multiple GPUs for full finetune

qiqiApink opened this issue on Aug 10 '23 · 3 comments

I want to run full.py on multiple GPUs, but only one GPU is used.

Using bfloat16 Automatic Mixed Precision (AMP)
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

[rank: 0] Global seed set to 1337

Can you help me solve this?

qiqiApink · Aug 10 '23 03:08

I don't have a good explanation, but maybe you accidentally set devices = 1 here?

[Screenshot (Aug 10, 2023) showing the devices setting in the finetune script]

rasbt · Aug 10 '23 16:08
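For context, here is a minimal sketch of the kind of Fabric setup the screenshot refers to, assuming the usual structure of finetune/full.py in this repo (exact variable names and defaults may differ):

import lightning as L

# Module-level setting near the top of the finetune script;
# leaving this at 1 would launch only a single process/GPU.
devices = 4

fabric = L.Fabric(accelerator="cuda", devices=devices, precision="bf16-mixed")
fabric.launch()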

No, all the settings are right. By the way, I use SLURM to run the code. Could that be the problem?

qiqiApink · Aug 11 '23 02:08

There might be a SLURM problem (not a Lit-LLaMA-specific one) with requesting the GPUs. You could add the following PyTorch code at the top of the script to check whether the machine actually has multiple GPUs that PyTorch can use:

import torch

# Count the CUDA devices that PyTorch can see in this job
num_gpus = torch.cuda.device_count()
print("Number of GPUs available:", num_gpus)

rasbt · Aug 11 '23 12:08
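If PyTorch does report multiple GPUs, it can also help to check what SLURM actually allocated to the job. Below is a small sketch that prints standard SLURM environment variables; which of them are set depends on the cluster configuration:

import os

# Print the SLURM allocation details visible to this process
for var in ("SLURM_NTASKS", "SLURM_NTASKS_PER_NODE", "SLURM_GPUS_ON_NODE", "CUDA_VISIBLE_DEVICES"):
    print(var, "=", os.environ.get(var, "<not set>"))

If only one GPU shows up there, the fix is usually in the SLURM submission itself, for example requesting GPUs with --gres=gpu:<N> (or --gpus-per-node) and a matching --ntasks-per-node, rather than in the Lit-LLaMA code.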