
70B Fine-tuning GPU Utilization

Open fabiogeraci opened this issue 1 year ago • 4 comments

OpenMPI launch script (CLI):
mpirun \
    -np $TOTAL_NUM_GPUS \
    -H $MPI_HOST_STRING \
    -x PATH \
    -bind-to none \
    -map-by slot \
    --mca pml ob1 --mca btl ^openib \
    --display-allocation \
    --display-map \
    python3 src/full_finetune_distributed.py \
    --config config_files/8B_full_distributed.yaml \
    optimizer_in_bwd=False
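
Note that mpirun launches plain python3 rather than torchrun, so the recipe has to pick up its distributed settings from the environment. Below is a minimal sketch (not part of the original script) of how the OpenMPI-provided variables could be mapped onto the names torch.distributed and the recipe read; NUM_NODES, MASTER_ADDR, and MASTER_PORT are assumed to be exported separately by the job script.

import os

# Map OpenMPI's rank/size variables onto the names torch.distributed reads
os.environ.setdefault("RANK", os.environ["OMPI_COMM_WORLD_RANK"])
os.environ.setdefault("WORLD_SIZE", os.environ["OMPI_COMM_WORLD_SIZE"])
os.environ.setdefault("LOCAL_RANK", os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])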

full_finetune_distributed.py

# Build a 2D (dp, tp) device mesh only when running on more than one node
if int(os.environ.get("NUM_NODES")) > 1:
    from torch.distributed._tensor import init_device_mesh

    mesh_2d = init_device_mesh(
        "cuda",
        mesh_shape=(int(os.environ.get("NUM_NODES")),
                    int(os.environ["WORLD_SIZE"]) // 2),
        mesh_dim_names=("dp", "tp"),
    )
else:
    mesh_2d = None

training.shard_model(
    model=model,
    shard_conditions=fsdp_shard_conditions,
    cpu_offload=fsdp_cpu_offload,
    reshard_after_forward=reshard_after_forward,
    mesh=mesh_2d,
)
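
For reference, here is a minimal sketch (assuming NUM_NODES and WORLD_SIZE are set by the launcher) that derives the per-node dimension from both variables instead of hard-coding the division by 2; on this 2 x 8 setup it produces the same (2, 8) mesh, but it also generalizes to other node counts.

import os
from torch.distributed.device_mesh import init_device_mesh

num_nodes = int(os.environ["NUM_NODES"])
world_size = int(os.environ["WORLD_SIZE"])
gpus_per_node = world_size // num_nodes  # 16 // 2 = 8 here

# "dp" spans nodes, "tp" spans the GPUs inside a node
mesh_2d = init_device_mesh(
    "cuda",
    mesh_shape=(num_nodes, gpus_per_node),
    mesh_dim_names=("dp", "tp"),
)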

_distributed.py

def shard_model(
    model: TransformerDecoder,
    shard_conditions: List[Callable[[str, nn.Module], bool]],
    *,
    cpu_offload: bool,
    reshard_after_forward: bool = True,
    mesh: Optional[DeviceMesh] = None,  # <-- Add this line
) -> None:
    ...
    if mesh is not None:  # <-- Add this line
        fsdp_kwargs["mesh"] = mesh  # <-- Add this line
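
For context, a minimal sketch (not the exact torchtune code) of how the added mesh kwarg could flow into FSDP2's fully_shard, assuming shard_model collects its sharding options into an fsdp_kwargs dict and applies fully_shard to every module matched by shard_conditions:

from torch.distributed._composable.fsdp import CPUOffloadPolicy, fully_shard

fsdp_kwargs = {"reshard_after_forward": reshard_after_forward}
if cpu_offload:
    fsdp_kwargs["offload_policy"] = CPUOffloadPolicy()
if mesh is not None:
    fsdp_kwargs["mesh"] = mesh  # shard over the provided 2D mesh

# Shard matching submodules first, then the root model
for name, module in reversed(list(model.named_modules())):
    if any(cond(name, module) for cond in shard_conditions):
        fully_shard(module, **fsdp_kwargs)
fully_shard(model, **fsdp_kwargs)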

Originally posted by @fabiogeraci in https://github.com/pytorch/torchtune/issues/2018#issuecomment-2528157224

fabiogeraci · Dec 10 '24 10:12

I am using the configuration above to fine-tune a 70B model on 2 nodes with 8 GPUs each. The job took 75 minutes to compile (is that usual?)

I also noticed that one of the 16 GPUs was not used at all; I hope the video helps. I also attached the NCCL log: 70b_nccl.txt, Screencast from 10-12-24 09:59:54.webm

The job was killed for the reason below; any suggestions?

# LSBATCH: User input
#BSUB -J gpu-test
#BSUB -o /nfs/users/nfs_f/fg12/scripts/logs/gpu-test_o.%J
#BSUB -e /nfs/users/nfs_f/fg12/scripts/logs/gpu-test_e.%J
#BSUB -n 128
#BSUB -q gpu-parallel
#BSUB -gpu "num=8:gmem=80000:mode=shared:block=yes"
#BSUB -M 768G
#BSUB -R "select[mem>768G] rusage[mem=768G] span[ptile=64]"

TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Exited with signal termination: 9.

Resource usage summary:

    CPU time :                                   69121.00 sec.
    Max Memory :                                 793870 MB
    Average Memory :                             398370.69 MB
    Total Requested Memory :                     1572864.00 MB
    Delta Memory :                               778994.00 MB
    Max Swap :                                   -
    Max Processes :                              559
    Max Threads :                                5356
    Run time :                                   6266 sec.
    Turnaround time :                            6269 sec.

70b_config.txt

fabiogeraci · Dec 10 '24 10:12

Thanks for the report! Based on your config and the setup you have, I don't see immediately why this would hit your specified memory limit of 768G. Let me get ahold of a multi-node setup today and test this out.
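
In the meantime, here is a small diagnostic sketch (a hypothetical helper, not part of the recipe) that could be called around checkpoint loading and model sharding to log per-rank host memory and narrow down which processes push the node toward the 768G limit, assuming psutil is available on the nodes:

import os
import psutil

def log_host_rss(tag: str) -> None:
    # Resident set size of this process, in GB
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3
    rank = os.environ.get("RANK", "?")
    print(f"[rank {rank}] {tag}: host RSS = {rss_gb:.1f} GB", flush=True)

# e.g. log_host_rss("before checkpoint load"); log_host_rss("after shard_model")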

joecummings · Dec 10 '24 11:12

Hey @fabiogeraci, just updating you on this. I'm waiting on a request for multi-node server (PyTorch has limited quantity). If I don't hear back today, I'll just rent one out on Lambda Labs or something.

joecummings · Dec 11 '24 15:12

> Hey @fabiogeraci, just updating you on this. I'm waiting on a request for multi-node server (PyTorch has limited quantity). If I don't hear back today, I'll just rent one out on Lambda Labs or something.

Thank you!

fabiogeraci · Dec 11 '24 15:12

@joecummings any progress?

fabiogeraci · Jan 29 '25 12:01

> @joecummings any progress?

Thanks for following up on this! I managed to set up a SLURM cluster to test and have a PR up for review with a mini tutorial as well: #2301

joecummings · Jan 29 '25 17:01

Once it is merged, I will test the new tutorial. Could you let me know when the merge happens, please?

fabiogeraci · Jan 29 '25 21:01

@fabiogeraci The multi node PR has been merged and should be available in nightlies :)

joecummings · Feb 10 '25 18:02