
Hanging with FSDP but not DS

ccruttjr opened this issue on Jan 24, 2024 · 3 comments

System Info

- `Accelerate` version: 0.26.1
- Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
- Python version: 3.11.5
- Numpy version: 1.26.3
- PyTorch version (GPU?): 2.1.2 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 31.28 GB
- GPU type: NVIDIA GeForce RTX 3070 Ti (I have 6)

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

I've included the commands I used to run the code, the logs, the code itself, and the Accelerate YAML configs. The script parses an XML document to build the dataset, then fine-tunes the TinyLlama model on it. I stripped out some unrelated code (including the process_dataset helper; a hypothetical placeholder for it is sketched in the code below).

The hanging seems to occur at the bottom of the script at

    accelerator.unwrap_model(model).save_pretrained(
        args.save_location,
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
        state_dict=accelerator.get_state_dict(model),
    )
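
To narrow down whether the hang is in the collective state-dict gather or in the actual write, here's a minimal diagnostic variant of that block (my own sketch, not what's currently in the script): gather the state dict on every rank first, then pass it to save_pretrained.

    # Hypothetical diagnostic variant of the save block above.
    # accelerator.get_state_dict() is a collective and must run on ALL ranks;
    # splitting it out shows whether the hang is in the gather or in the write.
    state_dict = accelerator.get_state_dict(model)
    accelerator.print("state dict gathered")
    accelerator.unwrap_model(model).save_pretrained(
        args.save_location,
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
        state_dict=state_dict,
    )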

This is what I ran for DeepSpeed and FSDP:

$ NCCL_P2P_DISABLE=1 accelerate launch --config_file accConfigs/ds.yaml finetuneWithAcc.py --batch_size 1 --seed 42 --model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 --save_location saved_ds_1000
$ NCCL_P2P_DISABLE=1 accelerate launch --config_file accConfigs/fsdp.yaml finetuneWithAcc.py --batch_size 1 --seed 42 --model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 --save_location saved_fsdp_1000

This was the log for FSDP before it hung indefinitely. As you can see, only 5 of my 6 GPUs report a total execution time.

Map: 100%|███████████████████| 669/669 [00:00<00:00, 3076.21 examples/s]
saving tokenizer
Map:   0%|                   | 0/669 [00:00<?, ? examples/s]
saved tokenizer
Map: 100%|███████████████████| 669/669 [00:00<00:00, 2960.52 examples/s]
Map: 100%|███████████████████| 669/669 [00:00<00:00, 2176.14 examples/s]
Map: 100%|███████████████████| 669/669 [00:00<00:00, 2010.09 examples/s]
Map: 100%|███████████████████| 669/669 [00:00<00:00, 2243.52 examples/s]
Map: 100%|███████████████████| 669/669 [00:00<00:00, 2061.44 examples/s]
Training: 100%|███████████████████| 15/15 [07:57<00:00, 31.82s/it]
Training: 100%|███████████████████| 15/15 [07:57<00:00, 31.83s/it]
Training: 100%|███████████████████| 15/15 [07:57<00:00, 31.85s/it]
Training: 100%|███████████████████| 15/15 [07:57<00:00, 31.85s/it]
Training: 100%|███████████████████| 15/15 [07:57<00:00, 31.81s/it]
Training: 100%|███████████████████| 15/15 [07:56<00:00, 31.80s/it]
Evaluating: 100%|███████████████████| 2/2 [00:19<00:00,  9.58s/it]
Epoch: 0, Average Training Loss: 4.8521544456481935, Average Evaluation Loss: 2.5526844263076782
Evaluating: 100%|███████████████████| 2/2 [00:19<00:00,  9.58s/it]
Epoch: 0, Average Training Loss: 4.762742805480957, Average Evaluation Loss: 2.8283395767211914
Evaluating: 100%|███████████████████| 2/2 [00:19<00:00,  9.58s/it]
Epoch: 0, Average Training Loss: 6.050344824790955, Average Evaluation Loss: 2.4796172380447388
Evaluating: 100%|███████████████████| 2/2 [00:19<00:00,  9.59s/it]
Epoch: 0, Average Training Loss: 6.8079283237457275, Average Evaluation Loss: 2.7605079412460327
Evaluating: 100%|███████████████████| 2/2 [00:19<00:00,  9.59s/it]
Evaluating: 100%|███████████████████| 2/2 [00:19<00:00,  9.59s/it]
Epoch: 0, Average Training Loss: 4.6800737380981445, Average Evaluation Loss: 2.5303114652633667
Epoch: 0, Average Training Loss: 5.779373407363892, Average Evaluation Loss: 2.2026368379592896
Training: 100%|███████████████████| 15/15 [07:53<00:00, 31.55s/it]
Training: 100%|███████████████████| 15/15 [07:53<00:00, 31.55s/it]
Training: 100%|███████████████████| 15/15 [07:53<00:00, 31.55s/it]
Training: 100%|███████████████████| 15/15 [07:53<00:00, 31.55s/it]
Training: 100%|███████████████████| 15/15 [07:53<00:00, 31.55s/it]
Evaluating: 100%|███████████████████| 2/2 [00:19<00:00,  9.53s/it]
Epoch: 1, Average Training Loss: 1.7872311313947042, Average Evaluation Loss: 1.5572392344474792
Evaluating: 100%|███████████████████| 2/2 [00:19<00:00,  9.52s/it]
Epoch: 1, Average Training Loss: 2.213330316543579, Average Evaluation Loss: 2.1311458349227905
Evaluating: 100%|███████████████████| 2/2 [00:19<00:00,  9.53s/it]
Evaluating: 100%|███████████████████| 2/2 [00:19<00:00,  9.53s/it]
Epoch: 1, Average Training Loss: 1.7207003752390544, Average Evaluation Loss: 1.5601721405982971
Evaluating: 100%|███████████████████| 2/2 [00:19<00:00,  9.54s/it]
Epoch: 1, Average Training Loss: 1.8340094844500223, Average Evaluation Loss: 1.997363269329071
Evaluating: 100%|███████████████████| 2/2 [00:19<00:00,  9.54s/it]
Epoch: 1, Average Training Loss: 1.982068951924642, Average Evaluation Loss: 1.952318787574768
Epoch: 1, Average Training Loss: 1.6115892787774404, Average Evaluation Loss: 2.517484664916992
Training: 100%|███████████████████| 15/15 [07:52<00:00, 31.52s/it]
Training: 100%|███████████████████| 15/15 [07:52<00:00, 31.52s/it]
Training: 100%|███████████████████| 15/15 [07:52<00:00, 31.52s/it]
Training: 100%|███████████████████| 15/15 [07:52<00:00, 31.52s/it]
Training: 100%|███████████████████| 15/15 [07:52<00:00, 31.52s/it]
Training: 100%|███████████████████| 15/15 [07:52<00:00, 31.52s/it]
Evaluating: 100%|███████████████████| 2/2 [00:19<00:00,  9.55s/it]
Epoch: 2, Average Training Loss: 0.9321655054887136, Average Evaluation Loss: 1.508451223373413
Evaluating: 100%|███████████████████| 2/2 [00:19<00:00,  9.56s/it]
Epoch: 2, Average Training Loss: 1.2723221063613892, Average Evaluation Loss: 2.1232460737228394
Evaluating: 100%|███████████████████| 2/2 [00:19<00:00,  9.57s/it]
Epoch: 2, Average Training Loss: 0.8686126252015431, Average Evaluation Loss: 1.4427779912948608
Evaluating: 100%|███████████████████| 2/2 [00:19<00:00,  9.57s/it]
Epoch: 2, Average Training Loss: 1.008501938978831, Average Evaluation Loss: 1.9432660937309265
Evaluating: 100%|███████████████████| 2/2 [00:19<00:00,  9.57s/it]
Epoch: 2, Average Training Loss: 0.9696332494417826, Average Evaluation Loss: 2.0952702164649963
Evaluating: 100%|███████████████████| 2/2 [00:19<00:00,  9.57s/it]
Epoch: 2, Average Training Loss: 0.8216088712215424, Average Evaluation Loss: 2.569663166999817
saving
[2024-01-24 11:25:17,563] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-24 11:25:17,563] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-24 11:25:17,564] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-24 11:25:17,564] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-24 11:25:17,573] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-24 11:25:17,598] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Total Execution Time: 1568.4355781078339 seconds
Total Execution Time: 1568.4076778888702 seconds
Total Execution Time: 1568.4469237327576 seconds
Total Execution Time: 1568.3961865901947 seconds
Total Execution Time: 1568.4128487110138 seconds

For DeepSpeed, at the end, the total execution time reported by the last GPU was about 20 seconds longer than the others. When running glances and watch nvidia-smi now, I see only one process running rather than six, and my first GPU is still the only one being used.
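
Something I could add to the script to see exactly where each rank is stuck (a standard-library sketch, assuming Linux, since registering signal handlers with faulthandler isn't available on Windows): register a handler near the top of finetuneWithAcc.py, then send SIGUSR1 to each hung PID from ps/nvidia-smi to get a per-process Python stack trace.

    import faulthandler
    import signal

    # Dump the Python stack of every thread to stderr when this process
    # receives SIGUSR1 (e.g. `kill -USR1 <pid>` for each hung rank).
    faulthandler.register(signal.SIGUSR1, all_threads=True)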

Here is my code, followed by the ds.yaml and fsdp.yaml

import argparse
import html
import re
import xml.etree.ElementTree as ET
from time import time

import pandas as pd
import torch
from accelerate import Accelerator
from bs4 import BeautifulSoup
from datasets import Dataset
from torch.optim import AdamW
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
    set_seed,
)
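

# NOTE: process_dataset() was part of the code I removed. The placeholder
# below is only a hypothetical stand-in so the snippet is self-contained;
# it assumes the real function returns a pandas DataFrame with a
# "conversation" column, since that is what tokenize_function reads.
def process_dataset(data_location):
    tree = ET.parse(data_location)
    conversations = [html.unescape(elem.text or "")
                     for elem in tree.getroot().iter("conversation")]
    return pd.DataFrame({"conversation": conversations})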


# This allows adjusting training arguments without needing to change the code
def parse_args():
    parser = argparse.ArgumentParser(description="Training script arguments.")
    parser.add_argument("--batch_size", type=int, default=1,
                        help="Batch size for training.")
    parser.add_argument("--mixed_precision", type=str,
                        default="fp16", help="Mixed precision type.")
    parser.add_argument("--lr", type=float, default=5e-5,
                        help="Learning rate.")
    parser.add_argument("--num_epochs", type=int, default=3,
                        help="Number of training epochs.")
    parser.add_argument("--seed", type=int, default=None, help="Random seed.")
    parser.add_argument("--num_warmup_steps", type=int,
                        default=100, help="Number of warm-up steps.")
    parser.add_argument("--num_processes", type=int,
                        default=6, help="Number of gpus to use.")
    parser.add_argument("--model_name", type=str,
                        default="gpt2-xl", help="Model to use.")
    parser.add_argument("--data_location", type=str,
                        default="data/GI_1000.xml", help="File location for data.")
    parser.add_argument("--save_location", type=str,
                        default="saved_1000", help="File location for data.")
    parser.add_argument("--gradient_accumulation_steps",
                        type=int, default=1, help="Gradient accumulation steps.")
    return parser.parse_args()


# 1. Have Transformer's determine the best tokenizer for the given model
# 2. Convert XML to readable dataset. Have the first GPU run it first so multiple GPUs aren't trying to edit the XML at
#    the same time
# 3. Set the max length and padding of each eConsult and how we want to tokenize the dataset
# 4. Split dataset into training dataset and eval 80/20
# 5. Distribute tokenized datasets across multiple GPUs as to not run out of memory
# 6. Create/return dataloader with the given data for the trainer to use
def get_dataloaders(accelerator: Accelerator, batch_size, model_name, data_location, save_location):
    # 1
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    # 2
    with accelerator.main_process_first():
        dataset = Dataset.from_pandas(process_dataset(data_location))

    # 3
    def tokenize_function(examples):
        return tokenizer(examples["conversation"], padding="max_length", truncation=True, max_length=128)

    with accelerator.main_process_first():
        tokenized_dataset = dataset.map(tokenize_function, batched=True)

    tokenized_dataset.set_format(
        "torch", columns=["input_ids", "attention_mask"])

    # 4
    split_datasets = tokenized_dataset.train_test_split(test_size=0.2)
    tokenized_train_dataset = split_datasets["train"]
    tokenized_eval_dataset = split_datasets["test"]

    if accelerator.is_main_process:
        print("saving tokenizer")
        # Saving the tokenizer
        tokenizer.save_pretrained(save_location)
        print("saved tokenizer")

    # 5
    train_sampler = DistributedSampler(
        tokenized_train_dataset, num_replicas=accelerator.num_processes, rank=accelerator.process_index, shuffle=True
    )

    eval_sampler = DistributedSampler(
        tokenized_eval_dataset, num_replicas=accelerator.num_processes, rank=accelerator.process_index, shuffle=False
    )

    # 6
    train_dataloader = DataLoader(
        tokenized_train_dataset,
        batch_size=batch_size,
        drop_last=True,
        sampler=train_sampler
    )

    eval_dataloader = DataLoader(
        tokenized_eval_dataset,
        batch_size=batch_size*2,
        drop_last=(accelerator.mixed_precision == "fp8"),
        sampler=eval_sampler
    )

    return train_dataloader, eval_dataloader


# 1. Initialize accelerator with mixed precision and define training parameters via arguments given in command line
# 2. Set seed (if given as a command line argument) for reproducibility
# 3. Get dataloaders
# 4. Initialize more training parameters and "prepare"/optimize them via Accelerate
# 5. Train/fine-tune model with new data & set parameters using FSDP
# 6. Evaluate quality of trainer for that epoch
# 7. Have the first GPU save the newly fine-tuned dataset
def training_function(args):
    # 1
    accelerator = Accelerator(mixed_precision=args.mixed_precision,
                              gradient_accumulation_steps=args.gradient_accumulation_steps)

    lr = args.lr
    num_epochs = args.num_epochs
    batch_size = args.batch_size
    num_warmup_steps = args.num_warmup_steps

    # 2
    if args.seed is not None:  # explicit None check so seed=0 is still honored
        set_seed(args.seed)

    # 3
    train_dataloader, eval_dataloader = get_dataloaders(
        accelerator, batch_size, args.model_name, args.data_location, args.save_location)

    # 4
    # Instantiate the model (we build the model here so that the seed also controls new weight initialization)
    model = AutoModelForCausalLM.from_pretrained(args.model_name)
    # model = accelerator.prepare(model)

    optimizer = AdamW(params=model.parameters(), lr=lr)

    # Instantiate scheduler
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=(len(train_dataloader) *
                            num_epochs) // args.gradient_accumulation_steps
    )

    # Prepare everything
    # There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
    # prepare method.
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
    )

    # Initialize logging variables
    total_train_loss = 0
    total_eval_loss = 0

    # 5
    # Now we train the model
    for epoch in range(num_epochs):
        model.train()
        total_train_loss = 0
        for batch in tqdm(train_dataloader, desc="Training"):
            with accelerator.accumulate(model):
                # Process the batch
                inputs = {k: v.to(accelerator.device)
                          for k, v in batch.items()}
                if "labels" not in inputs:
                    inputs["labels"] = inputs["input_ids"]

                outputs = model(**inputs)
                loss = outputs.loss
                total_train_loss += loss.item()
                accelerator.backward(loss)
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()

        accelerator.wait_for_everyone()

        # 6
        # Evaluation loop after each training epoch
        model.eval()
        total_eval_loss = 0
        for batch in tqdm(eval_dataloader, "Evaluating"):
            with torch.no_grad():
                inputs = {k: v.to(accelerator.device)
                          for k, v in batch.items()}
                if "labels" not in inputs:
                    inputs["labels"] = inputs["input_ids"]

                outputs = model(**inputs)
                loss = outputs.loss
                total_eval_loss += loss.item()

        # Log the average losses
        avg_train_loss = total_train_loss / len(train_dataloader)
        avg_eval_loss = total_eval_loss / len(eval_dataloader)
        print(
            f"Epoch: {epoch}, Average Training Loss: {avg_train_loss}, Average Evaluation Loss: {avg_eval_loss}")

        accelerator.wait_for_everyone()

    # 7
    accelerator.wait_for_everyone()
    accelerator.print("saving")
    accelerator.unwrap_model(model).save_pretrained(
        args.save_location,
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
        state_dict=accelerator.get_state_dict(model),
    )


def main():
    args = parse_args()
    training_function(args)


if __name__ == "__main__":
    start = time()
    main()
    print(f"Total Execution Time: {time() - start} seconds")

ds.yaml

# 1182.9515426158905 seconds
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: null
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 6
use_cpu: false

fsdp.yaml

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 6
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Expected behavior

My code runs to completion when using DeepSpeed but hangs when saving the fine-tuned model via FSDP. I expect the FSDP run to save the model and exit the same way the DeepSpeed run does.

ccruttjr · Jan 24 '24