
Weights & Biases doesn't report logs from all nodes for distributed training using HuggingFace trainer

Open tnnandi opened this issue 1 year ago • 18 comments

System Info

torch 2.1.2+cu118, transformers 4.39.3, accelerate 0.29.1, deepspeed 0.14.0, wandb 0.16.6, python 3.9.0

Who can help?

@pacman100 @muellerzr

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

By default, simply adding "report_to": "wandb" to training_args (for the HF Trainer) creates plots (e.g., for GPU usage) only for the master node on the wandb GUI. By overriding the default wandb.init() as shown below, I can create one entry per GPU on the wandb GUI, but only the metrics from rank 0 are plotted. For a training run with 2 nodes (each with 4 GPUs), wandb shows only 4 plots for GPU system metrics.

FYI: I'm using DeepSpeed for distributed training.

training_args = {
    "learning_rate": max_lr,
    "do_train": True,
    "do_eval": False,
    "group_by_length": True,
    "length_column_name": "length",
    "disable_tqdm": False,
    # "lr_scheduler_type": lr_schedule_fn,
    # "warmup_steps": warmup_steps,
    "weight_decay": weight_decay,
    "per_device_train_batch_size": geneformer_batch_size,
    "num_train_epochs": epochs,
    "save_strategy": "steps",
    "save_steps": np.floor(num_examples / geneformer_batch_size / 8),  # 8 saves per epoch
    "logging_steps": 1000,
    "output_dir": training_output_dir,
    "logging_dir": logging_dir,
    "log_on_each_node": True,
    "report_to": "wandb",
}

training_args = TrainingArguments(**training_args)

print("Starting training.")

wandb.init(
    project="geneformer_multinode_project",
    name="geneformer_multinode",
    tags=["2_node"],
    group="geneformer_group",
)

trainer = GeneformerPretrainer( model=model, .....

Expected behavior

The Weights & Biases GUI should show system metrics from all GPUs involved in the training, not only from those on the master node. FYI: by logging into the individual nodes, I've confirmed that all of them are indeed being used.

tnnandi avatar Apr 06 '24 20:04 tnnandi

Hi @tnnandi, thanks for opening this issue!

Integrations like W&B are maintained by third-party contributors rather than the transformers team directly.

cc @parambharat, who has recently been working on the W&B integration and the trainer

amyeroberts avatar Apr 08 '24 16:04 amyeroberts

Thanks for the info, @amyeroberts!

@parambharat Could you let me know whether this is a known issue, or whether I'm not using it appropriately?

tnnandi avatar Apr 08 '24 19:04 tnnandi

Hi @tnnandi: Thanks for bringing this to our notice. Let me investigate further and first try to reproduce and understand the issue. Can you please share a full script? Perhaps as a GitHub gist, so that it's easier to reproduce with the same settings?

parambharat avatar Apr 09 '24 05:04 parambharat

Hi @parambharat, please find below the HF Trainer code along with the job submission file (please make the required changes for your environment):

test_wandb.py:

import datetime
import os
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["OMPI_MCA_opal_cuda_support"] = "true"
os.environ["CONDA_OVERRIDE_GLIBC"] = "2.56"
os.environ["HF_HOME"] = "/grand/GeomicVar/tarak/Geneformer_dec_2023/Geneformer/cancer_train/for_hf_debug"
os.environ["TRANSFORMERS_CACHE"] = "/grand/GeomicVar/tarak/Geneformer_dec_2023/Geneformer/cancer_train/for_hf_debug"
os.environ["HF_DATASETS_CACHE"] = "/grand/GeomicVar/tarak/Geneformer_dec_2023/Geneformer/cancer_train/for_hf_debug"

import pickle
import random
import subprocess
import socket
import numpy as np
import pytz
import torch
from datasets import load_from_disk, load_dataset
from transformers import BertConfig, BertTokenizer, AutoTokenizer, BertForMaskedLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from transformers import AutoModelForSequenceClassification
import wandb

dataset = load_dataset("yelp_review_full")
# dataset = load_dataset("yelp_review_full", streaming=True)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(2000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(2000))

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

training_args = TrainingArguments(
                output_dir="./test_trainer",
                num_train_epochs=100,
                report_to="wandb")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
)

trainer.train()

job submission file for a 2 node job, with each node having 4 GPUs (please use your wandb key appropriately):

#!/bin/bash -l
#PBS -A GeomicVar
#PBS -l walltime=00:40:00
#PBS -l filesystems=grand
#PBS -l select=2:ngpus=4:gputype=A100:system=polaris
##PBS -q preemptable
#PBS -q debug-scaling
##PBS -q debug
#PBS -N geneformer

cd ${PBS_O_WORKDIR}
echo ${PBS_O_WORKDIR}

module load conda
conda activate /lus/grand/projects/GeomicVar/tarak/Geneformer_dec_2023/Geneformer/geneformer_env

DS_HOSTFILE="./hostfile"
DS_ENVFILE="./.deepspeed_env"

NRANKS=$(wc -l < "${PBS_NODEFILE}")
NGPU_PER_RANK=$(nvidia-smi -L | wc -l)
NGPUS="$((${NRANKS}*${NGPU_PER_RANK}))"
echo "NRANKS: ${NRANKS}, NGPU_PER_RANK: ${NGPU_PER_RANK}, NGPUS: ${NGPUS}"

cat "${PBS_NODEFILE}" > "${DS_HOSTFILE}"
sed -e 's/$/ slots=4/' -i "${DS_HOSTFILE}"

echo "Writing environment variables to: ${DS_ENVFILE}"

echo "PATH=${PATH}" > "${DS_ENVFILE}"
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}" >> "${DS_ENVFILE}"
echo "https_proxy=${https_proxy}" >> "${DS_ENVFILE}"
echo "http_proxy=${http_proxy}" >> "${DS_ENVFILE}"

deepspeed \
  --hostfile="${DS_HOSTFILE}" \
   test_wandb.py \
    --deepspeed 2>&1 | tee log_cancer

On the wandb GUI, I can only see GPU usage plots for GPU IDs 0, 1, 2, 3. My understanding is that this happens because wandb picks up the local ranks of the GPUs on both nodes (instead of their global ranks) as assigned by DeepSpeed, so the logging data wandb uses to plot GPU IDs 0-3 from one node is overwritten by the data from the other node.

Also, we should ensure that the training metrics are not tied to only one node and instead represent the global metrics.
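
For reference, a minimal sketch that prints which ranks a process sees (assuming the launcher exports the usual LOCAL_RANK, RANK, and WORLD_SIZE variables, as the DeepSpeed/torch.distributed launchers typically do):

import os
import socket

local_rank = int(os.environ.get("LOCAL_RANK", 0))   # per-node index: 0-3 on a 4-GPU node
global_rank = int(os.environ.get("RANK", 0))        # global index: 0-7 for 2 nodes x 4 GPUs
world_size = int(os.environ.get("WORLD_SIZE", 1))   # total number of processes

print(f"host={socket.gethostname()} local_rank={local_rank} "
      f"global_rank={global_rank} world_size={world_size}")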

Thanks!

tnnandi avatar Apr 09 '24 19:04 tnnandi

Hi @parambharat, please let me know if you could reproduce this issue at your end. Feel free to ask for clarifications!

tnnandi avatar Apr 11 '24 00:04 tnnandi

Hi @tnnandi, I've been having some difficulty getting a multi-node, multi-GPU machine. Please give me a day or two to investigate. I'll keep you posted.

parambharat avatar Apr 11 '24 05:04 parambharat

Thanks for the heads up, @parambharat, your effort is very much appreciated.

tnnandi avatar Apr 11 '24 05:04 tnnandi

Hello!

Can you share the W&B workspace with some context on the runs? When I create one process per node, what I do most of the time is use the group argument to group the runs and name each run with its global rank ID (as per the suggestion from EleutherAI), so that I can identify them quickly. Something like this:

[screenshot]

That being said, would you like a single plot with all the GPUs? With this approach you get the metrics per run (so per node).
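
In code, that pattern looks roughly like this (a minimal sketch; the project and group names are illustrative, and global_rank is assumed to come from the launcher's environment):

import os

import wandb

# Global rank as exported by the launcher (torch.distributed / DeepSpeed convention).
global_rank = int(os.environ.get("RANK", 0))

wandb.init(
    project="my_project",        # illustrative project name
    group="2node-experiment",    # one group per experiment
    name=f"rank-{global_rank}",  # one run per process, named by its global rank
)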

Another thing that could be happening is that, since only global_rank 0 logs metrics, the other ranks don't show system metrics in the workspace, but they should be available in the System tab:

[screenshot]

tcapelle avatar Apr 11 '24 06:04 tcapelle

Hi @tcapelle, thank you for the response. Here's a screenshot of GPU usage from one of my runs (which uses 2 nodes, with 4 GPUs on each):

[screenshot]

I'd appreciate it if you could tell me how to give names to the runs based on their global rank (or give different entries on wandb for jobs on different nodes). Sorry, I didn't get the EleutherAI reference. The updated code for a run where I (i) use dataset streaming for a faster start time, and (ii) try to name runs based on the node ID (but only one name appears on the wandb portal) is as follows:


import datetime
import os
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["OMPI_MCA_opal_cuda_support"] = "true"
os.environ["CONDA_OVERRIDE_GLIBC"] = "2.56"
os.environ["HF_HOME"] = "/grand/GeomicVar/tarak/Geneformer_dec_2023/Geneformer/cancer_train/for_hf_debug"
os.environ["TRANSFORMERS_CACHE"] = "/grand/GeomicVar/tarak/Geneformer_dec_2023/Geneformer/cancer_train/for_hf_debug"
os.environ["HF_DATASETS_CACHE"] = "/grand/GeomicVar/tarak/Geneformer_dec_2023/Geneformer/cancer_train/for_hf_debug"

import pickle
import random
import subprocess
import socket
import numpy as np
import pytz
import torch
from datasets import load_from_disk, load_dataset
from transformers import BertConfig, BertTokenizer, AutoTokenizer, BertForMaskedLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from transformers import AutoModelForSequenceClassification
import wandb

# dataset = load_dataset("yelp_review_full", split="train[:30%]")
dataset = load_dataset("yelp_review_full", streaming=True)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42)# .select(range(2000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42)# .select(range(2000))

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

node_id = os.getenv("NODE_ID", socket.gethostname())
print("^^^^^^^^^^^^^^^^^^^^^^^^^^^ node_id ^^^^^^^^^^^^^^^^^^^^^^^^^^^", node_id)
run_group = "test_wandb_hftrainer"
run_name = f"node_{node_id}"  # unique run name for each node
run_id = f"{run_group}_{run_name}"

os.environ["WANDB_RUN_GROUP"] = run_group
os.environ["WANDB_RUN_ID"] = run_id
os.environ["WANDB_NAME"] = run_name

training_args = TrainingArguments(
                output_dir="./test_trainer",
                num_train_epochs=100,
                max_steps=10000,
                report_to="wandb")

# wandb.init(
#     project="test_2node_wandb_project",
#     name="wandb_2node",
#     tags=["2_node"],
#     group="wandb_2node_group",
# )

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
)

trainer.train()

tnnandi avatar Apr 11 '24 08:04 tnnandi

@tcapelle Using the following code creates 8 different instances for the job on wandb (note the use of wandb.init()), but each instance contains plots from 4 GPUs (each ranked 0-3):

import datetime
import os
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["OMPI_MCA_opal_cuda_support"] = "true"
os.environ["CONDA_OVERRIDE_GLIBC"] = "2.56"
os.environ["HF_HOME"] = "/grand/GeomicVar/tarak/Geneformer_dec_2023/Geneformer/cancer_train/for_hf_debug"
os.environ["TRANSFORMERS_CACHE"] = "/grand/GeomicVar/tarak/Geneformer_dec_2023/Geneformer/cancer_train/for_hf_debug"
os.environ["HF_DATASETS_CACHE"] = "/grand/GeomicVar/tarak/Geneformer_dec_2023/Geneformer/cancer_train/for_hf_debug"

import pickle
import random
import subprocess
import socket
import numpy as np
import pytz
import torch
from datasets import load_from_disk, load_dataset
from transformers import BertConfig, BertTokenizer, AutoTokenizer, BertForMaskedLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from transformers import AutoModelForSequenceClassification
import wandb

# dataset = load_dataset("yelp_review_full", split="train[:30%]")
dataset = load_dataset("yelp_review_full", streaming=True)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42)# .select(range(2000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42)# .select(range(2000))

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

training_args = TrainingArguments(
                output_dir="./test_trainer",
                num_train_epochs=100,
                max_steps=10000,
                report_to="wandb")

wandb.init(
    project="test_2node_wandb_project",
    name="wandb_2node",
    tags=["2_node"],
    group="wandb_2node_group",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
)

trainer.train()

[screenshot]

From "Overview" I see commands like "/lus/grand/projects/GeomicVar/tarak/Geneformer_dec_2023/Geneformer/cancer_train/for_hf_debug/test_wandb.py --local_rank=1 --deepspeed", /lus/grand/projects/GeomicVar/tarak/Geneformer_dec_2023/Geneformer/cancer_train/for_hf_debug/test_wandb.py --local_rank=3 --deepspeed for each of the jobs.

Another observation: If I turn off the run from the master node on the wandb portal (identified from /lus/grand/projects/GeomicVar/tarak/Geneformer_dec_2023/Geneformer/cancer_train/for_hf_debug/test_wandb.py --local_rank=0 --deepspeed), no figures are shown:

[screenshot]

Ideally, I would love to have plots for all 8 GPUs on a single figure.

tnnandi avatar Apr 11 '24 08:04 tnnandi

Cool, thanks for the heads up. You will need to create the runs manually, with the rank in the run name. I would do something like:

wandb.init(
    ...
    name=f"node_{global_rank}_local_rank_{local_rank}",
    group=f"node_{global_rank}",
)

I think accelerate has a built-in feature to query the ranks (I assume you are using accelerate launch to start training).
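
For reference, a minimal sketch of querying the ranks through accelerate (these attributes live on the Accelerator object):

from accelerate import Accelerator

accelerator = Accelerator()

# process_index is the global rank, local_process_index the per-node rank.
print(
    f"global rank {accelerator.process_index}/{accelerator.num_processes}, "
    f"local rank {accelerator.local_process_index}, "
    f"is_main_process={accelerator.is_main_process}, "
    f"is_local_main_process={accelerator.is_local_main_process}"
)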

Regarding getting multiple GPUs on the same panel, let me ask about that.

tcapelle avatar Apr 11 '24 08:04 tcapelle

please make your wandb project public so I can inspect...

tcapelle avatar Apr 11 '24 08:04 tcapelle

@tcapelle Here it is https://wandb.ai/tnnandi/test_2node_wandb_project?nw=nwusertnnandi

tnnandi avatar Apr 11 '24 08:04 tnnandi

Can I ask you to delete the non-relevant runs? How many nodes are you running?

tcapelle avatar Apr 11 '24 08:04 tcapelle

Just deleted the 3 old runs. You'll find 8 instances now. This is for a 2-node job (each node having 4 GPUs). I'm using DeepSpeed for distributed training.

tnnandi avatar Apr 11 '24 08:04 tnnandi

Cool, so there is one run per GPU; you shouldn't need that. One process per node should suffice. What you can do is wrap your init call with:


local_rank = int(os.environ['LOCAL_RANK'])
global_rank = int(os.environ['RANK'])

if local_rank == 0:
    wandb.init(
        project="cool_project",
        name=f"node_{global_rank}",
        group="my_fancy_experiment",
    )

or the accelerate equivalent:


accelerator = Accelerator()

if accelerator.is_local_main_process:
    wandb.init(
        project="cool_project",
        name=f"process_{accelerator.process_index}",
        group="my_fancy_experiment",
    )

Right now, the accelerate integration only lets you log metrics on the "main_process" (rank 0 of the main node). This is something that may need to change for setups where people are interested in gathering metrics across nodes; maybe this could be a parameter on the training args that lets you define a logging strategy: "main", "local_main", "all".
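
A rough sketch of how such a logging-strategy switch could gate the init (the log_strategy values here are the hypothetical options named above, not an existing TrainingArguments parameter):

import wandb
from accelerate import Accelerator

accelerator = Accelerator()
log_strategy = "local_main"  # hypothetical: "main", "local_main", or "all"

should_log = (
    (log_strategy == "main" and accelerator.is_main_process)
    or (log_strategy == "local_main" and accelerator.is_local_main_process)
    or log_strategy == "all"
)

if should_log:
    wandb.init(
        project="cool_project",
        name=f"process_{accelerator.process_index}",
        group="my_fancy_experiment",
    )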

tcapelle avatar Apr 11 '24 09:04 tcapelle

There is currently a bug that hides the system metrics on processes that don't log any metrics (this is the case for your non-main processes). The workaround is to log a dummy metric:

accelerator = Accelerator()

if accelerator.is_local_main_process:
    wandb.init(
        project="cool_project",
        name=f"process_{accelerator.process_index}",
        group="my_fancy_experiment",
    )
    wandb.log({"ping": 1})

tcapelle avatar Apr 11 '24 09:04 tcapelle

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 07 '24 08:05 github-actions[bot]