
InitProcessGroupKwargs(timeout=timedelta(seconds=3600)) does not work

Open · bestpredicts opened this issue 1 year ago · 18 comments

System Info

ubuntu 20.04
cuda 11.7
torch 2.0
python 3.8
accelerate 0.19.0.dev0
deepspeed 0.9.2

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

I used my own dataset with "run_clm_no_trainer.py" and hit a timeout. I then ran the following code to set a longer timeout, but the error message still reports the default 30 minutes instead of the value I set.

The code looks like this:


    from datetime import timedelta
    from accelerate import Accelerator, InitProcessGroupKwargs

    kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=96000))
    accelerator = (
        Accelerator(kwargs_handlers=[kwargs], log_with=args.report_to,
                    logging_dir=args.output_dir) if args.with_tracking else Accelerator(kwargs_handlers=[kwargs])
    )


The timeout error looks like this:

accelerate key: store_based_barrier_key:2 (world_size=8, worker_count=1, timeout=0:30:00)

It seems that passing `kwargs_handlers=[kwargs]` does not override the default timeout.
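
For what it's worth, the handler itself does carry the requested value; a quick sanity check (a minimal sketch, where `to_kwargs()` is assumed to report the handler's non-default fields) prints the expected timedelta, which suggests the value is being lost further down in process-group initialization rather than in the handler:

    from datetime import timedelta
    from accelerate import InitProcessGroupKwargs

    kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=96000))
    print(kwargs.timeout)      # 1 day, 2:40:00
    print(kwargs.to_kwargs())  # expected (assumption): {'timeout': datetime.timedelta(days=1, seconds=9600)}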

Expected behavior

The timeout passed via `InitProcessGroupKwargs` (96000 seconds here) should be applied when the process group is initialized, instead of the default 30 minutes.

bestpredicts avatar May 09 '23 15:05 bestpredicts

cc @muellerzr

sgugger avatar May 09 '23 15:05 sgugger

Please give us the output of accelerate env and how you are creating your DataLoaders and Dataset (rough code will work)

muellerzr avatar May 09 '23 16:05 muellerzr

> Please give us the output of accelerate env and how you are creating your DataLoaders and Dataset (rough code will work)

The key issue is not the dataset or the dataloader; it is that my change to the timeout does not take effect. You can reproduce it by adding a `time.sleep` of more than 30 minutes, with DeepSpeed ZeRO stage 3.

#!/bin/bash
OUTPUT=$1
if [ "$OUTPUT" == "" ]; then
    OUTPUT=./output_belle1b_accelerate
fi
mkdir -p $OUTPUT

nohup accelerate  launch  --config_file=config/default_config_single_machine.yaml   train/train_sft_accelerate.py \
--train_file=data/train.json \
--model_name_or_path=/data/belle-1b-260b  \
--output_dir=$OUTPUT \
--max_length=1024 \
--num_train_epochs=5 \
--learning_rate=1e-5 \
--per_device_train_batch_size=7 \
--per_device_eval_batch_size=7 \
--eval_step=10 \
--checkpointing_steps="epoch"  > $OUTPUT/train.log 2>&1 &

bestpredicts avatar May 09 '23 21:05 bestpredicts

You can reproduce my issue as follows: my dataset tokenizes all the data during loading, which takes longer than 30 minutes. I set the timeout to more than 30 minutes, but the timeout error still reports 30 minutes. This is with DeepSpeed ZeRO stage 3.
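
A minimal sketch of that kind of reproduction (the script name is hypothetical, the `time.sleep` stands in for the long tokenization step, and DeepSpeed ZeRO stage 3 is assumed in the launch config):

    # repro_timeout.py -- hypothetical name, minimal sketch of the reported setup
    import time
    from datetime import timedelta

    from accelerate import Accelerator, InitProcessGroupKwargs

    # request a 96000-second timeout instead of the default 30 minutes
    kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=96000))
    accelerator = Accelerator(kwargs_handlers=[kwargs])

    # simulate dataset tokenization that takes longer than 30 minutes on the
    # main process; the other ranks wait and should only time out after the
    # requested 96000 seconds, not after 30 minutes
    with accelerator.main_process_first():
        if accelerator.is_main_process:
            time.sleep(31 * 60)

    accelerator.wait_for_everyone()
    accelerator.print("all ranks passed the barrier")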

bestpredicts avatar May 09 '23 21:05 bestpredicts

@bestpredicts please run accelerate env in your CLI and give us your output to help us further with debugging, so we can reproduce your entire setup including configuration.

muellerzr avatar May 10 '23 06:05 muellerzr


- `Accelerate` version: 0.19.0.dev0
- Platform: Linux-4.19.96-x86_64-with-glibc2.10
- Python version: 3.8.13
- Numpy version: 1.22.4
- PyTorch version (GPU?): 2.0.0+cu117 (True)
- System RAM: 503.82 GB
- GPU type: NVIDIA A100-SXM4-40GB
- `Accelerate` default config:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false

bestpredicts avatar May 10 '23 08:05 bestpredicts

@muellerzr

bestpredicts avatar May 10 '23 08:05 bestpredicts

How are you creating your Accelerator object and the Dataset? Is it an IterableDataset?

muellerzr avatar May 10 '23 11:05 muellerzr

> How are you creating your Accelerator object and the Dataset? Is it an IterableDataset?

import pandas as pd

class customer_dataset:
    def __init__(self, df):
        self.df = pd.read_csv(df)
        self.text = self.df['text'].tolist()
        # tokenizer is a pre-loaded transformers tokenizer; tokenizing all of the
        # data up front here takes over an hour
        self.all_data = tokenizer(self.text)
    def __len__(self):
        return len(self.text)
    def __getitem__(self, idx):
        return self.all_data[idx]
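
(The DataLoader creation was not posted; a typical wiring for this class, sketched under the assumption that the batch size from the launch script and `default_data_collator` are used, might look like the following.)

    from torch.utils.data import DataLoader
    from transformers import default_data_collator

    # hypothetical wiring -- not part of the original report
    train_dataset = customer_dataset("data/train.csv")   # path is a placeholder
    train_dataloader = DataLoader(
        train_dataset,
        batch_size=7,                      # matches --per_device_train_batch_size above
        shuffle=True,
        collate_fn=default_data_collator,  # assumed, as in run_clm_no_trainer.py
    )
    train_dataloader = accelerator.prepare(train_dataloader)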

bestpredicts avatar May 10 '23 11:05 bestpredicts

It seems to be the same issue as #1129?

bestpredicts avatar May 10 '23 12:05 bestpredicts

Looks to be so

muellerzr avatar May 10 '23 12:05 muellerzr

CC @pacman100 to verify however

muellerzr avatar May 10 '23 12:05 muellerzr

Hello @bestpredicts, because the config has `zero3_init_flag` set to `true`, DeepSpeed ends up using its default timeout. You have linked the correct issue for this (#1129).
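
One hedged workaround sketch (not verified in this thread): since the timeout is consumed on the DeepSpeed side in this path, initializing the process group via `deepspeed.init_distributed`, which accepts a `timeout` argument, before constructing the `Accelerator` might apply the longer timeout; whether Accelerate then reuses that already-initialized group is an assumption here.

    # hypothetical workaround sketch -- not confirmed in this thread
    from datetime import timedelta

    import deepspeed
    from accelerate import Accelerator

    # initialize the distributed backend with an explicit timeout before
    # Accelerator() so the DeepSpeed init path does not fall back to its default
    deepspeed.init_distributed(dist_backend="nccl", timeout=timedelta(seconds=96000))
    accelerator = Accelerator()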

pacman100 avatar May 10 '23 13:05 pacman100

> Hello @bestpredicts, because the config has `zero3_init_flag` set to `true`, DeepSpeed ends up using its default timeout. You have linked the correct issue for this (#1129).

Setting `zero3_init_flag: false` still gives the same error:

05/10/2023 23:07:45 - INFO - torch.distributed.distributed_c10d - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:2 (world_size=8, worker_count=1, timeout=0:30:00)
05/10/2023 23:07:55 - INFO - torch.distributed.distributed_c10d - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:2 (world_size=8, worker_count=1, timeout=0:30:00)
05/10/2023 23:08:05 - INFO - torch.distributed.distributed_c10d - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:2 (world_size=8, worker_count=1, timeout=0:30:00)
05/10/2023 23:08:15 - INFO - torch.distributed.distributed_c10d - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:2 (world_size=8, worker_count=1, timeout=0:30:00)
05/10/2023 23:08:26 - INFO - torch.distributed.distributed_c10d - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:2 (world_size=8, worker_count=1, timeout=0:30:00)

bestpredicts avatar May 10 '23 23:05 bestpredicts

Code:

from accelerate import Accelerator, InitProcessGroupKwargs
import torch.distributed as dist
from datetime import datetime, timedelta

kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=96000))
accelerator = Accelerator(kwargs_handlers=[kwargs])

accelerate env:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero3_save_16bit_model: false
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Command to run:

export TORCH_CPP_LOG_LEVEL=INFO
accelerate launch --config_file issue_1401.yaml issue_1401.py

Output:

[I socket.cpp:566] [c10d] The server socket has started to listen on [::]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on [::ffff:127.0.0.1]:54446.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on [::ffff:127.0.0.1]:54448.
[2023-05-11 04:54:19,433] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on [::ffff:127.0.0.1]:54450.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on [::ffff:127.0.0.1]:54452.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on [::ffff:127.0.0.1]:54454.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on [::ffff:127.0.0.1]:54456.
[I ProcessGroupNCCL.cpp:665] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 96000000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:842] [Rank 1] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:665] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 96000000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:842] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:844] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:844] [Rank 1] NCCL watchdog thread terminated normally

pacman100 avatar May 11 '23 02:05 pacman100

Working as expected on my end: the output above shows `TIMEOUT(ms): 96000000`, i.e. the 96000-second timeout passed via `InitProcessGroupKwargs` was applied.

pacman100 avatar May 11 '23 02:05 pacman100

Can you install the latest release 0.19.0 instead of 0.19.0.dev and let us know?

pacman100 avatar May 11 '23 02:05 pacman100

I also hit this problem.

965694547 avatar May 19 '23 02:05 965694547

I also hit this problem. Does the new version solve it?

soap117 avatar Jun 11 '23 23:06 soap117

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 06 '23 15:07 github-actions[bot]