InitProcessGroupKwargs(timeout=timedelta(seconds=3600)) does not work
System Info
ubuntu 20.04
cuda 11.7
torch 2.0
python 3.8
accelerate 0.19.0.dev0
deepspeed 0.9.2
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
- [X] My own task or dataset (give details below)
Reproduction
I used my own dataset to run run_clm_no_trainer.py and hit a timeout. I then ran the following code to set a longer timeout, but the error message still reports the default 30 minutes instead of the value I set.
Code:
from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs

kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=96000))
accelerator = (
    Accelerator(kwargs_handlers=[kwargs], log_with=args.report_to, logging_dir=args.output_dir)
    if args.with_tracking
    else Accelerator(kwargs_handlers=[kwargs])
)
Timeout error output:
accelerate key: store_based_barrier_key:2 (world_size=8, worker_count=1, timeout=0:30:00)
It seems that passing kwargs_handlers=[kwargs] does not change the default timeout.
Expected behavior
The process group should be initialized with the timeout passed via InitProcessGroupKwargs (96000 seconds here), instead of falling back to the default 30 minutes.
cc @muellerzr
Please give us the output of accelerate env and how you are creating your DataLoaders and Dataset (rough code will work).
The key issue is not with the dataset and dataloader, but rather that my change to the timeout has not taken effect. You can reproduce it by sleeping (time.sleep) for more than 30 minutes, using DeepSpeed stage 3. The launch script:
#!/bin/bash
OUTPUT=$1
if [ "$OUTPUT" == "" ]; then
OUTPUT=./output_belle1b_accelerate
fi
mkdir -p $OUTPUT
nohup accelerate launch --config_file=config/default_config_single_machine.yaml train/train_sft_accelerate.py \
--train_file=data/train.json \
--model_name_or_path=/data/belle-1b-260b \
--output_dir=$OUTPUT \
--max_length=1024 \
--num_train_epochs=5 \
--learning_rate=1e-5 \
--per_device_train_batch_size=7 \
--per_device_eval_batch_size=7 \
--eval_step=10 \
--checkpointing_steps="epoch" > $OUTPUT/train.log 2>&1 &
You can reproduce my issue as follows: my dataset tokenizes all data during loading, which takes longer than 30 minutes. I set the timeout to more than 30 minutes, but the timeout error still reports 30 minutes. This is with DeepSpeed stage 3; a minimal sketch of the reproduction follows.
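A minimal sketch of that reproduction (an approximation, not the original script: the time.sleep stands in for the long tokenization, and the rank check delays every rank except rank 0 so rank 0 ends up waiting in the store-based barrier, as in the log above):

import os
import time
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

# delay all ranks except rank 0 for more than 30 minutes, standing in for the
# slow dataset tokenization (RANK is set by accelerate launch)
if os.environ.get("RANK", "0") != "0":
    time.sleep(31 * 60)

kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=96000))
accelerator = Accelerator(kwargs_handlers=[kwargs])

# if the 96000-second timeout were applied, rank 0 would wait here; with the
# default 30-minute timeout it raises instead
accelerator.wait_for_everyone()
print(f"rank {accelerator.process_index} passed initialization")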
@bestpredicts please run accelerate env in your CLI and give us your output to help us further with debugging, so we can reproduce your entire setup including configuration.
- `Accelerate` version: 0.19.0.dev0
- Platform: Linux-4.19.96-x86_64-with-glibc2.10
- Python version: 3.8.13
- Numpy version: 1.22.4
- PyTorch version (GPU?): 2.0.0+cu117 (True)
- System RAM: 503.82 GB
- GPU type: NVIDIA A100-SXM4-40GB
- `Accelerate` default config:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
How are you creating your Accelerator object and the Dataset? Is it an IterableDataset?
import pandas as pd

class customer_dataset:
    def __init__(self, df):
        self.df = pd.read_csv(df)
        self.text = self.df["text"].tolist()
        # tokenize all data up front into {"input_ids": ..., "attention_mask": ..., "token_type_ids": ...};
        # this takes over 1 hour (a lazier sketch follows below)
        self.all_data = tokenizer(self.text)

    def __len__(self):
        return len(self.text)

    def __getitem__(self, idx):
        return self.all_data[idx]
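For reference, a lazier variant of this dataset is sketched below (an assumption, not the code from the report): tokenizing per sample in __getitem__ keeps __init__ fast, so no rank spends the first 30+ minutes outside the collective calls. tokenizer and max_length are placeholders for whatever the training script actually uses.

import pandas as pd
from torch.utils.data import Dataset

class LazyCustomerDataset(Dataset):
    """Sketch: defer tokenization to __getitem__ so construction stays fast."""

    def __init__(self, csv_path, tokenizer, max_length=1024):
        self.text = pd.read_csv(csv_path)["text"].tolist()
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.text)

    def __getitem__(self, idx):
        # tokenize a single sample on demand instead of the whole corpus up front
        return self.tokenizer(self.text[idx], truncation=True, max_length=self.max_length)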
It seems to be the same issue as #1129?
Looks to be so
CC @pacman100 to verify however
Hello @bestpredicts, since the config has zero3_init_flag set to True, DeepSpeed uses its default timeout only. You have already referenced the correct issue for this.
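For what it's worth, a hedged workaround sketch (an assumption, not an official recommendation from this thread): create the default process group with the long timeout before constructing the Accelerator, so any later initialization, including DeepSpeed's, reuses that group instead of building one with its own default timeout.

import os
from datetime import timedelta

import torch
import torch.distributed as dist
from accelerate import Accelerator

# assumes the script is started via accelerate launch, which sets RANK,
# LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for init_method="env://"
if not dist.is_initialized():
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
    dist.init_process_group(backend="nccl", timeout=timedelta(seconds=96000))

accelerator = Accelerator()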
Setting zero3_init_flag=false also gives the same error:
05/10/2023 23:07:45 - INFO - torch.distributed.distributed_c10d - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:2 (world_size=8, worker_count=1, timeout=0:30:00)
05/10/2023 23:07:55 - INFO - torch.distributed.distributed_c10d - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:2 (world_size=8, worker_count=1, timeout=0:30:00)
05/10/2023 23:08:05 - INFO - torch.distributed.distributed_c10d - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:2 (world_size=8, worker_count=1, timeout=0:30:00)
05/10/2023 23:08:15 - INFO - torch.distributed.distributed_c10d - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:2 (world_size=8, worker_count=1, timeout=0:30:00)
05/10/2023 23:08:26 - INFO - torch.distributed.distributed_c10d - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:2 (world_size=8, worker_count=1, timeout=0:30:00)
Code:
from accelerate import Accelerator, InitProcessGroupKwargs
import torch.distributed as dist
from datetime import datetime, timedelta
kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=96000))
accelerator = Accelerator(kwargs_handlers=[kwargs])
accelerate env:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero3_save_16bit_model: false
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Command to run:
export TORCH_CPP_LOG_LEVEL=INFO
accelerate launch --config_file issue_1401.yaml issue_1401.py
Output:
[I socket.cpp:566] [c10d] The server socket has started to listen on [::]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on [::ffff:127.0.0.1]:54446.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on [::ffff:127.0.0.1]:54448.
[2023-05-11 04:54:19,433] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on [::ffff:127.0.0.1]:54450.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on [::ffff:127.0.0.1]:54452.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on [::ffff:127.0.0.1]:54454.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on [::ffff:127.0.0.1]:54456.
[I ProcessGroupNCCL.cpp:665] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 96000000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:842] [Rank 1] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:665] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 96000000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:842] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:844] [Rank 0] NCCL watchdog thread terminated normally
[I ProcessGroupNCCL.cpp:844] [Rank 1] NCCL watchdog thread terminated normally
Working as expected.
Can you install the latest release 0.19.0 instead of 0.19.0.dev and let us know?
I also encountered this problem.
I also encountered this problem; does the new version solve it?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.