DeepSpeedExamples

deepspeed-chat example script run error at step2

vpegasus opened this issue 1 year ago · 0 comments

I ran the command: python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_node

When the process reaches step 2:

Launch command: bash /mnt/disks/data-1/marvin/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/training_scripts/single_node/run_350m.sh /mnt/disks/data-1/marvin/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m

we encounter the following error, and I cannot work out what is causing it:

[2023-04-21 02:53:26,570] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-21 02:53:26,627] [INFO] [runner.py:540:main] cmd = /opt/conda/envs/ftchat/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets openai/webgpt_comparisons stanfordnlp/SHP --data_split 2,4,4 --model_name_or_path facebook/opt-350m --num_padding_at_beginning 1 --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --max_seq_len 512 --learning_rate 5e-5 --weight_decay 0.1 --num_train_epochs 1 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --zero_stage 0 --deepspeed --output_dir /mnt/disks/data-1/marvin/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m
[2023-04-21 02:53:29,046] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-04-21 02:53:29,046] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-04-21 02:53:29,046] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-04-21 02:53:29,046] [INFO] [launch.py:247:main] dist_world_size=2
[2023-04-21 02:53:29,046] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-04-21 02:53:32,349] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Downloading (…)okenizer_config.json: 100%|████████████████████████████████████████████████| 685/685 [00:00<00:00, 197kB/s]
Downloading (…)lve/main/config.json: 100%|████████████████████████████████████████████████| 644/644 [00:00<00:00, 382kB/s]
Downloading (…)olve/main/vocab.json: 100%|█████████████████████████████████████████████| 899k/899k [00:00<00:00, 11.0MB/s]
Downloading (…)olve/main/merges.txt: 100%|█████████████████████████████████████████████| 456k/456k [00:00<00:00, 25.2MB/s]
Downloading (…)cial_tokens_map.json: 100%|████████████████████████████████████████████████| 441/441 [00:00<00:00, 188kB/s]
Downloading pytorch_model.bin: 100%|████████████████████████████████████████████████████| 663M/663M [00:01<00:00, 395MB/s]
Found cached dataset parquet (/mnt/disks/data-1/marvin/datasets_cache/Dahoas___parquet/default-b9d2c4937d617106/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|██████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 556.61it/s]
Found cached dataset parquet (/mnt/disks/data-1/marvin/datasets_cache/Dahoas___parquet/default-b9d2c4937d617106/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|██████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 641.18it/s]
Found cached dataset parquet (/mnt/disks/data-1/marvin/datasets_cache/Dahoas___parquet/default-b25c081aeeca3652/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|██████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 530.42it/s]
Found cached dataset parquet (/mnt/disks/data-1/marvin/datasets_cache/Dahoas___parquet/default-b25c081aeeca3652/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|██████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 565.96it/s]
Found cached dataset parquet (/mnt/disks/data-1/marvin/datasets_cache/Dahoas___parquet/Dahoas--synthetic-instruct-gptj-pairwise-0b2fd7bd9ea121cb/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 563.30it/s]
Found cached dataset parquet (/mnt/disks/data-1/marvin/datasets_cache/Dahoas___parquet/Dahoas--synthetic-instruct-gptj-pairwise-0b2fd7bd9ea121cb/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 339.37it/s]
Found cached dataset parquet (/mnt/disks/data-1/marvin/datasets_cache/yitingxie___parquet/yitingxie--rlhf-reward-datasets-f2627438ff1fb9dd/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|██████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 665.45it/s]
Found cached dataset parquet (/mnt/disks/data-1/marvin/datasets_cache/yitingxie___parquet/yitingxie--rlhf-reward-datasets-f2627438ff1fb9dd/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|██████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 736.49it/s]
Found cached dataset webgpt_comparisons (/mnt/disks/data-1/marvin/datasets_cache/openai___webgpt_comparisons/default/0.0.0/8b5d5879cdc98c4c0099af6053dffe8d504588d43d3b11f1b1ec223ab1e8db0a)
100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 359.59it/s]
Found cached dataset webgpt_comparisons (/mnt/disks/data-1/marvin/datasets_cache/openai___webgpt_comparisons/default/0.0.0/8b5d5879cdc98c4c0099af6053dffe8d504588d43d3b11f1b1ec223ab1e8db0a)
100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 396.06it/s]
Found cached dataset json (/mnt/disks/data-1/marvin/datasets_cache/stanfordnlp___json/stanfordnlp--SHP-10ead9e54f5a107d/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|██████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 171.82it/s]
Found cached dataset json (/mnt/disks/data-1/marvin/datasets_cache/stanfordnlp___json/stanfordnlp--SHP-10ead9e54f5a107d/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|██████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 175.38it/s]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/mawenjia/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Detected CUDA files, patching ldflags
Emitting ninja build file /home/mawenjia/.cache/torch_extensions/py39_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 2.921328544616699 seconds
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /mnt/disks/data-1/marvin/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_mod │
│ el_finetuning/main.py:348 in <module>                                                            │
│                                                                                                  │
│   345                                                                                            │
│   346                                                                                            │
│   347 if __name__ == "__main__":                                                                 │
│ ❱ 348 │   main()                                                                                 │
│   349                                                                                            │
│                                                                                                  │
│ /mnt/disks/data-1/marvin/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_mod │
│ el_finetuning/main.py:271 in main                                                                │
│                                                                                                  │
│   268 │   │   rm_model, args.weight_decay)                                                       │
│   269 │                                                                                          │
│   270 │   AdamOptimizer = DeepSpeedCPUAdam if args.offload else FusedAdam                        │
│ ❱ 271 │   optimizer = AdamOptimizer(optimizer_grouped_parameters,                                │
│   272 │   │   │   │   │   │   │     lr=args.learning_rate,                                       │
│   273 │   │   │   │   │   │   │     betas=(0.9, 0.95))                                           │
│   274                                                                                            │
│                                                                                                  │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/deepspeed/ops/adam/fused_adam.py:71 in        │
│ __init__                                                                                         │
│                                                                                                  │
│    68 │   │   self.adam_w_mode = 1 if adam_w_mode else 0                                         │
│    69 │   │   self.set_grad_none = set_grad_none                                                 │
│    70 │   │                                                                                      │
│ ❱  71 │   │   fused_adam_cuda = FusedAdamBuilder().load()                                        │
│    72 │   │   # Skip buffer                                                                      │
│    73 │   │   self._dummy_overflow_buf = get_accelerator().IntTensor([0])                        │
│    74 │   │   self.multi_tensor_adam = fused_adam_cuda.multi_tensor_adam                         │
│                                                                                                  │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py:445 in    │
│ load                                                                                             │
│                                                                                                  │
│   442 │   │   │                                                                                  │
│   443 │   │   │   return importlib.import_module(self.absolute_name())                           │
│   444 │   │   else:                                                                              │
│ ❱ 445 │   │   │   return self.jit_load(verbose)                                                  │
│   446 │                                                                                          │
│   447 │   def jit_load(self, verbose=True):                                                      │
│   448 │   │   if not self.is_compatible(verbose):                                                │
│                                                                                                  │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py:465 in    │
│ jit_load                                                                                         │
│                                                                                                  │
│   462 │   │   │   │   self.build_for_cpu = True                                                  │
│   463 │   │                                                                                      │
│   464 │   │   self.jit_mode = True                                                               │
│ ❱ 465 │   │   from torch.utils.cpp_extension import load                                         │
│   466 │   │                                                                                      │
│   467 │   │   start_build = time.time()                                                          │
│   468 │   │   sources = [self.deepspeed_src_path(path) for path in self.sources()]               │
│                                                                                                  │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/torch/utils/cpp_extension.py:19 in <module>   │
│                                                                                                  │
│     16 import torch._appdirs                                                                     │
│     17 from .file_baton import FileBaton                                                         │
│     18 from ._cpp_extension_versioner import ExtensionVersioner                                  │
│ ❱   19 from .hipify import hipify_python                                                         │
│     20 from .hipify.hipify_python import GeneratedFileCleaner                                    │
│     21 from typing import Dict, List, Optional, Union, Tuple                                     │
│     22 from torch.torch_version import TorchVersion                                              │
│                                                                                                  │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/torch/utils/hipify/hipify_python.py:34 in     │
│ <module>                                                                                         │
│                                                                                                  │
│     31 import os                                                                                 │
│     32                                                                                           │
│     33 from . import constants                                                                   │
│ ❱   34 from .cuda_to_hip_mappings import CUDA_TO_HIP_MAPPINGS                                    │
│     35 from .cuda_to_hip_mappings import MATH_TRANSPILATIONS                                     │
│     36                                                                                           │
│     37 from typing import Dict, List, Iterator, Optional                                         │
│                                                                                                  │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/torch/utils/hipify/cuda_to_hip_mappings.py:34 │
│ in <module>                                                                                      │
│                                                                                                  │
│     31 # As of ROCm 5.0, the version is found in rocm_version.h header file under /opt/rocm/inc  │
│     32 rocm_path = os.environ.get('ROCM_HOME') or os.environ.get('ROCM_PATH') or "/opt/rocm"     │
│     33 try:                                                                                      │
│ ❱   34 │   rocm_path = subprocess.check_output(["hipconfig", "--rocmpath"]).decode("utf-8")      │
│     35 except subprocess.CalledProcessError:                                                     │
│     36 │   print(f"Warning: hipconfig --rocmpath failed, assuming {rocm_path}")                  │
│     37 except (FileNotFoundError, PermissionError):                                              │
│                                                                                                  │
│ /opt/conda/envs/ftchat/lib/python3.9/subprocess.py:424 in check_output                           │
│                                                                                                  │
│    421 │   │   │   empty = b''                                                                   │
│    422 │   │   kwargs['input'] = empty                                                           │
│    423 │                                                                                         │
│ ❱  424 │   return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,                      │
│    425 │   │   │      **kwargs).stdout                                                           │
│    426                                                                                           │
│    427                                                                                           │
│                                                                                                  │
│ /opt/conda/envs/ftchat/lib/python3.9/subprocess.py:505 in run                                    │
│                                                                                                  │
│    502 │   │   kwargs['stdout'] = PIPE                                                           │
│    503 │   │   kwargs['stderr'] = PIPE                                                           │
│    504 │                                                                                         │
│ ❱  505 │   with Popen(*popenargs, **kwargs) as process:                                          │
│    506 │   │   try:                                                                              │
│    507 │   │   │   stdout, stderr = process.communicate(input, timeout=timeout)                  │
│    508 │   │   except TimeoutExpired as exc:                                                     │
│                                                                                                  │
│ /opt/conda/envs/ftchat/lib/python3.9/subprocess.py:951 in __init__                               │
│                                                                                                  │
│    948 │   │   │   │   │   self.stderr = io.TextIOWrapper(self.stderr,                           │
│    949 │   │   │   │   │   │   │   encoding=encoding, errors=errors)                             │
│    950 │   │   │                                                                                 │
│ ❱  951 │   │   │   self._execute_child(args, executable, preexec_fn, close_fds,                  │
│    952 │   │   │   │   │   │   │   │   pass_fds, cwd, env,                                       │
│    953 │   │   │   │   │   │   │   │   startupinfo, creationflags, shell,                        │
│    954 │   │   │   │   │   │   │   │   p2cread, p2cwrite,                                        │
│                                                                                                  │
│ /opt/conda/envs/ftchat/lib/python3.9/subprocess.py:1754 in _execute_child                        │
│                                                                                                  │
│   1751 │   │   │   │   │   │   │   for dir in os.get_exec_path(env))                             │
│   1752 │   │   │   │   │   fds_to_keep = set(pass_fds)                                           │
│   1753 │   │   │   │   │   fds_to_keep.add(errpipe_write)                                        │
│ ❱ 1754 │   │   │   │   │   self.pid = _posixsubprocess.fork_exec(                                │
│   1755 │   │   │   │   │   │   │   args, executable_list,                                        │
│   1756 │   │   │   │   │   │   │   close_fds, tuple(sorted(map(int, fds_to_keep))),              │
│   1757 │   │   │   │   │   │   │   cwd, env_list,                                                │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OSError: [Errno 12] Cannot allocate memory
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /mnt/disks/data-1/marvin/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_mod │
│ el_finetuning/main.py:348 in <module>                                                            │
│                                                                                                  │
│   345                                                                                            │
│   346                                                                                            │
│   347 if __name__ == "__main__":                                                                 │
│ ❱ 348 │   main()                                                                                 │
│   349                                                                                            │
│                                                                                                  │
│ /mnt/disks/data-1/marvin/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_mod │
│ el_finetuning/main.py:285 in main                                                                │
│                                                                                                  │
│   282 │   │   num_training_steps=args.num_train_epochs * num_update_steps_per_epoch,             │
│   283 │   )                                                                                      │
│   284 │                                                                                          │
│ ❱ 285 │   rm_model, optimizer, _, lr_scheduler = deepspeed.initialize(                           │
│   286 │   │   model=rm_model,                                                                    │
│   287 │   │   optimizer=optimizer,                                                               │
│   288 │   │   args=args,                                                                         │
│                                                                                                  │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/deepspeed/__init__.py:165 in initialize       │
│                                                                                                  │
│   162 │   │   │   │   │   │   │   │   │   │      config=config,                                  │
│   163 │   │   │   │   │   │   │   │   │   │      config_class=config_class)                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   engine = DeepSpeedEngine(args=args,                                            │
│   166 │   │   │   │   │   │   │   │   │    model=model,                                          │
│   167 │   │   │   │   │   │   │   │   │    optimizer=optimizer,                                  │
│   168 │   │   │   │   │   │   │   │   │    model_parameters=model_parameters,                    │
│                                                                                                  │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/deepspeed/runtime/engine.py:266 in __init__   │
│                                                                                                  │
│    263 │   │   self.pipeline_parallelism = isinstance(model, PipelineModule)                     │
│    264 │   │                                                                                     │
│    265 │   │   # Configure distributed model                                                     │
│ ❱  266 │   │   self._configure_distributed_model(model)                                          │
│    267 │   │                                                                                     │
│    268 │   │   self._get_model_parameters()                                                      │
│    269                                                                                           │
│                                                                                                  │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/deepspeed/runtime/engine.py:1060 in           │
│ _configure_distributed_model                                                                     │
│                                                                                                  │
│   1057 │   │   │   │   module.set_deepspeed_parallelism()                                        │
│   1058 │   │                                                                                     │
│   1059 │   │   # Query the groups module to get information about various parallel groups        │
│ ❱ 1060 │   │   self.data_parallel_group = groups._get_data_parallel_group()                      │
│   1061 │   │   self.dp_world_size = groups._get_data_parallel_world_size()                       │
│   1062 │   │   self.mp_world_size = groups._get_model_parallel_world_size()                      │
│   1063 │   │   self.expert_parallel_group = groups._get_expert_parallel_group_dict()             │
│                                                                                                  │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/deepspeed/utils/groups.py:327 in              │
│ _get_data_parallel_group                                                                         │
│                                                                                                  │
│   324 │   if mpu is not None:                                                                    │
│   325 │   │   return mpu.get_data_parallel_group()                                               │
│   326 │   # Return the clone of dist world group                                                 │
│ ❱ 327 │   return _clone_world_group()                                                            │
│   328                                                                                            │
│   329                                                                                            │
│   330 def _get_broadcast_src_rank():                                                             │
│                                                                                                  │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/deepspeed/utils/groups.py:315 in              │
│ _clone_world_group                                                                               │
│                                                                                                  │
│   312 │   global _WORLD_GROUP                                                                    │
│   313 │   if _WORLD_GROUP is None:                                                               │
│   314 │   │   # If not cloned already, clone the world group                                     │
│ ❱ 315 │   │   _WORLD_GROUP = dist.new_group(ranks=range(dist.get_world_size()))                  │
│   316 │   return _WORLD_GROUP                                                                    │
│   317                                                                                            │
│   318                                                                                            │
│                                                                                                  │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/deepspeed/comm/comm.py:179 in new_group       │
│                                                                                                  │
│   176 │   global cdb                                                                             │
│   177 │   assert cdb is not None and cdb.is_initialized(                                         │
│   178 │   ), 'DeepSpeed backend not set, please initialize it using init_process_group()'        │
│ ❱ 179 │   return cdb.new_group(ranks)                                                            │
│   180                                                                                            │
│   181                                                                                            │
│   182 def is_available() -> bool:                                                                │
│                                                                                                  │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/deepspeed/comm/torch.py:173 in new_group      │
│                                                                                                  │
│   170 │   │   return torch.distributed.get_backend(group=group)                                  │
│   171 │                                                                                          │
│   172 │   def new_group(self, ranks):                                                            │
│ ❱ 173 │   │   return torch.distributed.new_group(ranks)                                          │
│   174 │                                                                                          │
│   175 │   def get_global_rank(self, group, group_rank):                                          │
│   176 │   │   if hasattr(torch.distributed.distributed_c10d, "get_global_rank"):                 │
│                                                                                                  │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:3529 in │
│ new_group                                                                                        │
│                                                                                                  │
│   3526 │   else:                                                                                 │
│   3527 │   │   # Use store based barrier here since barrier() used a bunch of                    │
│   3528 │   │   # default devices and messes up NCCL internal state.                              │
│ ❱ 3529 │   │   _store_based_barrier(global_rank, default_store, timeout)                         │
│   3530 │                                                                                         │
│   3531 │   return pg                                                                             │
│   3532                                                                                           │
│                                                                                                  │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:446 in  │
│ _store_based_barrier                                                                             │
│                                                                                                  │
│    443 │   log_time = time.time()                                                                │
│    444 │   while worker_count != world_size:                                                     │
│    445 │   │   time.sleep(0.01)                                                                  │
│ ❱  446 │   │   worker_count = store.add(store_key, 0)                                            │
│    447 │   │                                                                                     │
│    448 │   │   # Print status periodically to keep track.                                        │
│    449 │   │   if timedelta(seconds=(time.time() - log_time)) > timedelta(seconds=10):           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Broken pipe
[2023-04-21 03:07:23,960] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 22609
[2023-04-21 03:07:23,960] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 22610
[2023-04-21 03:07:25,713] [ERROR] [launch.py:434:sigkill_handler] ['/opt/conda/envs/ftchat/bin/python3.9', '-u', 'main.py', '--local_rank=1', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', 'openai/webgpt_comparisons', 'stanfordnlp/SHP', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-350m', '--num_padding_at_beginning', '1', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--max_seq_len', '512', '--learning_rate', '5e-5', '--weight_decay', '0.1', '--num_train_epochs', '1', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '0', '--deepspeed', '--output_dir', '/mnt/disks/data-1/marvin/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m'] exits with return code = 1
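From the two tracebacks it looks like one rank dies while DeepSpeed JIT-loads the fused_adam op: torch's hipify module probes for ROCm by spawning hipconfig, and the fork for that subprocess fails with OSError: [Errno 12] Cannot allocate memory. The other rank then fails in the store-based barrier of dist.new_group with RuntimeError: Broken pipe. As a minimal sketch (run outside of training, with the call taken from the traceback above), the failing probe is roughly:

import subprocess

# Same probe torch.utils.hipify.cuda_to_hip_mappings runs at import time.
# On our machine the fork for this subprocess is what raises
# OSError: [Errno 12] Cannot allocate memory.
try:
    rocm_path = subprocess.check_output(["hipconfig", "--rocmpath"]).decode("utf-8")
    print("ROCm path:", rocm_path)
except FileNotFoundError:
    # Expected on a CUDA-only machine; torch handles this case.
    print("hipconfig not found")
except OSError as e:
    # This is the Errno 12 we see in the log.
    print("fork/exec failed:", e)

I assume fork() simply cannot allocate memory for the child process at that point, but I have not confirmed that.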

Please help us, thanks!

vpegasus · Apr 21 '23