DeepSpeedExamples
DeepSpeed-Chat example script fails at step 2
I ran the command: python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_node
When the process reaches step 2, the following launcher command is issued:
Launch command: bash /mnt/disks/data-1/marvin/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/training_scripts/single_node/run_350m.sh /mnt/disks/data-1/marvin/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m
and we hit the error below; I cannot work out what is causing it:
[2023-04-21 02:53:26,570] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-21 02:53:26,627] [INFO] [runner.py:540:main] cmd = /opt/conda/envs/ftchat/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets openai/webgpt_comparisons stanfordnlp/SHP --data_split 2,4,4 --model_name_or_path facebook/opt-350m --num_padding_at_beginning 1 --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --max_seq_len 512 --learning_rate 5e-5 --weight_decay 0.1 --num_train_epochs 1 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --zero_stage 0 --deepspeed --output_dir /mnt/disks/data-1/marvin/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m
[2023-04-21 02:53:29,046] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-04-21 02:53:29,046] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-04-21 02:53:29,046] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-04-21 02:53:29,046] [INFO] [launch.py:247:main] dist_world_size=2
[2023-04-21 02:53:29,046] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-04-21 02:53:32,349] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Downloading (…)okenizer_config.json: 100%|████████████████████████████████████████████████| 685/685 [00:00<00:00, 197kB/s]
Downloading (…)lve/main/config.json: 100%|████████████████████████████████████████████████| 644/644 [00:00<00:00, 382kB/s]
Downloading (…)olve/main/vocab.json: 100%|█████████████████████████████████████████████| 899k/899k [00:00<00:00, 11.0MB/s]
Downloading (…)olve/main/merges.txt: 100%|█████████████████████████████████████████████| 456k/456k [00:00<00:00, 25.2MB/s]
Downloading (…)cial_tokens_map.json: 100%|████████████████████████████████████████████████| 441/441 [00:00<00:00, 188kB/s]
Downloading pytorch_model.bin: 100%|████████████████████████████████████████████████████| 663M/663M [00:01<00:00, 395MB/s]
Found cached dataset parquet (/mnt/disks/data-1/marvin/datasets_cache/Dahoas___parquet/default-b9d2c4937d617106/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|██████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 556.61it/s]
Found cached dataset parquet (/mnt/disks/data-1/marvin/datasets_cache/Dahoas___parquet/default-b9d2c4937d617106/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|██████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 641.18it/s]
Found cached dataset parquet (/mnt/disks/data-1/marvin/datasets_cache/Dahoas___parquet/default-b25c081aeeca3652/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|██████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 530.42it/s]
Found cached dataset parquet (/mnt/disks/data-1/marvin/datasets_cache/Dahoas___parquet/default-b25c081aeeca3652/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|██████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 565.96it/s]
Found cached dataset parquet (/mnt/disks/data-1/marvin/datasets_cache/Dahoas___parquet/Dahoas--synthetic-instruct-gptj-pairwise-0b2fd7bd9ea121cb/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 563.30it/s]
Found cached dataset parquet (/mnt/disks/data-1/marvin/datasets_cache/Dahoas___parquet/Dahoas--synthetic-instruct-gptj-pairwise-0b2fd7bd9ea121cb/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 339.37it/s]
Found cached dataset parquet (/mnt/disks/data-1/marvin/datasets_cache/yitingxie___parquet/yitingxie--rlhf-reward-datasets-f2627438ff1fb9dd/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|██████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 665.45it/s]
Found cached dataset parquet (/mnt/disks/data-1/marvin/datasets_cache/yitingxie___parquet/yitingxie--rlhf-reward-datasets-f2627438ff1fb9dd/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|██████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 736.49it/s]
Found cached dataset webgpt_comparisons (/mnt/disks/data-1/marvin/datasets_cache/openai___webgpt_comparisons/default/0.0.0/8b5d5879cdc98c4c0099af6053dffe8d504588d43d3b11f1b1ec223ab1e8db0a)
100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 359.59it/s]
Found cached dataset webgpt_comparisons (/mnt/disks/data-1/marvin/datasets_cache/openai___webgpt_comparisons/default/0.0.0/8b5d5879cdc98c4c0099af6053dffe8d504588d43d3b11f1b1ec223ab1e8db0a)
100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 396.06it/s]
Found cached dataset json (/mnt/disks/data-1/marvin/datasets_cache/stanfordnlp___json/stanfordnlp--SHP-10ead9e54f5a107d/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|██████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 171.82it/s]
Found cached dataset json (/mnt/disks/data-1/marvin/datasets_cache/stanfordnlp___json/stanfordnlp--SHP-10ead9e54f5a107d/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|██████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 175.38it/s]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[the same huggingface/tokenizers warning is printed three more times]
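As the warning itself suggests, one workaround is to set TOKENIZERS_PARALLELISM before launching, so tokenizers never enables parallelism ahead of the fork. A minimal sketch (where exactly to export it, e.g. inside run_350m.sh, is my assumption):

export TOKENIZERS_PARALLELISM=false   # per the warning text above
bash training_scripts/single_node/run_350m.sh <output_dir>

(This only silences the tokenizers warning; it is unrelated to the crash further down.)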
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/mawenjia/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
[the tokenizers warning appears three more times here]
Detected CUDA files, patching ldflags
Emitting ninja build file /home/mawenjia/.cache/torch_extensions/py39_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[the tokenizers warning appears once more]
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 2.921328544616699 seconds
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /mnt/disks/data-1/marvin/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_mod │
│ el_finetuning/main.py:348 in <module> │
│ │
│ 345 │
│ 346 │
│ 347 if __name__ == "__main__": │
│ ❱ 348 │ main() │
│ 349 │
│ │
│ /mnt/disks/data-1/marvin/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_mod │
│ el_finetuning/main.py:271 in main │
│ │
│ 268 │ │ rm_model, args.weight_decay) │
│ 269 │ │
│ 270 │ AdamOptimizer = DeepSpeedCPUAdam if args.offload else FusedAdam │
│ ❱ 271 │ optimizer = AdamOptimizer(optimizer_grouped_parameters, │
│ 272 │ │ │ │ │ │ │ lr=args.learning_rate, │
│ 273 │ │ │ │ │ │ │ betas=(0.9, 0.95)) │
│ 274 │
│ │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/deepspeed/ops/adam/fused_adam.py:71 in │
│ __init__ │
│ │
│ 68 │ │ self.adam_w_mode = 1 if adam_w_mode else 0 │
│ 69 │ │ self.set_grad_none = set_grad_none │
│ 70 │ │ │
│ ❱ 71 │ │ fused_adam_cuda = FusedAdamBuilder().load() │
│ 72 │ │ # Skip buffer │
│ 73 │ │ self._dummy_overflow_buf = get_accelerator().IntTensor([0]) │
│ 74 │ │ self.multi_tensor_adam = fused_adam_cuda.multi_tensor_adam │
│ │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py:445 in │
│ load │
│ │
│ 442 │ │ │ │
│ 443 │ │ │ return importlib.import_module(self.absolute_name()) │
│ 444 │ │ else: │
│ ❱ 445 │ │ │ return self.jit_load(verbose) │
│ 446 │ │
│ 447 │ def jit_load(self, verbose=True): │
│ 448 │ │ if not self.is_compatible(verbose): │
│ │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py:465 in │
│ jit_load │
│ │
│ 462 │ │ │ │ self.build_for_cpu = True │
│ 463 │ │ │
│ 464 │ │ self.jit_mode = True │
│ ❱ 465 │ │ from torch.utils.cpp_extension import load │
│ 466 │ │ │
│ 467 │ │ start_build = time.time() │
│ 468 │ │ sources = [self.deepspeed_src_path(path) for path in self.sources()] │
│ │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/torch/utils/cpp_extension.py:19 in <module> │
│ │
│ 16 import torch._appdirs │
│ 17 from .file_baton import FileBaton │
│ 18 from ._cpp_extension_versioner import ExtensionVersioner │
│ ❱ 19 from .hipify import hipify_python │
│ 20 from .hipify.hipify_python import GeneratedFileCleaner │
│ 21 from typing import Dict, List, Optional, Union, Tuple │
│ 22 from torch.torch_version import TorchVersion │
│ │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/torch/utils/hipify/hipify_python.py:34 in │
│ <module> │
│ │
│ 31 import os │
│ 32 │
│ 33 from . import constants │
│ ❱ 34 from .cuda_to_hip_mappings import CUDA_TO_HIP_MAPPINGS │
│ 35 from .cuda_to_hip_mappings import MATH_TRANSPILATIONS │
│ 36 │
│ 37 from typing import Dict, List, Iterator, Optional │
│ │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/torch/utils/hipify/cuda_to_hip_mappings.py:34 │
│ in <module> │
│ │
│ 31 # As of ROCm 5.0, the version is found in rocm_version.h header file under /opt/rocm/inc │
│ 32 rocm_path = os.environ.get('ROCM_HOME') or os.environ.get('ROCM_PATH') or "/opt/rocm" │
│ 33 try: │
│ ❱ 34 │ rocm_path = subprocess.check_output(["hipconfig", "--rocmpath"]).decode("utf-8") │
│ 35 except subprocess.CalledProcessError: │
│ 36 │ print(f"Warning: hipconfig --rocmpath failed, assuming {rocm_path}") │
│ 37 except (FileNotFoundError, PermissionError): │
│ │
│ /opt/conda/envs/ftchat/lib/python3.9/subprocess.py:424 in check_output │
│ │
│ 421 │ │ │ empty = b'' │
│ 422 │ │ kwargs['input'] = empty │
│ 423 │ │
│ ❱ 424 │ return run(*popenargs, stdout=PIPE, timeout=timeout, check=True, │
│ 425 │ │ │ **kwargs).stdout │
│ 426 │
│ 427 │
│ │
│ /opt/conda/envs/ftchat/lib/python3.9/subprocess.py:505 in run │
│ │
│ 502 │ │ kwargs['stdout'] = PIPE │
│ 503 │ │ kwargs['stderr'] = PIPE │
│ 504 │ │
│ ❱ 505 │ with Popen(*popenargs, **kwargs) as process: │
│ 506 │ │ try: │
│ 507 │ │ │ stdout, stderr = process.communicate(input, timeout=timeout) │
│ 508 │ │ except TimeoutExpired as exc: │
│ │
│ /opt/conda/envs/ftchat/lib/python3.9/subprocess.py:951 in __init__ │
│ │
│ 948 │ │ │ │ │ self.stderr = io.TextIOWrapper(self.stderr, │
│ 949 │ │ │ │ │ │ │ encoding=encoding, errors=errors) │
│ 950 │ │ │ │
│ ❱ 951 │ │ │ self._execute_child(args, executable, preexec_fn, close_fds, │
│ 952 │ │ │ │ │ │ │ │ pass_fds, cwd, env, │
│ 953 │ │ │ │ │ │ │ │ startupinfo, creationflags, shell, │
│ 954 │ │ │ │ │ │ │ │ p2cread, p2cwrite, │
│ │
│ /opt/conda/envs/ftchat/lib/python3.9/subprocess.py:1754 in _execute_child │
│ │
│ 1751 │ │ │ │ │ │ │ for dir in os.get_exec_path(env)) │
│ 1752 │ │ │ │ │ fds_to_keep = set(pass_fds) │
│ 1753 │ │ │ │ │ fds_to_keep.add(errpipe_write) │
│ ❱ 1754 │ │ │ │ │ self.pid = _posixsubprocess.fork_exec( │
│ 1755 │ │ │ │ │ │ │ args, executable_list, │
│ 1756 │ │ │ │ │ │ │ close_fds, tuple(sorted(map(int, fds_to_keep))), │
│ 1757 │ │ │ │ │ │ │ cwd, env_list, │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OSError: [Errno 12] Cannot allocate memory
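For context: the traceback shows the OSError is raised while Python forks a child to run hipconfig (the hipify probe that torch.utils.cpp_extension imports), and errno 12 (ENOMEM) from fork_exec typically means the kernel could not commit memory for the fork. A few standard Linux checks for memory headroom and overcommit policy (assuming a plain Linux host):

free -h                              # available RAM and swap
cat /proc/sys/vm/overcommit_memory   # 2 = strict accounting; forking a large process can fail
ulimit -v                            # per-process virtual memory limit, if any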
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /mnt/disks/data-1/marvin/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_mod │
│ el_finetuning/main.py:348 in <module> │
│ │
│ 345 │
│ 346 │
│ 347 if __name__ == "__main__": │
│ ❱ 348 │ main() │
│ 349 │
│ │
│ /mnt/disks/data-1/marvin/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_mod │
│ el_finetuning/main.py:285 in main │
│ │
│ 282 │ │ num_training_steps=args.num_train_epochs * num_update_steps_per_epoch, │
│ 283 │ ) │
│ 284 │ │
│ ❱ 285 │ rm_model, optimizer, _, lr_scheduler = deepspeed.initialize( │
│ 286 │ │ model=rm_model, │
│ 287 │ │ optimizer=optimizer, │
│ 288 │ │ args=args, │
│ │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/deepspeed/__init__.py:165 in initialize │
│ │
│ 162 │ │ │ │ │ │ │ │ │ │ config=config, │
│ 163 │ │ │ │ │ │ │ │ │ │ config_class=config_class) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ engine = DeepSpeedEngine(args=args, │
│ 166 │ │ │ │ │ │ │ │ │ model=model, │
│ 167 │ │ │ │ │ │ │ │ │ optimizer=optimizer, │
│ 168 │ │ │ │ │ │ │ │ │ model_parameters=model_parameters, │
│ │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/deepspeed/runtime/engine.py:266 in __init__ │
│ │
│ 263 │ │ self.pipeline_parallelism = isinstance(model, PipelineModule) │
│ 264 │ │ │
│ 265 │ │ # Configure distributed model │
│ ❱ 266 │ │ self._configure_distributed_model(model) │
│ 267 │ │ │
│ 268 │ │ self._get_model_parameters() │
│ 269 │
│ │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/deepspeed/runtime/engine.py:1060 in │
│ _configure_distributed_model │
│ │
│ 1057 │ │ │ │ module.set_deepspeed_parallelism() │
│ 1058 │ │ │
│ 1059 │ │ # Query the groups module to get information about various parallel groups │
│ ❱ 1060 │ │ self.data_parallel_group = groups._get_data_parallel_group() │
│ 1061 │ │ self.dp_world_size = groups._get_data_parallel_world_size() │
│ 1062 │ │ self.mp_world_size = groups._get_model_parallel_world_size() │
│ 1063 │ │ self.expert_parallel_group = groups._get_expert_parallel_group_dict() │
│ │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/deepspeed/utils/groups.py:327 in │
│ _get_data_parallel_group │
│ │
│ 324 │ if mpu is not None: │
│ 325 │ │ return mpu.get_data_parallel_group() │
│ 326 │ # Return the clone of dist world group │
│ ❱ 327 │ return _clone_world_group() │
│ 328 │
│ 329 │
│ 330 def _get_broadcast_src_rank(): │
│ │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/deepspeed/utils/groups.py:315 in │
│ _clone_world_group │
│ │
│ 312 │ global _WORLD_GROUP │
│ 313 │ if _WORLD_GROUP is None: │
│ 314 │ │ # If not cloned already, clone the world group │
│ ❱ 315 │ │ _WORLD_GROUP = dist.new_group(ranks=range(dist.get_world_size())) │
│ 316 │ return _WORLD_GROUP │
│ 317 │
│ 318 │
│ │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/deepspeed/comm/comm.py:179 in new_group │
│ │
│ 176 │ global cdb │
│ 177 │ assert cdb is not None and cdb.is_initialized( │
│ 178 │ ), 'DeepSpeed backend not set, please initialize it using init_process_group()' │
│ ❱ 179 │ return cdb.new_group(ranks) │
│ 180 │
│ 181 │
│ 182 def is_available() -> bool: │
│ │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/deepspeed/comm/torch.py:173 in new_group │
│ │
│ 170 │ │ return torch.distributed.get_backend(group=group) │
│ 171 │ │
│ 172 │ def new_group(self, ranks): │
│ ❱ 173 │ │ return torch.distributed.new_group(ranks) │
│ 174 │ │
│ 175 │ def get_global_rank(self, group, group_rank): │
│ 176 │ │ if hasattr(torch.distributed.distributed_c10d, "get_global_rank"): │
│ │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:3529 in │
│ new_group │
│ │
│ 3526 │ else: │
│ 3527 │ │ # Use store based barrier here since barrier() used a bunch of │
│ 3528 │ │ # default devices and messes up NCCL internal state. │
│ ❱ 3529 │ │ _store_based_barrier(global_rank, default_store, timeout) │
│ 3530 │ │
│ 3531 │ return pg │
│ 3532 │
│ │
│ /opt/conda/envs/ftchat/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:446 in │
│ _store_based_barrier │
│ │
│ 443 │ log_time = time.time() │
│ 444 │ while worker_count != world_size: │
│ 445 │ │ time.sleep(0.01) │
│ ❱ 446 │ │ worker_count = store.add(store_key, 0) │
│ 447 │ │ │
│ 448 │ │ # Print status periodically to keep track. │
│ 449 │ │ if timedelta(seconds=(time.time() - log_time)) > timedelta(seconds=10): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Broken pipe
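My reading is that this second traceback is a secondary failure: rank 0 died with the OSError above and took the rendezvous store with it, so rank 1's store-based barrier in new_group hit a Broken pipe.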
[2023-04-21 03:07:23,960] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 22609
[2023-04-21 03:07:23,960] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 22610
[2023-04-21 03:07:25,713] [ERROR] [launch.py:434:sigkill_handler] ['/opt/conda/envs/ftchat/bin/python3.9', '-u', 'main.py', '--local_rank=1', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', 'openai/webgpt_comparisons', 'stanfordnlp/SHP', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-350m', '--num_padding_at_beginning', '1', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '8', '--max_seq_len', '512', '--learning_rate', '5e-5', '--weight_decay', '0.1', '--num_train_epochs', '1', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '0', '--deepspeed', '--output_dir', '/mnt/disks/data-1/marvin/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m'] exits with return code = 1
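In case it helps to reproduce, the same flags from the launcher log can be run on a single GPU (a sketch; --num_gpus is the standard deepspeed launcher flag, and the reduced --data_path list is just to keep the repro small):

cd training/step2_reward_model_finetuning
deepspeed --num_gpus 1 main.py \
    --data_path Dahoas/rm-static --data_split 2,4,4 \
    --model_name_or_path facebook/opt-350m --num_padding_at_beginning 1 \
    --per_device_train_batch_size 8 --per_device_eval_batch_size 8 \
    --max_seq_len 512 --learning_rate 5e-5 --weight_decay 0.1 \
    --num_train_epochs 1 --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 \
    --zero_stage 0 --deepspeed --output_dir ./output_test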
Please help us, thanks!