
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2)

ImGoodBai opened this issue 1 year ago • 8 comments

System Info

python -x  examples/Aquila/Aquila-chat/aquila_chat.py 
[2023-06-10 13:32:17,187] [INFO] [logger.py:85:log_dist] [Rank -1] Unsupported bmtrain
args list is ['--IGNORE_INDEX', '-100', '--adam_beta1', '0.9', '--adam_beta2', '0.999', '--already_fp16', 'False', '--batch_size', '1', '--bmt_async_load', 'False', '--bmt_cpu_offload', 'True', '--bmt_loss_scale', '1024.0', '--bmt_loss_scale_steps', '1024', '--bmt_lr_decay_style', 'cosine', '--bmt_pre_load', 'False', '--checkpoint_activations', 'False', '--clip_grad', '1.0', '--deepspeed_activation_checkpointing', 'False', '--deepspeed_config', './deepspeed.json', '--enable_flash_attn_models', 'False', '--enable_sft_conversations_dataset', 'False', '--enable_sft_conversations_dataset_v2', 'False', '--enable_sft_conversations_dataset_v3', 'False', '--enable_sft_dataset', 'False', '--enable_sft_dataset_dir', 'None', '--enable_sft_dataset_file', 'None', '--enable_sft_dataset_jsonl', 'False', '--enable_sft_dataset_text', 'False', '--enable_sft_dataset_val_file', 'None', '--enable_weighted_dataset_v2', 'False', '--env_type', 'bmtrain', '--epochs', '100', '--eps', '1e-08', '--eval_interval', '5000', '--experiment_name', 'test_experiment', '--fp16', 'True', '--gradient_accumulation_steps', '1', '--load_dir', 'None', '--load_optim', 'False', '--load_rng', 'False', '--load_type', 'latest', '--log_interval', '10', '--lora', 'False', '--lora_alpha', '16', '--lora_dropout', '0.05', '--lora_r', '8', '--lora_target_modules', "['wq', 'wv']", '--lr', '0.0002', '--model_name', 'test_model', '--model_parallel_size', '1', '--num_checkpoints', '1', '--pre_load_dir', 'None', '--pytorch_device', 'cuda', '--resume_dataset', 'False', '--save_dir', 'checkpoints_aquila', '--save_interval', '5000', '--save_optim', 'False', '--save_rng', 'False', '--seed', '1234', '--shuffle_dataset', 'True', '--skip_iters', '0', '--tensorboard', 'False', '--tensorboard_dir', 'tensorboard_summary', '--training_script', '/data/good/FlagAI/examples/Aquila/Aquila-chat/aquila_chat.py', '--wandb', 'True', '--wandb_dir', './wandb', '--wandb_key', '3e614eb678063929b16c9b9aec557e2949d5a814', '--warm_up', '0.1', '--warm_up_iters', '0', '--warmup_start_lr', '0.0', '--weight_decay', '0.001', '--yaml_config', 'None']
[2023-06-10 13:32:17,559] [INFO] [logger.py:85:log_dist] [Rank -1] not_call_launch: False
[2023-06-10 13:32:17,559] [INFO] [logger.py:85:log_dist] [Rank -1] Unable to find hostfile, will proceed with training with local resources only.
[2023-06-10 13:32:17,559] [INFO] [logger.py:85:log_dist] [Rank -1] export NUM_NODES=1; export GPUS_PER_NODE=1; /home/good/anaconda3/envs/flagai/bin/python -m torch.distributed.launch --nproc_per_node 1 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 17750 /data/good/FlagAI/examples/Aquila/Aquila-chat/aquila_chat.py --IGNORE_INDEX -100 --adam_beta1 0.9 --adam_beta2 0.999 --already_fp16 False --batch_size 1 --bmt_async_load False --bmt_cpu_offload True --bmt_loss_scale 1024.0 --bmt_loss_scale_steps 1024 --bmt_lr_decay_style cosine --bmt_pre_load False --checkpoint_activations False --clip_grad 1.0 --deepspeed_activation_checkpointing False --deepspeed_config ./deepspeed.json --enable_flash_attn_models False --enable_sft_conversations_dataset False --enable_sft_conversations_dataset_v2 False --enable_sft_conversations_dataset_v3 False --enable_sft_dataset False --enable_sft_dataset_dir None --enable_sft_dataset_file None --enable_sft_dataset_jsonl False --enable_sft_dataset_text False --enable_sft_dataset_val_file None --enable_weighted_dataset_v2 False --env_type bmtrain --epochs 100 --eps 1e-08 --eval_interval 5000 --experiment_name test_experiment --fp16 True --gradient_accumulation_steps 1 --load_dir None --load_optim False --load_rng False --load_type latest --log_interval 10 --lora False --lora_alpha 16 --lora_dropout 0.05 --lora_r 8 --lora_target_modules ['wq', 'wv'] --lr 0.0002 --model_name test_model --model_parallel_size 1 --num_checkpoints 1 --pre_load_dir None --pytorch_device cuda --resume_dataset False --save_dir checkpoints_aquila --save_interval 5000 --save_optim False --save_rng False --seed 1234 --shuffle_dataset True --skip_iters 0 --tensorboard False --tensorboard_dir tensorboard_summary --training_script /data/good/FlagAI/examples/Aquila/Aquila-chat/aquila_chat.py --wandb True --wandb_dir ./wandb --wandb_key 3e614eb678063929b16c9b9aec557e2949d5a814 --warm_up 0.1 --warm_up_iters 0 --warmup_start_lr 0.0 --weight_decay 0.001 --yaml_config None --not_call_launch
/home/good/anaconda3/envs/flagai/lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
[2023-06-10 13:32:19,445] [INFO] [logger.py:85:log_dist] [Rank -1] Unsupported bmtrain
usage: aquila_chat.py [-h] [--env_type ENV_TYPE] [--experiment_name EXPERIMENT_NAME] [--model_name MODEL_NAME] [--epochs EPOCHS] [--batch_size BATCH_SIZE] [--lr LR]
                      [--warmup_start_lr WARMUP_START_LR] [--seed SEED] [--fp16 FP16] [--pytorch_device PYTORCH_DEVICE] [--clip_grad CLIP_GRAD]
                      [--checkpoint_activations CHECKPOINT_ACTIVATIONS] [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS] [--weight_decay WEIGHT_DECAY] [--eps EPS]
                      [--warm_up WARM_UP] [--warm_up_iters WARM_UP_ITERS] [--skip_iters SKIP_ITERS] [--log_interval LOG_INTERVAL] [--eval_interval EVAL_INTERVAL]
                      [--save_interval SAVE_INTERVAL] [--save_dir SAVE_DIR] [--load_dir LOAD_DIR] [--save_optim SAVE_OPTIM] [--save_rng SAVE_RNG] [--load_type LOAD_TYPE]
                      [--load_optim LOAD_OPTIM] [--load_rng LOAD_RNG] [--tensorboard TENSORBOARD] [--tensorboard_dir TENSORBOARD_DIR]
                      [--deepspeed_activation_checkpointing DEEPSPEED_ACTIVATION_CHECKPOINTING] [--num_checkpoints NUM_CHECKPOINTS] [--deepspeed_config DEEPSPEED_CONFIG]
                      [--model_parallel_size MODEL_PARALLEL_SIZE] [--training_script TRAINING_SCRIPT] [--hostfile HOSTFILE] [--master_ip MASTER_IP] [--master_port MASTER_PORT]
                      [--num_nodes NUM_NODES] [--num_gpus NUM_GPUS] [--not_call_launch] [--local_rank LOCAL_RANK] [--wandb WANDB] [--wandb_dir WANDB_DIR]
                      [--wandb_key WANDB_KEY] [--already_fp16 ALREADY_FP16] [--resume_dataset RESUME_DATASET] [--shuffle_dataset SHUFFLE_DATASET] [--adam_beta1 ADAM_BETA1]
                      [--adam_beta2 ADAM_BETA2] [--bmt_cpu_offload BMT_CPU_OFFLOAD] [--bmt_lr_decay_style BMT_LR_DECAY_STYLE] [--bmt_loss_scale BMT_LOSS_SCALE]
                      [--bmt_loss_scale_steps BMT_LOSS_SCALE_STEPS] [--lora LORA] [--lora_r LORA_R] [--lora_alpha LORA_ALPHA] [--lora_dropout LORA_DROPOUT]
                      [--lora_target_modules LORA_TARGET_MODULES] [--yaml_config YAML_CONFIG] [--bmt_async_load BMT_ASYNC_LOAD] [--bmt_pre_load BMT_PRE_LOAD]
                      [--pre_load_dir PRE_LOAD_DIR] [--enable_sft_dataset_dir ENABLE_SFT_DATASET_DIR] [--enable_sft_dataset_file ENABLE_SFT_DATASET_FILE]
                      [--enable_sft_dataset_val_file ENABLE_SFT_DATASET_VAL_FILE] [--enable_sft_dataset ENABLE_SFT_DATASET] [--enable_sft_dataset_text ENABLE_SFT_DATASET_TEXT]
                      [--enable_sft_dataset_jsonl ENABLE_SFT_DATASET_JSONL] [--enable_sft_conversations_dataset ENABLE_SFT_CONVERSATIONS_DATASET]
                      [--enable_sft_conversations_dataset_v2 ENABLE_SFT_CONVERSATIONS_DATASET_V2] [--enable_sft_conversations_dataset_v3 ENABLE_SFT_CONVERSATIONS_DATASET_V3]
                      [--enable_weighted_dataset_v2 ENABLE_WEIGHTED_DATASET_V2] [--IGNORE_INDEX IGNORE_INDEX] [--enable_flash_attn_models ENABLE_FLASH_ATTN_MODELS]
aquila_chat.py: error: unrecognized arguments: --local-rank=0 wv]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 567151) of binary: /home/good/anaconda3/envs/flagai/bin/python
Traceback (most recent call last):
  File "/home/good/anaconda3/envs/flagai/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/good/anaconda3/envs/flagai/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/good/anaconda3/envs/flagai/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/good/anaconda3/envs/flagai/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/good/anaconda3/envs/flagai/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/good/anaconda3/envs/flagai/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/good/anaconda3/envs/flagai/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/good/anaconda3/envs/flagai/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/data/good/FlagAI/examples/Aquila/Aquila-chat/aquila_chat.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-10_13:32:23
  host      : good-Precision-3650-Tower
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 567151)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as T5/AltCLIP, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

conda create --name flagai python=3.10
conda activate flagai

git clone https://github.com/FlagAI-Open/FlagAI.git
pip install .
pip install jsonlines
pip install bminf

python setup.py install

python examples/Aquila/Aquila-chat/generate_chat.py

Expected behavior

The script should run without errors.

ImGoodBai avatar Jun 10 '23 06:06 ImGoodBai

Can you make sure you've used the latest version (1.7.0) of FlagAI?

BAAI-OpenPlatform avatar Jun 10 '23 07:06 BAAI-OpenPlatform

Can you make sure you've used the latest version (1.7.0) of FlagAI?

I pulled the code fresh from GitHub today.

ImGoodBai avatar Jun 10 '23 08:06 ImGoodBai

The problem still exists with the latest 1.7.1 code.

ImGoodBai avatar Jun 12 '23 01:06 ImGoodBai

Please refer to this shell script for how to launch training; aquila_chat.py is invoked by it: https://github.com/FlagAI-Open/FlagAI/blob/master/examples/Aquila/README.md#step-2-%E5%90%AF%E5%8A%A8%E8%AE%AD%E7%BB%83start-training

ftgreat avatar Jun 12 '23 02:06 ftgreat

python -x examples/Aquila/Aquila-chat/aquila_chat.py

`python -x examples/Aquila/Aquila-chat/aquila_chat.py` is not how the script should be invoked. Please follow this tutorial: https://github.com/FlagAI-Open/FlagAI/blob/master/examples/Aquila/README.md#step-2-%E5%90%AF%E5%8A%A8%E8%AE%AD%E7%BB%83start-training

ftgreat avatar Jun 13 '23 09:06 ftgreat

same issue here

aquila_chat.py: error: unrecognized arguments: --local-rank=0 wv]

I guess it is caused by a PyTorch version difference. From the log, the launcher passed --local-rank to aquila_chat.py, but the code at https://github.com/FlagAI-Open/FlagAI/blob/36c083fd44a530a1b392d207b408eafc656bbc52/flagai/env_args.py#L149 expects the parameter to be passed as --local_rank.
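For anyone hitting this before a fix lands: the usage output above is argparse-generated, so one workaround is to register both spellings as aliases of a single option and fall back to the LOCAL_RANK environment variable (which is what the deprecation warning recommends and what torchrun sets). A minimal standalone sketch, not FlagAI's actual code:

```python
import argparse
import os

parser = argparse.ArgumentParser()
# torch < 2.0 launchers pass --local_rank; torch >= 2.0 passes --local-rank.
# Registering both spellings as aliases of one destination accepts either,
# and the default falls back to the LOCAL_RANK env var set by torchrun.
parser.add_argument(
    "--local_rank", "--local-rank",
    type=int,
    dest="local_rank",
    default=int(os.environ.get("LOCAL_RANK", 0)),
)
args, _unknown = parser.parse_known_args()
print(f"local_rank = {args.local_rank}")
```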

@ftgreat

csyourui avatar Jun 14 '23 07:06 csyourui

Please share your hostfile.

ftgreat avatar Jun 14 '23 07:06 ftgreat

hostfile:

172.17.0.6 slots=8

csyourui avatar Jun 15 '23 02:06 csyourui

Closing this for now; feel free to reopen the issue if the problem persists. Thanks.

ftgreat avatar Jun 22 '23 11:06 ftgreat

Did you solve it? After downgrading the torch version, the local-rank error goes away, but the wv error is still there.
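The leftover `wv]` error is consistent with the relaunch command in the original log: `--lora_target_modules ['wq', 'wv']` is re-serialized without quoting, so the space inside the value splits it into two argv tokens and argparse reports the stray second one. A small standalone illustration of the split and a shlex.quote fix (illustrative only, not FlagAI's launcher code):

```python
import shlex

# The relaunch line in the log passes the list value without quoting:
#   ... --lora_target_modules ['wq', 'wv'] ...
# The embedded space splits the value into two argv tokens, so argparse
# reports the stray second token as an unrecognized argument.
value = "['wq', 'wv']"
print(shlex.split(f"--lora_target_modules {value}"))
# -> ['--lora_target_modules', '[wq,', 'wv]']

# Quoting the value before re-serializing the command keeps it one token.
print(shlex.split(f"--lora_target_modules {shlex.quote(value)}"))
# -> ['--lora_target_modules', "['wq', 'wv']"]
```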

ylhou avatar Jun 28 '23 15:06 ylhou

Did you solve it? After downgrading the torch version, the local-rank error goes away, but the wv error is still there.

Downgrade to which version? I still have the problem with torch 1.13.1.

vincentami avatar Oct 18 '23 01:10 vincentami