FlagAI
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2)
System Info
python -x examples/Aquila/Aquila-chat/aquila_chat.py
[2023-06-10 13:32:17,187] [INFO] [logger.py:85:log_dist] [Rank -1] Unsupported bmtrain
args list is ['--IGNORE_INDEX', '-100', '--adam_beta1', '0.9', '--adam_beta2', '0.999', '--already_fp16', 'False', '--batch_size', '1', '--bmt_async_load', 'False', '--bmt_cpu_offload', 'True', '--bmt_loss_scale', '1024.0', '--bmt_loss_scale_steps', '1024', '--bmt_lr_decay_style', 'cosine', '--bmt_pre_load', 'False', '--checkpoint_activations', 'False', '--clip_grad', '1.0', '--deepspeed_activation_checkpointing', 'False', '--deepspeed_config', './deepspeed.json', '--enable_flash_attn_models', 'False', '--enable_sft_conversations_dataset', 'False', '--enable_sft_conversations_dataset_v2', 'False', '--enable_sft_conversations_dataset_v3', 'False', '--enable_sft_dataset', 'False', '--enable_sft_dataset_dir', 'None', '--enable_sft_dataset_file', 'None', '--enable_sft_dataset_jsonl', 'False', '--enable_sft_dataset_text', 'False', '--enable_sft_dataset_val_file', 'None', '--enable_weighted_dataset_v2', 'False', '--env_type', 'bmtrain', '--epochs', '100', '--eps', '1e-08', '--eval_interval', '5000', '--experiment_name', 'test_experiment', '--fp16', 'True', '--gradient_accumulation_steps', '1', '--load_dir', 'None', '--load_optim', 'False', '--load_rng', 'False', '--load_type', 'latest', '--log_interval', '10', '--lora', 'False', '--lora_alpha', '16', '--lora_dropout', '0.05', '--lora_r', '8', '--lora_target_modules', "['wq', 'wv']", '--lr', '0.0002', '--model_name', 'test_model', '--model_parallel_size', '1', '--num_checkpoints', '1', '--pre_load_dir', 'None', '--pytorch_device', 'cuda', '--resume_dataset', 'False', '--save_dir', 'checkpoints_aquila', '--save_interval', '5000', '--save_optim', 'False', '--save_rng', 'False', '--seed', '1234', '--shuffle_dataset', 'True', '--skip_iters', '0', '--tensorboard', 'False', '--tensorboard_dir', 'tensorboard_summary', '--training_script', '/data/good/FlagAI/examples/Aquila/Aquila-chat/aquila_chat.py', '--wandb', 'True', '--wandb_dir', './wandb', '--wandb_key', '3e614eb678063929b16c9b9aec557e2949d5a814', '--warm_up', '0.1', 
'--warm_up_iters', '0', '--warmup_start_lr', '0.0', '--weight_decay', '0.001', '--yaml_config', 'None']
[2023-06-10 13:32:17,559] [INFO] [logger.py:85:log_dist] [Rank -1] not_call_launch: False
[2023-06-10 13:32:17,559] [INFO] [logger.py:85:log_dist] [Rank -1] Unable to find hostfile, will proceed with training with local resources only.
[2023-06-10 13:32:17,559] [INFO] [logger.py:85:log_dist] [Rank -1] export NUM_NODES=1; export GPUS_PER_NODE=1; /home/good/anaconda3/envs/flagai/bin/python -m torch.distributed.launch --nproc_per_node 1 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 17750 /data/good/FlagAI/examples/Aquila/Aquila-chat/aquila_chat.py --IGNORE_INDEX -100 --adam_beta1 0.9 --adam_beta2 0.999 --already_fp16 False --batch_size 1 --bmt_async_load False --bmt_cpu_offload True --bmt_loss_scale 1024.0 --bmt_loss_scale_steps 1024 --bmt_lr_decay_style cosine --bmt_pre_load False --checkpoint_activations False --clip_grad 1.0 --deepspeed_activation_checkpointing False --deepspeed_config ./deepspeed.json --enable_flash_attn_models False --enable_sft_conversations_dataset False --enable_sft_conversations_dataset_v2 False --enable_sft_conversations_dataset_v3 False --enable_sft_dataset False --enable_sft_dataset_dir None --enable_sft_dataset_file None --enable_sft_dataset_jsonl False --enable_sft_dataset_text False --enable_sft_dataset_val_file None --enable_weighted_dataset_v2 False --env_type bmtrain --epochs 100 --eps 1e-08 --eval_interval 5000 --experiment_name test_experiment --fp16 True --gradient_accumulation_steps 1 --load_dir None --load_optim False --load_rng False --load_type latest --log_interval 10 --lora False --lora_alpha 16 --lora_dropout 0.05 --lora_r 8 --lora_target_modules ['wq', 'wv'] --lr 0.0002 --model_name test_model --model_parallel_size 1 --num_checkpoints 1 --pre_load_dir None --pytorch_device cuda --resume_dataset False --save_dir checkpoints_aquila --save_interval 5000 --save_optim False --save_rng False --seed 1234 --shuffle_dataset True --skip_iters 0 --tensorboard False --tensorboard_dir tensorboard_summary --training_script /data/good/FlagAI/examples/Aquila/Aquila-chat/aquila_chat.py --wandb True --wandb_dir ./wandb --wandb_key 3e614eb678063929b16c9b9aec557e2949d5a814 --warm_up 0.1 --warm_up_iters 0 --warmup_start_lr 0.0 --weight_decay 0.001 
--yaml_config None --not_call_launch
/home/good/anaconda3/envs/flagai/lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
[2023-06-10 13:32:19,445] [INFO] [logger.py:85:log_dist] [Rank -1] Unsupported bmtrain
usage: aquila_chat.py [-h] [--env_type ENV_TYPE] [--experiment_name EXPERIMENT_NAME] [--model_name MODEL_NAME] [--epochs EPOCHS] [--batch_size BATCH_SIZE] [--lr LR]
[--warmup_start_lr WARMUP_START_LR] [--seed SEED] [--fp16 FP16] [--pytorch_device PYTORCH_DEVICE] [--clip_grad CLIP_GRAD]
[--checkpoint_activations CHECKPOINT_ACTIVATIONS] [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS] [--weight_decay WEIGHT_DECAY] [--eps EPS]
[--warm_up WARM_UP] [--warm_up_iters WARM_UP_ITERS] [--skip_iters SKIP_ITERS] [--log_interval LOG_INTERVAL] [--eval_interval EVAL_INTERVAL]
[--save_interval SAVE_INTERVAL] [--save_dir SAVE_DIR] [--load_dir LOAD_DIR] [--save_optim SAVE_OPTIM] [--save_rng SAVE_RNG] [--load_type LOAD_TYPE]
[--load_optim LOAD_OPTIM] [--load_rng LOAD_RNG] [--tensorboard TENSORBOARD] [--tensorboard_dir TENSORBOARD_DIR]
[--deepspeed_activation_checkpointing DEEPSPEED_ACTIVATION_CHECKPOINTING] [--num_checkpoints NUM_CHECKPOINTS] [--deepspeed_config DEEPSPEED_CONFIG]
[--model_parallel_size MODEL_PARALLEL_SIZE] [--training_script TRAINING_SCRIPT] [--hostfile HOSTFILE] [--master_ip MASTER_IP] [--master_port MASTER_PORT]
[--num_nodes NUM_NODES] [--num_gpus NUM_GPUS] [--not_call_launch] [--local_rank LOCAL_RANK] [--wandb WANDB] [--wandb_dir WANDB_DIR]
[--wandb_key WANDB_KEY] [--already_fp16 ALREADY_FP16] [--resume_dataset RESUME_DATASET] [--shuffle_dataset SHUFFLE_DATASET] [--adam_beta1 ADAM_BETA1]
[--adam_beta2 ADAM_BETA2] [--bmt_cpu_offload BMT_CPU_OFFLOAD] [--bmt_lr_decay_style BMT_LR_DECAY_STYLE] [--bmt_loss_scale BMT_LOSS_SCALE]
[--bmt_loss_scale_steps BMT_LOSS_SCALE_STEPS] [--lora LORA] [--lora_r LORA_R] [--lora_alpha LORA_ALPHA] [--lora_dropout LORA_DROPOUT]
[--lora_target_modules LORA_TARGET_MODULES] [--yaml_config YAML_CONFIG] [--bmt_async_load BMT_ASYNC_LOAD] [--bmt_pre_load BMT_PRE_LOAD]
[--pre_load_dir PRE_LOAD_DIR] [--enable_sft_dataset_dir ENABLE_SFT_DATASET_DIR] [--enable_sft_dataset_file ENABLE_SFT_DATASET_FILE]
[--enable_sft_dataset_val_file ENABLE_SFT_DATASET_VAL_FILE] [--enable_sft_dataset ENABLE_SFT_DATASET] [--enable_sft_dataset_text ENABLE_SFT_DATASET_TEXT]
[--enable_sft_dataset_jsonl ENABLE_SFT_DATASET_JSONL] [--enable_sft_conversations_dataset ENABLE_SFT_CONVERSATIONS_DATASET]
[--enable_sft_conversations_dataset_v2 ENABLE_SFT_CONVERSATIONS_DATASET_V2] [--enable_sft_conversations_dataset_v3 ENABLE_SFT_CONVERSATIONS_DATASET_V3]
[--enable_weighted_dataset_v2 ENABLE_WEIGHTED_DATASET_V2] [--IGNORE_INDEX IGNORE_INDEX] [--enable_flash_attn_models ENABLE_FLASH_ATTN_MODELS]
aquila_chat.py: error: unrecognized arguments: --local-rank=0 wv]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 567151) of binary: /home/good/anaconda3/envs/flagai/bin/python
Traceback (most recent call last):
File "/home/good/anaconda3/envs/flagai/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/good/anaconda3/envs/flagai/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/good/anaconda3/envs/flagai/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/home/good/anaconda3/envs/flagai/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/home/good/anaconda3/envs/flagai/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/home/good/anaconda3/envs/flagai/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/good/anaconda3/envs/flagai/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/good/anaconda3/envs/flagai/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/data/good/FlagAI/examples/Aquila/Aquila-chat/aquila_chat.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-06-10_13:32:23
host : good-Precision-3650-Tower
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 567151)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as T5/AltCLIP, ...)
- [ ] My own task or dataset (give details below)
Reproduction
conda create --name flagai python=3.10
conda activate flagai
git clone https://github.com/FlagAI-Open/FlagAI.git
pip install .
pip install jsonlines
pip install bminf
python setup.py install
python examples/Aquila/Aquila-chat/generate_chat.py
Expected behavior
OK
Can you make sure you've used the latest version (1.7.0) of FlagAI?
I pulled the code from GitHub today; the problem still exists in the latest 1.7.1 code.
For how to launch, please refer to this shell script; aquila_chat.py is the script being invoked. https://github.com/FlagAI-Open/FlagAI/blob/master/examples/Aquila/README.md#step-2-%E5%90%AF%E5%8A%A8%E8%AE%AD%E7%BB%83start-training
python -x examples/Aquila/Aquila-chat/aquila_chat.py
`python -x examples/Aquila/Aquila-chat/aquila_chat.py` is not how it should be invoked. You need to follow this tutorial: https://github.com/FlagAI-Open/FlagAI/blob/master/examples/Aquila/README.md#step-2-%E5%90%AF%E5%8A%A8%E8%AE%AD%E7%BB%83start-training
same issue here
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
aquila_chat.py: error: unrecognized arguments: --local-rank=0 wv]
I guess it is caused by a version difference in PyTorch. From the log it seems `--local-rank` was passed to aquila_chat.py, but the code at
https://github.com/FlagAI-Open/FlagAI/blob/36c083fd44a530a1b392d207b408eafc656bbc52/flagai/env_args.py#L149
shows the parameter is expected to be passed as `--local_rank`.
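As a local workaround (a sketch, not the official FlagAI fix), the argument parser can register both spellings as aliases for the same destination, so the script accepts whichever form the launcher passes; alternatively, the default can be read from the `LOCAL_RANK` environment variable as the deprecation warning suggests:

```python
import argparse
import os

parser = argparse.ArgumentParser()
# Newer torch.distributed launchers pass --local-rank=N, older ones --local_rank N.
# Registering both option strings maps either spelling onto the same dest, and
# the default falls back to the LOCAL_RANK env var set by torchrun.
parser.add_argument("--local-rank", "--local_rank", type=int,
                    dest="local_rank",
                    default=int(os.environ.get("LOCAL_RANK", 0)))

# parse_known_args ignores the other launcher-injected flags
args, _ = parser.parse_known_args(["--local-rank=0"])
print(args.local_rank)
```

With this, both `--local-rank=0` (new launcher) and `--local_rank 0` (old launcher) parse into `args.local_rank`.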
@ftgreat
Please share your hostfile.
hostfile:
172.17.0.6 slots=8
Closing for now; please reopen this issue if the problem persists. Thanks.
Did you get it sorted? After downgrading the torch version, the local-rank error is gone, but there's still the `wv]` one.
Downgrade to which version? I'm using 1.13.1 and still have the problem.
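As for the leftover `wv]`: the launcher re-invokes the training script by building a single shell command string (see the relaunch command in the log above), and the unquoted list value `['wq', 'wv']` for `--lora_target_modules` gets split by the shell into two tokens, the second of which is exactly the stray `wv]` in the error. A minimal sketch of the breakage and a `shlex`-based fix (an illustration, not the actual FlagAI launcher code):

```python
import shlex

# The shell splits the unquoted list value into two tokens and consumes the
# single quotes, so the second token becomes "wv]" -- the stray argument
# reported by argparse.
raw = "--lora_target_modules ['wq', 'wv']"
print(shlex.split(raw))     # ['--lora_target_modules', '[wq,', 'wv]']

# Quoting the value before interpolating it into the launch command keeps
# the whole list together as a single argv entry.
value = "['wq', 'wv']"
quoted = "--lora_target_modules " + shlex.quote(value)
print(shlex.split(quoted))  # ['--lora_target_modules', "['wq', 'wv']"]
```

So quoting the interpolated value (or switching the launcher to pass an argv list instead of one command string) should remove the `wv]` error independently of the torch version.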