
[BUG]

Open NingBoHao opened this issue 2 years ago • 1 comment

```
(lmflow) PS E:\LMFlow-main\LMFlow-main> bash ./scripts/run_finetune.sh
[2023-04-24 19:29:27,417] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-24 19:29:27,440] [INFO] [runner.py:540:main] cmd = D:\UserSoftware\Anaconda3\envs\lmflow\python.exe -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None examples/finetune.py --model_name_or_path gpt2 --dataset_path E:/LMFlow-main/LMFlow-main/data/alpaca/train --output_dir E:/LMFlow-main/LMFlow-main/output_models/finetune --overwrite_output_dir --num_train_epochs 0.01 --learning_rate 2e-5 --block_size 512 --per_device_train_batch_size 1 --deepspeed configs/ds_config_zero3.json --bf16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2023-04-24 19:29:29,047] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-24 19:29:29,047] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-24 19:29:29,047] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-24 19:29:29,047] [INFO] [launch.py:247:main] dist_world_size=1
[2023-04-24 19:29:29,047] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
Traceback (most recent call last):
  File "E:\LMFlow-main\LMFlow-main\examples\finetune.py", line 60, in <module>
    main()
  File "E:\LMFlow-main\LMFlow-main\examples\finetune.py", line 43, in main
    model_args, data_args, pipeline_args = parser.parse_args_into_dataclasses()
  File "D:\UserSoftware\Anaconda3\envs\lmflow\lib\site-packages\transformers\hf_argparser.py", line 332, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 110, in __init__
  File "D:\UserSoftware\Anaconda3\envs\lmflow\lib\site-packages\transformers\training_args.py", line 1222, in __post_init__
    raise ValueError(
ValueError: Your setup doesn't support bf16/gpu. You need torch>=1.10, using Ampere GPU with cuda>=11.0
[2023-04-24 19:29:32,060] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 16984
[2023-04-24 19:29:32,067] [ERROR] [launch.py:434:sigkill_handler] ['D:\UserSoftware\Anaconda3\envs\lmflow\python.exe', '-u', 'examples/finetune.py', '--local_rank=0', '--model_name_or_path', 'gpt2', '--dataset_path', 'E:/LMFlow-main/LMFlow-main/data/alpaca/train', '--output_dir', 'E:/LMFlow-main/LMFlow-main/output_models/finetune', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--block_size', '512', '--per_device_train_batch_size', '1', '--deepspeed', 'configs/ds_config_zero3.json', '--bf16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = 1
```

NingBoHao avatar Apr 24 '23 11:04 NingBoHao

The error is `ValueError: Your setup doesn't support bf16/gpu. You need torch>=1.10, using Ampere GPU with cuda>=11.0`: bf16 training requires torch >= 1.10 and an Ampere (or newer) GPU with CUDA >= 11.0, and your setup does not meet these requirements.

It could be resolved by passing `--fp16` instead of `--bf16`.
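A sketch of the change (assuming the flag is set in `scripts/run_finetune.sh`, as the log above suggests; the one-liner uses GNU `sed` and a mock script so it is self-contained, but the same substitution applies to the real file):

```shell
# Build a mock run_finetune.sh to demonstrate the flag swap.
# On a real checkout you would instead run:
#   sed -i 's/--bf16/--fp16/' scripts/run_finetune.sh
mkdir -p demo && cat > demo/run_finetune.sh <<'EOF'
deepspeed examples/finetune.py \
  --model_name_or_path gpt2 \
  --bf16 \
  --do_train
EOF

# Replace --bf16 with --fp16 in place.
sed -i 's/--bf16/--fp16/' demo/run_finetune.sh

# Confirm the script now requests fp16 mixed precision.
grep -- '--fp16' demo/run_finetune.sh
```

Note that fp16 mixed precision has a narrower numeric range than bf16, so if you see loss overflow warnings during training, you may also need to adjust the loss-scaling settings in the DeepSpeed config.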

shizhediao avatar Apr 24 '23 14:04 shizhediao

This issue has been marked as stale because it has not had recent activity. If you think this still needs to be addressed, please feel free to reopen this issue. Thanks.

shizhediao avatar May 15 '23 00:05 shizhediao