DeepSpeed training error
GPU: one machine with 8x A800 cards, CUDA 11.7.

Training launch parameters:

lr=2e-4
lora_rank=8
lora_alpha=32
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05
pretrained_model=/workspace/fumengen/works/Chinese-LLaMA-Alpaca/models/chinese_llama_13b
chinese_tokenizer_path=/workspace/fumengen/works/Chinese-LLaMA-Alpaca/models/chinese_llama_13b
dataset_dir=/workspace/fumengen/works/Chinese-LLaMA-Alpaca/data/cn_pretrain_data
data_cache=/workspace/fumengen/works/Chinese-LLaMA-Alpaca/data/data_tmp
per_device_train_batch_size=2
per_device_eval_batch_size=2
training_steps=100
gradient_accumulation_steps=8
output_dir=/workspace/fumengen/works/Chinese-LLaMA-Alpaca/outputs
deepspeed_config_file=/workspace/fumengen/works/Chinese-LLaMA-Alpaca/ds_config.json
torchrun --nnodes 1 --nproc_per_node 8 scripts/run_clm_pt_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --data_cache_dir ${data_cache} \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --seed 3407 \
    --fp16 \
    --max_steps ${training_steps} \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --save_steps 500 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --block_size 512 \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --modules_to_save ${modules_to_save} \
    --lora_dropout ${lora_dropout} \
    --torch_dtype float16 \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False
DeepSpeed config (ds_config.json):
{ "bfloat16": { "enabled": "auto" }, "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" } }, "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "allgather_partitions": true, "allgather_bucket_size": 2e8, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 2e8, "contiguous_gradients": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "steps_per_print": 1e5 }
Error log:
05/17/2023 09:23:35 - INFO - main - training datasets-test_1187 has been loaded from disk
05/17/2023 09:23:35 - INFO - datasets.arrow_dataset - Caching indices mapping at /workspace/fumengen/works/Chinese-LLaMA-Alpaca/data/data_tmp/test_898/train/cache-edb7bf906025b5f5.arrow
05/17/2023 09:23:35 - INFO - datasets.arrow_dataset - Caching indices mapping at /workspace/fumengen/works/Chinese-LLaMA-Alpaca/data/data_tmp/test_898/train/cache-5cacbc341600c89e.arrow
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 122 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 124 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 127 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 1 (pid: 123) of binary: /workspace/fumengen/vir_fme/bin/python3.10
Traceback (most recent call last):
File "/workspace/fumengen/vir_fme/bin/torchrun", line 8, in
scripts/run_clm_pt_with_peft.py FAILED
Failures:
[1]:
  time       : 2023-05-17_09:23:42
  host       : 7b4985eaff35
  rank       : 3 (local_rank: 3)
  exitcode   : -7 (pid: 125)
  error_file : <N/A>
  traceback  : Signal 7 (SIGBUS) received by PID 125
[2]:
  time       : 2023-05-17_09:23:42
  host       : 7b4985eaff35
  rank       : 6 (local_rank: 6)
  exitcode   : -7 (pid: 128)
  error_file : <N/A>
  traceback  : Signal 7 (SIGBUS) received by PID 128
[3]:
  time       : 2023-05-17_09:23:42
  host       : 7b4985eaff35
  rank       : 7 (local_rank: 7)
  exitcode   : -7 (pid: 129)
  error_file : <N/A>
  traceback  : Signal 7 (SIGBUS) received by PID 129
Try deleting the cache directory /workspace/fumengen/works/Chinese-LLaMA-Alpaca/data/data_tmp/test_898 and letting the program regenerate it.
Also keep an eye on memory usage; the cause may be insufficient RAM.
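A minimal way to act on both suggestions (paths copied from the log above; run the monitor in a second shell). Signal 7 (SIGBUS) inside a container is often a symptom of /dev/shm filling up, so watching it alongside free RAM is cheap insurance (an assumption on my part, not confirmed at this point in the thread):

# 1) Drop the suspect dataset cache so it is rebuilt from scratch
rm -rf /workspace/fumengen/works/Chinese-LLaMA-Alpaca/data/data_tmp/test_898

# 2) Watch RAM and shared memory while the training job runs
watch -n 2 'free -h; df -h /dev/shm'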
@airaria Regenerating the cache produced the same error.
It's not clear yet where the problem lies; I suggest pinpointing it experimentally. For example, does it still fail with a small dataset, or on a single GPU?
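A concrete single-GPU smoke test along those lines could look like the sketch below. It reuses the variables from the launch script above, shrinking only the process count and the step budget; in practice the remaining flags from the full command should be kept as well:

# Single-GPU smoke test: if this passes, the failure is likely
# multi-process / shared-memory related rather than in the data or model
training_steps=10   # short run instead of 100
torchrun --nnodes 1 --nproc_per_node 1 scripts/run_clm_pt_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --data_cache_dir ${data_cache} \
    --do_train \
    --max_steps ${training_steps} \
    --output_dir ${output_dir}
    # ...plus the remaining flags from the full command above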
Follow iMountTai's suggestion above: keep an eye on the memory usage; this may be a case of insufficient RAM.
@iMountTai Here is the GPU and CPU status while the job is running: [screenshot omitted]
1TB of RAM really shouldn't be insufficient. I suggest you first run the 7B model training exactly as set up in our scripts, and only then change the settings to your liking.
@iMountTai This run uses the chinese-llama-7b model I downloaded and merged. The script is unchanged; I only used a single GPU: torchrun --nnodes 1 --nproc_per_node 1. The error is below. With multiple GPUs, the error is still the same as above. Error log:
05/17/2023 17:02:39 - INFO - datasets.arrow_dataset - Process #7 will write at /workspace/fumengen/works/Chinese-LLaMA-Alpaca/data/cache/cache_7b/test_245_text/tokenized_00007_of_00008.arrow
Traceback (most recent call last):
File "/workspace/fumengen/works/Chinese-LLaMA-Alpaca/scripts/run_clm_pt_with_peft.py", line 461, in main
processed_dataset = datasets.load_from_disk(cache_path, keep_in_memory=False)
File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/datasets/load.py", line 1894, in load_from_disk
raise FileNotFoundError(
FileNotFoundError: Directory /workspace/fumengen/works/Chinese-LLaMA-Alpaca/data/cache/cache_7b/test_245 is neither a Dataset directory nor a DatasetDict directory.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/workspace/fumengen/works/Chinese-LLaMA-Alpaca/scripts/run_clm_pt_with_peft.py", line 622, in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 13913) of binary: /workspace/fumengen/vir_fme/bin/python3.10
Traceback (most recent call last):
File "/workspace/fumengen/vir_fme/bin/torchrun", line 8, in
run_clm_pt_with_peft.py FAILED
Failures: <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time       : 2023-05-17_17:02:43
  host       : 7b4985eaff35
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 13913)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Is data_cache_dir different between the two runs? Also, if an error occurs while generating the data cache, delete the broken cache first and then regenerate a new one.
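Following that advice for the 7B run would mean clearing the cache root named in the traceback (path copied from the log above; the wildcard also covers the partially written test_245_text directory):

# Remove the broken 7B data cache so load_from_disk rebuilds it
rm -rf /workspace/fumengen/works/Chinese-LLaMA-Alpaca/data/cache/cache_7b/test_245*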
I shut down the Docker container and started it again, and the problem went away.
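For what it's worth, a restart clearing the problem is consistent with /dev/shm having filled up: multi-worker PyTorch DataLoaders pass tensors through shared memory, and an exhausted /dev/shm manifests exactly as SIGBUS. A hedged longer-term fix is to give the container ample shared memory up front (--shm-size is a standard Docker flag; the 64g value and the image name are placeholders):

# Start the container with a larger /dev/shm for multi-worker data loading
docker run --shm-size=64g <your_image>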
Hey, with batch size 1, how much GPU memory does a single card use? Can I get this running on 8x A100 (40GB)?
@dhhcj Use DeepSpeed stage=2 to trade CPU RAM for GPU memory (optimizer offload); check whether your machine has enough RAM.
How do I adjust the stage? I didn't find a related parameter in the script or the config file.
There's a tutorial in the official DeepSpeed documentation; the config you already have works as-is.
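To be precise, the stage is not a flag of the training script; it lives in the zero_optimization block of the ds_config.json posted above. A quick way to inspect or confirm it (assuming jq is installed):

# Read the ZeRO stage straight out of the DeepSpeed config
jq '.zero_optimization.stage' ${deepspeed_config_file}   # -> 2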
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.
Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.