
DeepSpeed training error

Open ccdf1137 opened this issue 2 years ago • 14 comments

GPU: 1 machine with 8x A800, CUDA 11.7. Training launch parameters:

lr=2e-4
lora_rank=8
lora_alpha=32
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05
pretrained_model=/workspace/fumengen/works/Chinese-LLaMA-Alpaca/models/chinese_llama_13b
chinese_tokenizer_path=/workspace/fumengen/works/Chinese-LLaMA-Alpaca/models/chinese_llama_13b
dataset_dir=/workspace/fumengen/works/Chinese-LLaMA-Alpaca/data/cn_pretrain_data
data_cache=/workspace/fumengen/works/Chinese-LLaMA-Alpaca/data/data_tmp
per_device_train_batch_size=2
per_device_eval_batch_size=2
training_steps=100
gradient_accumulation_steps=8
output_dir=/workspace/fumengen/works/Chinese-LLaMA-Alpaca/outputs
deepspeed_config_file=/workspace/fumengen/works/Chinese-LLaMA-Alpaca/ds_config.json

torchrun --nnodes 1 --nproc_per_node 8 scripts/run_clm_pt_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --data_cache_dir ${data_cache} \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --seed 3407 \
    --fp16 \
    --max_steps ${training_steps} \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --save_steps 500 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --block_size 512 \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --modules_to_save ${modules_to_save} \
    --lora_dropout ${lora_dropout} \
    --torch_dtype float16 \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False

DeepSpeed config parameters:

{ "bfloat16": { "enabled": "auto" }, "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "scheduler": { "type": "WarmupLR", "params": { "warmup_min_lr": "auto", "warmup_max_lr": "auto", "warmup_num_steps": "auto" } }, "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "allgather_partitions": true, "allgather_bucket_size": 2e8, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 2e8, "contiguous_gradients": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "steps_per_print": 1e5 }

Error log:

05/17/2023 09:23:35 - INFO - __main__ - training datasets-test_1187 has been loaded from disk
05/17/2023 09:23:35 - INFO - datasets.arrow_dataset - Caching indices mapping at /workspace/fumengen/works/Chinese-LLaMA-Alpaca/data/data_tmp/test_898/train/cache-edb7bf906025b5f5.arrow
05/17/2023 09:23:35 - INFO - datasets.arrow_dataset - Caching indices mapping at /workspace/fumengen/works/Chinese-LLaMA-Alpaca/data/data_tmp/test_898/train/cache-5cacbc341600c89e.arrow
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 122 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 124 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 127 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 1 (pid: 123) of binary: /workspace/fumengen/vir_fme/bin/python3.10
Traceback (most recent call last):
  File "/workspace/fumengen/vir_fme/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts/run_clm_pt_with_peft.py FAILED

Failures:
[1]:
  time      : 2023-05-17_09:23:42
  host      : 7b4985eaff35
  rank      : 3 (local_rank: 3)
  exitcode  : -7 (pid: 125)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 125
[2]:
  time      : 2023-05-17_09:23:42
  host      : 7b4985eaff35
  rank      : 6 (local_rank: 6)
  exitcode  : -7 (pid: 128)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 128
[3]:
  time      : 2023-05-17_09:23:42
  host      : 7b4985eaff35
  rank      : 7 (local_rank: 7)
  exitcode  : -7 (pid: 129)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 129

ccdf1137 avatar May 17 '23 01:05 ccdf1137

Try deleting the cache at /workspace/fumengen/works/Chinese-LLaMA-Alpaca/data/data_tmp/test_898 and letting the script regenerate it.
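
For reference, a one-line shell sketch of that suggestion (the path is the one from the log above):

# remove the suspect dataset cache so the script rebuilds it on the next run
rm -rf /workspace/fumengen/works/Chinese-LLaMA-Alpaca/data/data_tmp/test_898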

airaria avatar May 17 '23 02:05 airaria

Keep an eye on memory usage; you may be running out of RAM.
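
One simple way to do that while training runs (standard Linux/NVIDIA tools, nothing specific to this repo):

# refresh system RAM usage every second
watch -n 1 free -h

# refresh GPU memory usage every second
watch -n 1 nvidia-smi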

iMountTai avatar May 17 '23 02:05 iMountTai

@airaria Regenerating the cache gives the same error.

ccdf1137 avatar May 17 '23 02:05 ccdf1137

@airaria Regenerating the cache gives the same error.

It's not clear yet where the problem is; I suggest pinning it down with experiments. For example, does it still fail with a small dataset on a single GPU?
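
A minimal sketch of that single-GPU test, assuming the launch command above with only the process count changed:

# same flags as the 8-GPU command, but one process on one card
torchrun --nnodes 1 --nproc_per_node 1 scripts/run_clm_pt_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    ...  # remaining flags unchanged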

airaria avatar May 17 '23 02:05 airaria

@airaria Regenerating the cache gives the same error.

See iMountTai's suggestion:

Keep an eye on memory usage; you may be running out of RAM.

airaria avatar May 17 '23 02:05 airaria

@iMountTai
[screenshots] Here is the GPU and CPU status at runtime.

ccdf1137 avatar May 17 '23 06:05 ccdf1137

With 1 TB of RAM you really shouldn't be running out of memory. I suggest first running the 7B model training exactly as in our script, and only then changing the settings to your liking.

iMountTai avatar May 17 '23 06:05 iMountTai

@iMountTai This run used the chinese-llama-7b model that I downloaded and merged. The script is unchanged except for using a single GPU (torchrun --nnodes 1 --nproc_per_node 1); the error is below. With multiple GPUs the error is still the same as above. Error:

05/17/2023 17:02:39 - INFO - datasets.arrow_dataset - Process #7 will write at /workspace/fumengen/works/Chinese-LLaMA-Alpaca/data/cache/cache_7b/test_245_text/tokenized_00007_of_00008.arrow
Traceback (most recent call last):
  File "/workspace/fumengen/works/Chinese-LLaMA-Alpaca/scripts/run_clm_pt_with_peft.py", line 461, in main
    processed_dataset = datasets.load_from_disk(cache_path, keep_in_memory=False)
  File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/datasets/load.py", line 1894, in load_from_disk
    raise FileNotFoundError(
FileNotFoundError: Directory /workspace/fumengen/works/Chinese-LLaMA-Alpaca/data/cache/cache_7b/test_245 is neither a Dataset directory nor a DatasetDict directory.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "/workspace/fumengen/works/Chinese-LLaMA-Alpaca/scripts/run_clm_pt_with_peft.py", line 622, in main() File "/workspace/fumengen/works/Chinese-LLaMA-Alpaca/scripts/run_clm_pt_with_peft.py", line 468, in main tokenized_dataset = raw_dataset.map( File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/datasets/dataset_dict.py", line 851, in map { File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/datasets/dataset_dict.py", line 852, in k: dataset.map( File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 578, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 543, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3156, in map with Pool(len(kwargs_per_job)) as pool: File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/multiprocess/context.py", line 119, in Pool return Pool(processes, initializer, initargs, maxtasksperchild, File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/multiprocess/pool.py", line 191, in init self._setup_queues() File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/multiprocess/pool.py", line 346, in _setup_queues self._inqueue = self._ctx.SimpleQueue() File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/multiprocess/context.py", line 113, in SimpleQueue return SimpleQueue(ctx=self.get_context()) File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/multiprocess/queues.py", line 344, in init self._rlock = ctx.Lock() File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/multiprocess/context.py", line 68, in Lock return Lock(ctx=self.get_context()) File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/multiprocess/synchronize.py", line 168, in init SemLock.init(self, SEMAPHORE, 1, 1, ctx=ctx) File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/multiprocess/synchronize.py", line 63, in init sl = self._semlock = _multiprocessing.SemLock( OSError: [Errno 28] No space left on device

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 13913) of binary: /workspace/fumengen/vir_fme/bin/python3.10
Traceback (most recent call last):
  File "/workspace/fumengen/vir_fme/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/workspace/fumengen/vir_fme/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

run_clm_pt_with_peft.py FAILED

Failures: <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2023-05-17_17:02:43
  host      : 7b4985eaff35
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 13913)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
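
Note: the SIGBUS in the multi-GPU run and the OSError: [Errno 28] No space left on device raised from _multiprocessing.SemLock here are both classic symptoms of an exhausted shared-memory filesystem (/dev/shm) inside a Docker container, which defaults to only 64 MB. A quick generic check (not from this thread):

# show the size and usage of the container's shared-memory mount
df -h /dev/shm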

ccdf1137 avatar May 17 '23 09:05 ccdf1137

Your data_cache_dir differs between the two runs? Also, if an error occurs while generating the data cache, delete the broken cache first and then regenerate it.

iMountTai avatar May 17 '23 14:05 iMountTai

I shut down the Docker container and started it again, and now it works.
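
That fits the shared-memory reading above: restarting the container resets its /dev/shm mount. If the error returns, starting the container with a larger shared-memory size may help; a hypothetical sketch (the 64g value is an arbitrary example, not from this thread):

# hypothetical: enlarge /dev/shm when starting the container
docker run --shm-size=64g <image>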

ccdf1137 avatar May 18 '23 03:05 ccdf1137

I shut down the Docker container and started it again, and now it works.

Hey, how much VRAM does a single card use at batch size 1? Can I get this running on 8x A100 (40 GB)?

dhhcj avatar May 18 '23 06:05 dhhcj

@dhhcj Use DeepSpeed stage 2 to trade CPU RAM for GPU VRAM.
[screenshot] Check whether your RAM capacity is enough.

ccdf1137 avatar May 19 '23 01:05 ccdf1137

stage

How do I adjust that? I didn't find the relevant parameter in the script or the config file.

[screenshots]

dhhcj avatar May 22 '23 10:05 dhhcj

The official DeepSpeed documentation has a tutorial; what you have should work as is.
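
For reference, the ZeRO stage is controlled in the ds_config.json passed via --deepspeed; the fragment below just repeats the relevant part of the config posted earlier in this thread:

{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}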

ccdf1137 avatar May 23 '23 01:05 ccdf1137

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] avatar May 30 '23 22:05 github-actions[bot]

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.

github-actions[bot] avatar Jun 03 '23 22:06 github-actions[bot]