Linly

DeepSpeed ZeRO-3 pretraining

Open jamestch opened this issue 1 year ago • 3 comments

I cloned the latest TencentPretrain code and tested DeepSpeed ZeRO-3 pretraining on 2 × A100 80G GPUs. I ran the following script (reference: "TencentPretrain: pipeline-parallel training with DeepSpeed ZeRO-3"):

CUDA_VISIBLE_DEVICES=6,7 deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \
  --pretrained_model_path models/llama-13b.bin \
  --dataset_path dataset.pt --spm_model_path /path_to_llama/tokenizer.model \
  --config_path models/llama/13b_config.json \
  --output_model_path models/output_model.llama_13.bin \
  --world_size 2 --data_processor lm --batch_size 2 --enable_zero3

Training runs normally without ZeRO-3. With ZeRO-3 enabled, it fails with the following error: [error screenshot]

jamestch, Apr 13 '23 16:04

[screenshot] I switched the config file to the ZeRO-3 one, deepspeed_zero3_config.json, and the problem seems to be resolved:

CUDA_VISIBLE_DEVICES=6,7 deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_zero3_config.json \
  --pretrained_model_path models/llama-13b.bin \
  --dataset_path dataset.pt --spm_model_path /path_to_llama/tokenizer.model \
  --config_path models/llama/13b_config.json \
  --output_model_path models/output_model.llama_13.bin \
  --world_size 2 --data_processor lm --batch_size 2 --enable_zero3

jamestch, Apr 13 '23 16:04
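
For readers wondering what distinguishes a ZeRO-3 config from a plain DeepSpeed config, here is a minimal sketch. It is not the contents of models/deepspeed_zero3_config.json, which this thread does not show. The keys are standard DeepSpeed config options, but the values (batch size, fp16, CPU offload) and the zero_stage helper are assumptions made for illustration.

```python
import json

# Hypothetical ZeRO-3 config for illustration only; the real file shipped
# with TencentPretrain may use different values and additional keys.
zero3_config = {
    "train_micro_batch_size_per_gpu": 2,  # assumed to mirror --batch_size 2
    "fp16": {"enabled": True},            # assumption: mixed-precision training
    "zero_optimization": {
        "stage": 3,  # stage 3 partitions parameters, gradients and optimizer states
        # Assumption: offload to CPU RAM; this trades GPU memory for host memory.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}


def zero_stage(config: dict) -> int:
    """Return the configured ZeRO stage, or 0 if the section is absent."""
    return config.get("zero_optimization", {}).get("stage", 0)


if __name__ == "__main__":
    print(json.dumps(zero3_config, indent=2))
    print("ZeRO stage:", zero_stage(zero3_config))
```

A quick check like zero_stage on the file passed to --deepspeed_config can confirm that the stage matches the --enable_zero3 flag before a long job is launched.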

A question: how much memory does this usually need? I have a little over 100 GB of RAM and keep getting out-of-memory errors. RuntimeError: [enforce fail at CPUAllocator.cpp:71] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 4052879360 bytes. Error code 12 (Cannot allocate memory)

jonnyhe, May 03 '23 03:05

In my case it used 150 GB of RAM. I'm not sure why the CPU usage is also quite high.

AI-Study-Han, Jun 06 '23 05:06
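
As a rough sanity check on the memory figures above, here is a back-of-the-envelope sketch. It assumes mixed-precision Adam, where model states take about 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 weights, momentum and variance), and a nominal 13 billion parameters. How much of this ends up in host RAM depends on the offload settings in the config, which the thread does not show, so treat the result as an order-of-magnitude estimate rather than a measurement.

```python
GIB = 1024 ** 3


def model_state_bytes(num_params: int, bytes_per_param: int = 16) -> int:
    """Total bytes for weights, gradients and optimizer states.

    The default of 16 bytes/param assumes mixed-precision Adam:
    2 (fp16 weights) + 2 (fp16 grads) + 12 (fp32 weights, momentum, variance).
    """
    return num_params * bytes_per_param


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / GIB


if __name__ == "__main__":
    params_13b = 13_000_000_000  # nominal size; the exact LLaMA-13B count differs slightly
    total = model_state_bytes(params_13b)
    print(f"Model states for 13B params: {to_gib(total):.1f} GiB in total")

    # The single allocation that failed in the error message quoted above.
    failed_alloc = 4052879360
    print(f"Failed allocation: {to_gib(failed_alloc):.2f} GiB")
```

Under these assumptions the total model state is well above 100 GB, so an out-of-memory error on a machine with roughly 100 GB of RAM, and the 150 GB usage reported above, are both plausible if a large share of that state is held on the CPU.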