
[Question]: How do I run Aquila-pretrain with DeepSpeed? Could the team provide an official run script for this?

Open · zt1556329495 opened this issue 2 years ago · 1 comment

Description

Directly changing env_type to deepspeed produces the following error:

Emitting ninja build file /root/.cache/torch_extensions/py38_cu116/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.44617509841918945 seconds
Loading extension module utils...
Time to load utils op: 0.3033127784729004 seconds
Loading extension module utils...
Time to load utils op: 0.7057700157165527 seconds
Loading extension module utils...
Time to load utils op: 0.40380406379699707 seconds
Loading extension module utils...
Time to load utils op: 0.4030745029449463 seconds
Rank: 1 partition count [8, 8] and sizes[(911908864, False), (33280, False)]
Rank: 0 partition count [8, 8] and sizes[(911908864, False), (33280, False)]
Rank: 7 partition count [8, 8] and sizes[(911908864, False), (33280, False)]
Rank: 5 partition count [8, 8] and sizes[(911908864, False), (33280, False)]
Rank: 6 partition count [8, 8] and sizes[(911908864, False), (33280, False)]
Rank: 2 partition count [8, 8] and sizes[(911908864, False), (33280, False)]
Rank: 3 partition count [8, 8] and sizes[(911908864, False), (33280, False)]
Rank: 4 partition count [8, 8] and sizes[(911908864, False), (33280, False)]
[2023-07-10 09:01:23,550] [INFO] [utils.py:785:see_memory_usage] Before initializing optimizer states
[2023-07-10 09:01:23,551] [INFO] [utils.py:786:see_memory_usage] MA 14.35 GB Max_MA 14.35 GB CA 14.36 GB Max_CA 14 GB
[2023-07-10 09:01:23,551] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 115.52 GB, percent = 7.6%
/data/yhz/envs/zt_aquila/lib/python3.8/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead.
See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions

warnings.warn(
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126428 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126429 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126430 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126431 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126432 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126433 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 126436 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 126428 via 15, forcefully exitting via 9
WARNING:torch.distributed.elastic.multiprocessing.api:Unable to shutdown process 126430 via 15, forcefully exitting via 9
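The FutureWarning about torch.distributed.launch vs torchrun is only a deprecation notice; the actual failure seems to be the worker processes being sent SIGTERM. For context, the change I made amounts to switching env_type in the training script, roughly as in the sketch below. The keyword names (env_type, deepspeed_config, hostfile, num_gpus) follow the FlagAI Trainer as used in the Aquila-chat example and are assumptions here, not the official Aquila-pretrain script:

```python
# Minimal sketch, assuming the FlagAI Trainer accepts the same keywords as the
# Aquila-chat example; values are placeholders, not an official configuration.
from flagai.trainer import Trainer

trainer = Trainer(
    env_type="deepspeed",                 # changed from "pytorchDDP"
    experiment_name="aquila_pretrain",
    batch_size=4,
    lr=1e-5,
    epochs=1,
    num_nodes=1,
    num_gpus=8,
    fp16=True,
    hostfile="./hostfile",
    deepspeed_config="./deepspeed.json",  # JSON copied over from Aquila-chat
    training_script=__file__,
    save_dir="checkpoints_aquila_pretrain",
)
```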

Alternatives

No response

zt1556329495 · Jul 10 '23 07:07

I also configured a DeepSpeed JSON for Aquila-pretrain, following the deepspeed JSON file in Aquila-chat.
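The adapted config looks roughly like the sketch below (written as a Python dict that dumps the JSON; the keys are standard DeepSpeed ZeRO-2 options, and the values are placeholders rather than the exact Aquila-chat file):

```python
# Hedged sketch of a ZeRO-2 deepspeed.json in the spirit of the Aquila-chat
# config; keys are standard DeepSpeed options, values are placeholders.
import json

deepspeed_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0,
    "steps_per_print": 100,
    "fp16": {
        "enabled": True,
        "loss_scale": 0,
        "initial_scale_power": 16,
    },
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": True,
        "overlap_comm": True,
        "reduce_scatter": True,
    },
}

with open("deepspeed.json", "w") as f:
    json.dump(deepspeed_config, f, indent=2)
```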

zt1556329495 · Jul 10 '23 07:07

Does this problem still occur? It runs fine on my side. Could it be insufficient GPU memory?
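If it is memory, note that the log shows about 14.35 GB already allocated per GPU before optimizer states are initialized. One option to try (standard DeepSpeed keys, but untested with Aquila-pretrain here) is offloading optimizer states to CPU in the same deepspeed.json:

```python
# Hedged sketch: enable ZeRO-Offload for optimizer states to lower per-GPU memory.
# "offload_optimizer" is a standard DeepSpeed zero_optimization option; whether the
# Aquila-pretrain script works with it is an assumption, not verified here.
import json

with open("deepspeed.json") as f:
    cfg = json.load(f)

cfg.setdefault("zero_optimization", {}).update({
    "stage": 2,  # offloading optimizer states requires ZeRO stage >= 2
    "offload_optimizer": {"device": "cpu", "pin_memory": True},
})

with open("deepspeed.json", "w") as f:
    json.dump(cfg, f, indent=2)
```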

BAAI-OpenPlatform · Jul 26 '23 01:07