
Does finetune_cpm_bee.py currently support model-parallel training, and how should the arguments be set?

Open diaojunxian opened this issue 1 year ago • 0 comments

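The machine I am running on has four 3090 GPUs, but launching delta (incremental) fine-tuning with the commands below fails with an error: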

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32

torchrun --nnodes=1 --nproc_per_node=4 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:12345 finetune_cpm_bee.py --use-delta --model-config /home/CPM-Bee/src/config/cpm-bee-10b.json xxxxx
OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB (GPU 0; 23.70 GiB total capacity; 22.86 GiB already allocated; 64.44 MiB free; 23.16 GiB reserved in total by PyTorch) If reserved memory is

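It looks like the current parallelism is DDP-style data parallelism, where every GPU holds a full copy of the model. I am not sure whether finetune_cpm_bee.py currently supports a model-parallel training mode.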
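A rough back-of-the-envelope check of why full replication is tight on these cards, as a sketch only. It assumes about 10 billion parameters (inferred from the config name cpm-bee-10b.json) stored in fp16 at 2 bytes each, and it counts weights only, ignoring gradients, optimizer state, and activations. The helper `weight_memory_gib` is hypothetical and not part of CPM-Bee; the 23.70 GiB figure is taken from the error message above.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_memory_gib(num_params: int, bytes_per_param: int, shards: int = 1) -> float:
    """Weight-only memory per GPU in GiB, with weights split evenly over `shards` GPUs."""
    return num_params * bytes_per_param / shards / GIB


# Assumption: ~10B parameters, inferred from the config name "cpm-bee-10b".
NUM_PARAMS = 10_000_000_000
# Assumption: fp16 weights, 2 bytes per parameter.
BYTES_PER_PARAM = 2
# Total capacity per GPU as reported in the OutOfMemoryError message.
GPU_CAPACITY_GIB = 23.70

# Data parallelism: every GPU holds a full replica of the weights.
replicated = weight_memory_gib(NUM_PARAMS, BYTES_PER_PARAM)
# Idealized model parallelism: weights split evenly across the 4 GPUs.
sharded = weight_memory_gib(NUM_PARAMS, BYTES_PER_PARAM, shards=4)

print(f"full replica per GPU: {replicated:.2f} GiB of {GPU_CAPACITY_GIB} GiB")
print(f"split over 4 GPUs:    {sharded:.2f} GiB of {GPU_CAPACITY_GIB} GiB")
```

Under these assumptions a full replica of the weights alone takes most of one 3090's memory, which would be consistent with the 22.86 GiB already allocated when the 80 MiB request failed. Whether the script can actually split the model across GPUs is the open question here.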

diaojunxian, Jun 09 '23 06:06