EasyParallelLibrary

How should training steps be understood in EPL for single-GPU vs. multi-GPU training on a single machine?

SueeH opened this issue 2 years ago · 1 comment

Single machine, single GPU. Launch command: `TF_CONFIG='{"cluster":{"worker":["127.0.0.1:49119"]},"task":{"type":"worker","index":0}}' CUDA_VISIBLE_DEVICES=0 bash ./scripts/train_dp.sh`

Single machine, two GPUs. Launch command: `TF_CONFIG='{"cluster":{"worker":["127.0.0.1:49119"]},"task":{"type":"worker","index":0}}' CUDA_VISIBLE_DEVICES=0,1 bash ./scripts/train_dp.sh`

I modified the code slightly: removed the last_step limit, set the dataset to repeat=10, and renamed the file from .txt to .py so it is executable (a sketch of what the dataset change might look like follows). resnet_dp.txt
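As a rough illustration (not the actual attached script), a minimal tf.data sketch of the described repeat=10 change might look like the following; the array shape, sample count, and batch size are hypothetical:

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for the real training data in the attached script.
features = np.random.rand(100, 224, 224, 3).astype("float32")

dataset = tf.data.Dataset.from_tensor_slices(features)
dataset = dataset.repeat(10)  # the repeat=10 change: iterate the data 10 times
dataset = dataset.batch(32)   # per-GPU batch size (example value)

# With no last_step limit, training runs until this dataset is exhausted:
# 100 samples * 10 repeats = 1000 samples -> ceil(1000 / 32) = 32 batches.
```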

Could someone help me understand how to interpret this? Did each GPU run 10 steps separately?

SueeH · Sep 20 '23

The configured batch_size is the per-GPU batch size, so global_batch_size = batch_size * gpu_num. With the amount of data unchanged, increasing the number of GPUs decreases the number of steps per epoch linearly.
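To make the arithmetic concrete, here is a minimal sketch with hypothetical numbers (the dataset size and per-GPU batch size are assumptions, not values from this issue):

```python
# The configured batch_size is per GPU, so the global batch grows with the
# GPU count and the number of steps per epoch shrinks linearly.

dataset_size = 1280      # hypothetical number of training samples
batch_size_per_gpu = 32  # the batch_size set in the config

for num_gpus in (1, 2, 4):
    global_batch_size = batch_size_per_gpu * num_gpus
    steps_per_epoch = dataset_size // global_batch_size
    print(f"{num_gpus} GPU(s): global_batch_size={global_batch_size}, "
          f"steps_per_epoch={steps_per_epoch}")

# Output:
# 1 GPU(s): global_batch_size=32, steps_per_epoch=40
# 2 GPU(s): global_batch_size=64, steps_per_epoch=20
# 4 GPU(s): global_batch_size=128, steps_per_epoch=10
```

So in the two-GPU run, the two workers are not each running an independent 10-step epoch; they jointly consume the data, and each step processes one global batch.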

adoda · Apr 24 '24