EasyParallelLibrary
How to understand epl training steps: single machine with one GPU vs. single machine with multiple GPUs
Single machine, single GPU:
Launch command: TF_CONFIG='{"cluster":{"worker":["127.0.0.1:49119"]},"task":{"type":"worker","index":0}}' CUDA_VISIBLE_DEVICES=0 bash ./scripts/train_dp.sh
Single machine, two GPUs:
Launch command: TF_CONFIG='{"cluster":{"worker":["127.0.0.1:49119"]},"task":{"type":"worker","index":0}}' CUDA_VISIBLE_DEVICES=0,1 bash ./scripts/train_dp.sh
I modified the code slightly: removed the last_step limit and set the dataset to repeat=10 (rename the attached .txt to .py and it can be run directly): resnet_dp.txt
Could someone explain how to interpret this? Did each GPU run 10 steps on its own?
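For reference, a minimal sketch of the kind of script described above (no last_step hook, dataset repeat=10). The toy model and input pipeline here are assumptions for illustration, not the actual resnet_dp.py; `epl.init()` and `epl.set_default_strategy(epl.replicate(device_count=1))` follow EPL's documented data-parallel usage:

```python
# Minimal data-parallel sketch; the dense model and random data are
# placeholders, not the real resnet_dp.py contents.
import numpy as np
import tensorflow as tf
import epl

epl.init()
# EPL data parallelism: replicate the model, one replica per visible GPU.
epl.set_default_strategy(epl.replicate(device_count=1))

# Toy input pipeline: 100 samples, repeated 10 times, per-GPU batch of 20.
features = np.random.rand(100, 32).astype(np.float32)
labels = np.random.randint(0, 10, size=(100,)).astype(np.int64)
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.repeat(10).batch(20)
x, y = dataset.make_one_shot_iterator().get_next()

logits = tf.layers.dense(x, 10)
loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# No last_step limit: training runs until the finite dataset is exhausted.
with tf.train.MonitoredTrainingSession() as sess:
    while not sess.should_stop():
        sess.run(train_op)
```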
In the current configuration, batch_size is the per-GPU batch size, so global_batch_size = batch_size * gpu_num. With the total amount of data unchanged, increasing the number of GPUs linearly reduces the number of steps per epoch.
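To make that arithmetic concrete, a small worked example (the sample count and batch size below are made-up numbers, not the actual resnet_dp configuration):

```python
# Illustrative numbers only, not the real resnet_dp config.
num_samples = 1000     # dataset size after repeat(...)
per_gpu_batch = 20     # the batch_size set in the script (per GPU)

for gpu_num in (1, 2, 4):
    global_batch = per_gpu_batch * gpu_num
    steps_per_epoch = num_samples // global_batch
    print(gpu_num, global_batch, steps_per_epoch)
# -> 1 GPU: 50 steps, 2 GPUs: 25 steps, 4 GPUs: 12 steps per epoch.
# Each step consumes gpu_num batches, so doubling the GPUs halves
# the steps needed to cover the same data.
```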