Data-parallel size × number of epochs = the real number of epochs?
Discussed in https://github.com/hpcaitech/ColossalAI/discussions/2961
Originally posted by bobo0810 March 1, 2023: The dataloader on each GPU iterates the complete dataset; it is not sharded. So with epoch=3 and gpu=2 under pure data parallelism, the model actually makes 6 passes over the dataset.
# Colossal-AI engine API: wraps model/optimizer/criterion into an engine
# and returns the dataloaders
engine, train_dataloader, val_dataloader, _ = colossalai.initialize(
model,
optimizer,
criterion,
train_dataloader,
val_dataloader,
)
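A quick sanity check for the sharding claim is to compare the per-rank loader length against the dataset size. A minimal sketch, assuming the torch.distributed process group is already initialized (as it is after colossalai.initialize):

import torch.distributed as dist

# If the loader were sharded by a DistributedSampler, each rank would see
# roughly len(dataset) / world_size samples per epoch; if every rank reports
# the full dataset length, each "epoch" is really world_size passes in total.
rank, world_size = dist.get_rank(), dist.get_world_size()
print(f"rank {rank}/{world_size}: {len(train_dataloader)} batches, "
      f"{len(train_dataloader.dataset)} samples")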
Hi @bobo0810 I have replied to you in the discussion. Thanks.
Hi @binmakeswell
Hi @bobo0810 If a regular PyTorch dataloader is passed to Colossal-AI, it is automatically converted to use a DistributedSampler. Each GPU processes its own portion of the data, and the ranks jointly complete one epoch over the entire dataset, so epoch=3 means 3 passes over the dataset.
Hi @binmakeswell With data parallelism = 2 and epoch = 3, each image should in theory appear only 3 times, but the log shows each image appearing twice per epoch, 6 times in total.
Launch command: colossalai run --nproc_per_node 2 train.py --epochs 3
Code: (screenshot in the original issue)

Log: (screenshot in the original issue)

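To quantify the duplication without screenshots, one can tally how often each sample index appears per epoch across ranks. A minimal sketch, assuming the dataset's __getitem__ is modified to also return the index (an illustrative change, not part of the original train.py):

from collections import Counter
import torch.distributed as dist

# Tally the sample indices drawn on this rank during one epoch.
counts = Counter()
for images, labels, indices in train_dataloader:  # dataset must yield indices
    counts.update(indices.tolist())

# Merge the per-rank tallies on rank 0.
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, counts)
if dist.get_rank() == 0:
    total = sum(gathered, Counter())
    # Sharded loaders: every index appears once per epoch.
    # Unsharded loaders: every index appears world_size times (here, twice).
    print(Counter(total.values()))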
I ran into the same problem; same question here.
It does not automatically apply a DistributedSampler; you need to handle it manually.
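For reference, a minimal sketch of the manual handling with a plain PyTorch DistributedSampler, assuming the process group is initialized (train_dataset, the batch size, and num_epochs are placeholders, not the original train.py):

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Explicitly shard the dataset: each rank gets a disjoint ~1/world_size slice,
# so epoch=3 on 2 GPUs is 3 passes over the dataset, not 6.
sampler = DistributedSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, batch_size=64, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffle differently each epoch
    for batch in train_dataloader:
        ...  # forward/backward as usual, e.g. via the Colossal-AI engine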