
Data-parallel size × number of epochs = actual number of epochs?

Open bobo0810 opened this issue 2 years ago • 1 comments

Discussed in https://github.com/hpcaitech/ColossalAI/discussions/2961

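Originally posted by bobo0810 March 1, 2023: The dataloader on each GPU iterates over the complete dataset; it is not split across GPUs. So with epoch=3 and gpu=2, using data parallelism only, the model actually passes over the dataset 6 times.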

engine, train_dataloader, val_dataloader, _ = colossalai.initialize(
    model,
    optimizer,
    criterion,
    train_dataloader,
    val_dataloader,
)

bobo0810 avatar Mar 01 '23 12:03 bobo0810

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


Title: Data-parallel size × number of epochs = actual number of epochs?

Issues-translate-bot avatar Mar 01 '23 12:03 Issues-translate-bot

Hi @bobo0810 I have replied to you in the discussion. Thanks.

binmakeswell avatar Mar 02 '23 13:03 binmakeswell

Hi @binmakeswell

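Hi @bobo0810 If a regular PyTorch dataloader is passed to Colossal, it is automatically converted to use a DistributedSampler. Each GPU processes its own part of the data, and together they complete one epoch over the whole dataset. epoch=3 means 3 passes over the dataset.

Hi @binmakeswell With data parallel = 2 and epoch = 3, each image should in theory appear only 3 times, but the log shows each image appearing 2 times per epoch, 6 times in total.

Launch command: colossalai run --nproc_per_node 2 train.py --epochs 3

Code: image

Log: image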

bobo0810 avatar Mar 09 '23 10:03 bobo0810
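The 3-versus-6 count discussed in this comment can be checked with a small, framework-free simulation. This is only an illustrative sketch: `samples_seen` is a hypothetical helper, the dataset is just a range of indices, and the strided split is an assumption that mimics how a distributed sampler typically partitions indices across ranks. It does not reproduce what ColossalAI does internally.

```python
from collections import Counter


def samples_seen(num_samples, world_size, epochs, shard):
    """Count how many times each sample index is consumed across all ranks."""
    seen = Counter()
    for _ in range(epochs):
        for rank in range(world_size):
            if shard:
                # Each rank takes a disjoint strided slice of the indices.
                indices = range(rank, num_samples, world_size)
            else:
                # Each rank iterates over the full dataset.
                indices = range(num_samples)
            seen.update(indices)
    return seen


unsharded = samples_seen(num_samples=8, world_size=2, epochs=3, shard=False)
sharded = samples_seen(num_samples=8, world_size=2, epochs=3, shard=True)

print(set(unsharded.values()))  # {6}: every sample is seen world_size * epochs times
print(set(sharded.values()))    # {3}: every sample is seen epochs times
```

If the training log shows every image twice per epoch with two processes, that matches the unsharded case, which suggests the dataloader was not split across ranks.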

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


Hi @binmakeswell

Hi @bobo0810 If a regular PyTorch dataloader is passed to Colossal, it is automatically converted to use a DistributedSampler. Each GPU processes its own part of the data, and together they complete one epoch over the whole dataset. epoch=3 means 3 passes over the dataset.

Hi @binmakeswell With data parallel = 2 and epoch = 3, each image should in theory appear only 3 times, but the log shows each image appearing 2 times per epoch, 6 times in total.

Launch command: colossalai run --nproc_per_node 2 train.py --epochs 3

Code: image

Log: image

Issues-translate-bot avatar Mar 09 '23 10:03 Issues-translate-bot

I ran into the same problem and have the same question.

terrifyzhao avatar Mar 31 '23 10:03 terrifyzhao

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


I ran into the same problem and have the same question.

Issues-translate-bot avatar Mar 31 '23 10:03 Issues-translate-bot

A DistributedSampler is not applied automatically; you need to set it up manually.

terrifyzhao avatar Mar 31 '23 11:03 terrifyzhao
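One way to handle the split manually is to build the dataloader with PyTorch's own `DistributedSampler`. The sketch below is a minimal illustration under stated assumptions, not ColossalAI's recommended recipe: the toy `TensorDataset`, the `build_loader` and `indices_for_epoch` helpers, and the in-process loop over ranks are all hypothetical. In a real job each process would build one loader with its own rank and world size (for example from `torch.distributed.get_rank()` and `get_world_size()`).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset of 8 samples whose value equals their index (illustration only).
dataset = TensorDataset(torch.arange(8))


def build_loader(rank, world_size):
    # Passing num_replicas and rank explicitly means no process group is
    # needed for this demo; a real launcher would supply these values.
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=True, seed=0
    )
    loader = DataLoader(dataset, batch_size=2, sampler=sampler)
    return loader, sampler


def indices_for_epoch(epoch, world_size):
    """Return the sample values each simulated rank sees in one epoch."""
    per_rank = []
    for rank in range(world_size):
        loader, sampler = build_loader(rank, world_size)
        # set_epoch reseeds the shuffle so each epoch uses a new ordering.
        sampler.set_epoch(epoch)
        seen = []
        for (batch,) in loader:
            seen.extend(batch.tolist())
        per_rank.append(seen)
    return per_rank


for epoch in range(3):
    rank0, rank1 = indices_for_epoch(epoch, world_size=2)
    # The two ranks see disjoint halves that together cover the dataset once.
    print(epoch, sorted(rank0 + rank1) == list(range(8)))
```

With the sampler in place, each rank sees half of the samples per epoch, so 3 epochs on 2 GPUs amount to 3 passes over the dataset rather than 6.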

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


A DistributedSampler is not applied automatically; it needs to be set up manually.

Issues-translate-bot avatar Mar 31 '23 11:03 Issues-translate-bot