ColossalAI-Examples

[feature] New example: MAE pretraining on ImageNet 1000 dataset

Open ofey404 opened this issue 2 years ago • 9 comments

Colossal-AI implementation of MAE (arXiv).

As an example, we only cover the pretraining phase, using the ImageNet 1000 (mini) dataset. Helpers under the util/ subdirectory come from facebookresearch/deit and are used under the Apache License 2.0.

About the coding style

The coding style is a little different from other examples such as run_resnet_cifar10_with_engine.py: the configuration file config/pretrain.py handles rich initialization logic and default values.

The DeiT and MAE code has a complicated, intertwined initialization process. By making full use of Colossal-AI's dynamic Python configuration, we can keep things simple enough for newcomers to understand.
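For illustration, a config in this style might look like the sketch below. The names and values are illustrative only, not the actual contents of config/pretrain.py:

```python
# Illustrative sketch only -- not the actual contents of config/pretrain.py.
import os

from colossalai.amp import AMP_TYPE

# Module-level variables in a Colossal-AI config file become entries of
# gpc.config, so defaults can be computed with ordinary Python.
BATCH_SIZE = 64
TOTAL_EPOCHS = 800

# Read the dataset location from the environment instead of hard-coding it.
DATA_PATH = os.environ.get("DATA_PATH", "./data/imagenet-mini")

# Scale the base learning rate with the effective batch size, as MAE does.
BASE_LR = 1.5e-4
LEARNING_RATE = BASE_LR * BATCH_SIZE / 256

# Mixed-precision training via Colossal-AI's naive AMP.
fp16 = dict(mode=AMP_TYPE.NAIVE)
```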

ofey404 avatar Apr 12 '22 07:04 ofey404

Hi, since we want to support hybrid-parallel MAE, could you try to support TP and PP as well? You can refer to the tutorial.
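For reference, in Colossal-AI's documented config schema a hybrid TP+PP setup is declared with a parallel dict along these lines; the sizes here are illustrative:

```python
# Sketch of a hybrid-parallel setting in a Colossal-AI config file;
# the sizes are illustrative, not a recommendation for MAE.
parallel = dict(
    pipeline=2,                      # 2 pipeline stages (PP)
    tensor=dict(size=2, mode='1d'),  # 2-way 1D tensor parallelism (TP)
)
```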

FrankLeeeee avatar Apr 13 '22 06:04 FrankLeeeee

A pure-DP MAE on Colossal-AI has already been completed: https://github.com/lucasliunju/mae-colossalai

binmakeswell avatar Apr 13 '22 06:04 binmakeswell

Okay, thank you for the help! I'm still a beginner, so I'm glad to check out those links.

ofey404 avatar Apr 13 '22 06:04 ofey404

Hey everybody! I managed to support (limited) tensor parallelism. You can check it by running:

torchrun --standalone --nnodes 1 --nproc_per_node 4 main_pretrain.py --config ./config/pretrain_1d_tp2.py 

I adapted the model in models_mae_tp.py. More information is in README.md.
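Based on the file name and Colossal-AI's config schema, the parallel section of config/pretrain_1d_tp2.py presumably looks something like the following; this is a guess, not the actual file contents:

```python
# Presumed parallel section of config/pretrain_1d_tp2.py (inferred from
# the file name; not the actual contents). With 4 processes this gives
# 2-way 1D tensor parallelism x 2-way data parallelism, since the
# data-parallel size is inferred from the world size.
parallel = dict(
    tensor=dict(size=2, mode='1d'),
)
```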

ofey404 avatar Apr 15 '22 06:04 ofey404

Added save & load model functionality with colossalai.utils.checkpointing.
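Roughly, the usage looks like the sketch below; exact signatures may differ between Colossal-AI versions, so treat this as an assumption rather than the example's actual code:

```python
# Sketch of colossalai.utils.checkpointing usage; exact signatures may
# differ between Colossal-AI versions.
import torch
from colossalai.utils import load_checkpoint, save_checkpoint

model = torch.nn.Linear(8, 8)                      # stand-in for the MAE model
optimizer = torch.optim.AdamW(model.parameters())  # stand-in optimizer
epoch = 10

# Save model (and optionally optimizer) state together with the epoch.
save_checkpoint('mae_pretrain.pt', epoch, model, optimizer)

# Resume: restores the states in place and returns the saved epoch.
start_epoch = load_checkpoint('mae_pretrain.pt', model, optimizer)
```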

ofey404 avatar Apr 16 '22 03:04 ofey404

Hi @ofey404, thank you for your contribution! Could you please provide training logs under different parallelism settings?

yuxuan-lou avatar Apr 19 '22 08:04 yuxuan-lou

> Hi @ofey404, thank you for your contribution! Could you please provide training logs under different parallelism settings?

Several epochs or a full run? A full 800-epoch run might take a long time to finish...

ofey404 avatar Apr 19 '22 08:04 ofey404

> Hi @ofey404, thank you for your contribution! Could you please provide training logs under different parallelism settings?

> Several epochs or a full run? A full 800-epoch run might take a long time to finish...

Perhaps you could provide some basic validation first, for example on CIFAR or a subset of ImageNet such as ImageNet100, and use the server's idle time at night to verify roughly 30% of the ImageNet epochs. That way we can offer this example to users with some confidence. Finally, you could complete the full convergence verification.

binmakeswell avatar Apr 20 '22 07:04 binmakeswell

ImageNet100 on Kaggle is 16 GB, while the ImageNet 1000 (mini) dataset I used is only 2 GB. CIFAR10 might be a good candidate for basic validation.

The problem is that the original MAE pretraining code doesn't include validation; only the main training stage has it. So I'd better implement the main training stage too.

ofey404 avatar Apr 20 '22 08:04 ofey404