ColossalAI
[PROPOSAL]: Gemini Decouples ChunkManager from the Model
Proposal
Motivations
Currently, a ChunkManager is attached to a single PyTorch model instance. This has two limitations:
- Multiple models cannot be trained with Gemini at the same time: the heterogeneous memory usage of different models would interfere with each other, so the statistics collected during the warm-up phase would no longer be meaningful for any individual model.
- More importantly, it deviates from how PyTorch is used. As shown below, the Gemini optimizer wrapper must take a model as an argument:
```python
model = zero_model_wrapper(model, zero_stage, gemini_config)
optimizer = zero_optim_wrapper(model, optimizer, optim_config=optim_config)
```
A PyTorch optimizer, by contrast, is constructed independently of any model (https://pytorch.org/docs/stable/optim.html). Although most users build the optimizer from model.parameters(), it can also be built from an explicit list of tensors; Gemini currently cannot support the second form shown below.
```python
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# var1 and var2 are arbitrary tensors with requires_grad=True, not tied to a model
optimizer = optim.Adam([var1, var2], lr=0.0001)
```
Objectives
- Decouple ChunkManager from the model. Make ChunkManager an independent module so that multiple models can register their parameters into the same ChunkManager, enabling heterogeneous memory management across several models trained simultaneously (see the sketch after this list). This could be helpful for ChatGPT-style training, which involves several models at once.
- Decouple the model from Gemini optimizer initialization, so that the optimizer interface truly matches PyTorch's.
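
To make the proposal concrete, here is a minimal sketch of what the decoupled interface could look like. Everything in it is hypothetical: a standalone `ChunkManager` constructor, the `register_model` method, and a `zero_optim_wrapper` call that takes no model are illustrations of the proposed design, not existing ColossalAI APIs.

```python
import torch
import torch.nn as nn

# Hypothetical: ChunkManager is created on its own, not inside a wrapped model.
chunk_manager = ChunkManager(chunk_size=64 * 1024 * 1024,
                             init_device=torch.device("cpu"))

# Several models register their parameters into the same ChunkManager, so
# warm-up statistics and CPU/GPU chunk placement are managed jointly.
actor = nn.Linear(1024, 1024)
critic = nn.Linear(1024, 1)
chunk_manager.register_model(actor)   # hypothetical API
chunk_manager.register_model(critic)  # hypothetical API

# The optimizer is built exactly as in plain PyTorch, from a parameter list,
# with no reference to a particular wrapped model ...
optimizer = torch.optim.Adam(
    list(actor.parameters()) + list(critic.parameters()), lr=1e-4
)
# ... and the Gemini wrapper only needs the optimizer, because the chunks
# backing those parameters are found through the shared ChunkManager.
optimizer = zero_optim_wrapper(optimizer)  # hypothetical signature, no model argument
```

Under such a design, training several models (e.g., an actor and a critic) within one shared heterogeneous memory budget works the same way as training a single model, which is what the multi-model use case in the motivation requires.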
Self-service
- [X] I'd be willing to do some initial work on this proposal myself.