NR Wu
Before the llama implementation was merged into mega-ds, we implemented another llama in our private repo, and we found that you can train at most a 13B llama without offloading with 8...
We have been working on LMs recently and encountered this problem. I am trying to fix it. @ShadenSmith @duli2012
In DeepSpeed, large models are allocated inside the `zero.Init` context. Is there anything similar in torch FSDP?

```python
with deepspeed.zero.Init():
    model = MyLargeModel()
```
> It is not necessary to move the model to GPU before passing to FSDP:
>
> ```
> model = Net().to(rank)
> ```
>
> You only need to...
> FSDP has some support for deferred initialization if you look at the `param_init_fn` constructor argument, which would allow exceeding the capacity of CPU DRAM. However, the current support is...
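For reference, here is a minimal sketch (not from this thread) of what deferred initialization with `param_init_fn` can look like, assuming a recent PyTorch (2.x) where modules can be constructed on the meta device and `torch.distributed` has already been initialized (e.g. via `torchrun`); the layer sizes are placeholders.

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Build the model on the meta device so no parameter memory is allocated yet.
with torch.device("meta"):
    model = nn.Sequential(nn.Linear(8192, 8192), nn.Linear(8192, 8192))

def param_init_fn(module: nn.Module) -> None:
    # Materialize this module's own parameters on the local GPU,
    # then run its normal initialization if it defines one.
    module.to_empty(device=torch.cuda.current_device(), recurse=False)
    if callable(getattr(module, "reset_parameters", None)):
        module.reset_parameters()

# FSDP materializes each meta-device submodule via param_init_fn before sharding.
model = FSDP(model, param_init_fn=param_init_fn)
```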
`stCoRoutineAttr_t` has a constructor:

```c++
stCoRoutineAttr_t()
{
    stack_size = 128 * 1024;
    share_stack = NULL;
}
```
Same problem here.
> Any updates on this?

Same question.