Zhengxiong Luo
Results: 11 comments of Zhengxiong Luo
I have the same problem. I am using ZeRO-3 to train a transformer across multiple nodes. On each node, DeepSpeed allocates much more memory on the GPU with local_rank=0 than on the other GPUs.