lpty
https://blog.csdn.net/sinat_33741547/article/details/79933376
```
/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:462 in __init__

  459 │ #print(torch.cuda.memory.memory_summary())
  460 │ print(torch.cuda.memory_allocated()/1024/1024, torch.cuda.max_memory_allocat
  461 │ print(get_accelerator(), self.device, largest_param_numel/1024/1024, ...
```
@tjruwase CPU/GPU memory is enough for this job: I run the gpt-neo-125M model on an NVIDIA 3090 (24 GB) and CPU memory is 64 GB. System: Win10 with an NVIDIA 3090 (24 GB) and cpu...
After I edit this code, it works. Does somebody know why?

```python
# /opt/conda/lib/python3.8/site-packages/deepspeed/accelerator/cuda_accelerator.py
def pin_memory(self, tensor):
    print('#################come on', tensor)
    #return tensor.pin_memory()
    return tensor
```

log:

```
Loading extension module utils...
Time...
```
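The edit above disables pinning unconditionally, which trades crash avoidance for slower host-to-device transfers on every tensor. A less invasive workaround is to attempt pinning and fall back to the pageable tensor only when the runtime refuses. This is a sketch, not DeepSpeed's actual API: `pin_memory_with_fallback` and the injected `pin_fn` are hypothetical names used for illustration.

```python
def pin_memory_with_fallback(tensor, pin_fn):
    """Try to page-lock (pin) a host tensor; keep the pageable one on failure.

    pin_fn is the pinning callable (e.g. a bound Tensor.pin_memory in
    PyTorch). If the CUDA runtime cannot allocate pinned memory (the
    underlying cudaHostAlloc fails with "CUDA error: out of memory"),
    we return the original, unpinned tensor instead of crashing.
    """
    try:
        return pin_fn(tensor)
    except RuntimeError:
        # Pinned (page-locked) memory is a scarcer resource than free RAM,
        # so this can fail even when CPU memory usage looks low.
        return tensor
```

With this pattern only the tensors that fail to pin lose the fast-transfer path; everything else still benefits from pinned memory.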
@tjruwase Yeah, this is a 1.3B model, but 125M models also report errors. The cause I found is the location of the `pin_memory` call, which is strange.
@tjruwase It raises a `RuntimeError: CUDA error: out of memory`:

```
[2023-03-24 13:10:21,979] [INFO] [logging.py:75:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2023-03-24 13:10:21,983] [INFO] [logging.py:75:log_dist] [Rank...
```
@tjruwase CPU memory usage is only 10%; it should be enough to load this tensor.

```
Rank: 0 partition count [1] and sizes[(125198592, False)]
#################come on tensor([0., 0., 0., ...,
```