ColossalAI
DeepSeekV3 LoRA fine-tune: memory allocation far too big!
env:
nodes: 3
GPUs per node: 8 × A100
error info:
...
  File "/opt/conda/lib/python3.10/site-packages/colossalai/booster/booster.py", line 221, in execute_pipeline
[rank23]:     return self.plugin.execute_pipeline(data_iter, model, criterion, optimizer, return_loss, return_outputs)
[rank23]:   File "/opt/conda/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 1409, in execute_pipeline
[rank23]:     outputs = self.scheduler.forward_backward_step(
[rank23]:   File "/opt/conda/lib/python3.10/site-packages/colossalai/pipeline/schedule/one_f_one_b.py", line 472, in forward_backward_step
[rank23]:     result = self.run_forward_backward(model, data_iter, criterion, optimizer, return_loss, return_outputs)
[rank23]:   File "/opt/conda/lib/python3.10/site-packages/colossalai/pipeline/schedule/one_f_one_b.py", line 400, in run_forward_backward
[rank23]:     input_obj = self.recv_forward()
[rank23]:   File "/opt/conda/lib/python3.10/site-packages/colossalai/pipeline/schedule/one_f_one_b.py", line 131, in recv_forward
[rank23]:     input_tensor, _ = self.comm.recv_forward(prev_rank, metadata_recv=self.tensor_metadata_recv)
[rank23]:   File "/opt/conda/lib/python3.10/site-packages/colossalai/pipeline/p2p.py", line 558, in recv_forward
[rank23]:     input_tensor, wait_handles = _communicate(
[rank23]:   File "/opt/conda/lib/python3.10/site-packages/colossalai/pipeline/p2p.py", line 414, in _communicate
[rank23]:     _metadata_recv = _send_recv_serialization_object(
[rank23]:   File "/opt/conda/lib/python3.10/site-packages/colossalai/pipeline/p2p.py", line 328, in _send_recv_serialization_object
[rank23]:     recv_object_tensor = torch.empty(recv_object_size_tensor.item(), dtype=torch.uint8)
[rank23]: RuntimeError: [enforce fail at alloc_cpu.cpp:117] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 140125084347488 bytes. Error code 12 (Cannot allocate memory)
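For context, the crash is not a GPU OOM: the pipeline P2P layer first receives a size header (recv_object_size_tensor) and then allocates a CPU byte buffer of that size before receiving the serialized metadata object. In this run the header decodes to ~140 TB, which is the kind of value you get when the sender and receiver fall out of step or the header bytes are misinterpreted (this is my reading of the traceback, not a confirmed root cause). The minimal sketch below only reproduces the allocator failure from a bogus size value; the tensor value is hypothetical and does not come from ColossalAI itself.

import torch

# Hypothetical stand-in for the size header exchanged by the pipeline P2P
# layer before the serialized metadata object is received. In the failing
# run this value ends up being ~1.4e14 instead of a few hundred bytes.
bogus_size_tensor = torch.tensor(140125084347488, dtype=torch.int64)

try:
    # Mirrors the failing line (colossalai/pipeline/p2p.py:328):
    # torch.empty() asks the CPU allocator for ~140 TB in a single buffer.
    torch.empty(bogus_size_tensor.item(), dtype=torch.uint8)
except RuntimeError as e:
    # Prints the same "DefaultCPUAllocator: can't allocate memory: you tried
    # to allocate 140125084347488 bytes" message seen in the traceback above.
    print(e)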