Rpgmaker
Sure, here is the full stacktrace ``` [2025-03-19 14:54:53,885] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False [rank0]: Traceback (most recent call last): [rank0]: File "/xxxxxx/main.py", line 41, in [rank0]: run()...
Sure, here is the new stack trace with the env enabled. ``` [2025-03-19 18:52:01,562] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) Sliding Window Attention is enabled but not implemented for...
I tried setting `pin_memory: false` and still ended up with the same issue. config: ``` { "train_batch_size": 1, "gradient_accumulation_steps": 1, "optimizer": { "type": "AdamW", "params": { "lr": 2e-5 }...
Sure, let me work on it and I will provide the steps with examples. Thanks.
Below are the CPU and GPU runs for torch. It seems the CPU version did not fail: ``` python -c "import torch; from deepspeed.accelerator import...
I tried the changes with `.cpu().pin_memory()` and ended up with the same errors: ``` [rank0]: File "/home/xxxxxx/lib/python3.12/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 407, in __init__ [rank0]: weights_partition = get_accelerator().pin_memory(weights_partition) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/home/xxxxxx/lib/python3.12/site-packages/deepspeed/accelerator/cuda_accelerator.py",...
Here is the result below. ``` tensor.dtype: torch.float32 tensor.shape: torch.Size([1776255488]) tensor.device: cpu ``` code changes: ``` def pin_memory(self, tensor, align_bytes=1): print("tensor.dtype: ", tensor.dtype) print("tensor.shape: ", tensor.shape) print("tensor.device: ", tensor.device) return...
This is the result. It seems that allocating/pinning the memory is the issue: ``` python -c "import torch; from deepspeed.accelerator import get_accelerator; get_accelerator().pin_memory(torch.empty(1776255488, device='cpu'))" [2025-03-23 11:54:42,621] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator...
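For a sense of scale, here is a rough sketch of how much page-locked host memory that single `pin_memory` call is asking for. It only uses the shape (`1776255488` elements) and dtype (`torch.float32`) reported above; the helper name `pinned_bytes` is made up for illustration and is not part of DeepSpeed or torch.

```python
# Back-of-the-envelope size of the buffer being pinned.
# Values come from the reported tensor.shape and tensor.dtype.
NUM_ELEMENTS = 1_776_255_488  # torch.Size([1776255488])
BYTES_PER_ELEMENT = 4         # torch.float32 uses 4 bytes per element


def pinned_bytes(num_elements: int, bytes_per_element: int) -> int:
    """Hypothetical helper: bytes needed for a contiguous 1-D buffer."""
    return num_elements * bytes_per_element


total_bytes = pinned_bytes(NUM_ELEMENTS, BYTES_PER_ELEMENT)
total_gib = total_bytes / 2**30

# prints: 7105021952 bytes = 6.62 GiB
print(f"{total_bytes} bytes = {total_gib:.2f} GiB")
```

So the failing call is trying to page-lock roughly 6.6 GiB in one allocation, which is worth comparing against the host's free RAM and any locked-memory limits.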
Thanks, will give it a try and let you know the results.
> > "offload_optimizer": { > > "device": "cpu", > > "pin_memory": False > > One workaround is to control this particular pinning by `pin_memory` in the ds_config. So to unblock...
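A minimal sketch of where that setting sits in a ds_config, written as a Python dict and dumped to JSON. The `offload_optimizer` keys are the ones quoted above; the `"stage": 2` value and the batch-size fields are assumptions taken from the `stage_1_and_2.py` frames and the config posted earlier in this thread, not a confirmed copy of the reporter's file. An earlier comment here reports the same error with `pin_memory: false`, so treat this as an illustration of the setting rather than a confirmed fix.

```python
import json

# Sketch of a ds_config that turns off pinning for the offloaded optimizer state.
ds_config = {
    "train_batch_size": 1,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 2,  # assumption: the trace goes through stage_1_and_2.py
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": False,  # Python False; serialized as JSON false
        },
    },
}

# Serialize to the JSON form DeepSpeed config files use.
config_text = json.dumps(ds_config, indent=2)
print(config_text)
```

Note that in a JSON config file the value must be lowercase `false`; `False` is only valid when the config is built as a Python dict.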