Rpgmaker comments

Results: 14 comments



[BUG] Receiving CUDA error: invalid argument using pytorch 2.7 with deepspeed 0.16.4 with Cuda 12.8

Sure, here is the full stacktrace:
```
[2025-03-19 14:54:53,885] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
[rank0]: Traceback (most recent call last):
[rank0]:   File "/xxxxxx/main.py", line 41, in
[rank0]:     run()...
```

[BUG] Receiving CUDA error: invalid argument using pytorch 2.7 with deepspeed 0.16.4 with Cuda 12.8

Sure, here is the new stacktrace with the env enabled:
```
[2025-03-19 18:52:01,562] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Sliding Window Attention is enabled but not implemented for...
```

[BUG] Receiving CUDA error: invalid argument using pytorch 2.7 with deepspeed 0.16.4 with Cuda 12.8

I tried setting `pin_memory: false` and still end up with the same issue. Config:
```
{
  "train_batch_size": 1,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 2e-5
    }...
```

[BUG] Receiving CUDA error: invalid argument using pytorch 2.7 with deepspeed 0.16.4 with Cuda 12.8

Sure, let me work on it and I will provide the steps with examples. Thanks

[BUG] Receiving CUDA error: invalid argument using pytorch 2.7 with deepspeed 0.16.4 with Cuda 12.8

Below are the CPU and GPU runs for torch. It seems the CPU version did not fail:
```
python -c "import torch; from deepspeed.accelerator import...
```

[BUG] Receiving CUDA error: invalid argument using pytorch 2.7 with deepspeed 0.16.4 with Cuda 12.8

I tried using the changes with `.cpu().pin_memory()` and ended up with the same errors:
```
[rank0]:   File "/home/xxxxxx/lib/python3.12/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 407, in __init__
[rank0]:     weights_partition = get_accelerator().pin_memory(weights_partition)
[rank0]:                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxxxxx/lib/python3.12/site-packages/deepspeed/accelerator/cuda_accelerator.py",...
```

[BUG] Receiving CUDA error: invalid argument using pytorch 2.7 with deepspeed 0.16.4 with Cuda 12.8

Here is the result:
```
tensor.dtype: torch.float32
tensor.shape: torch.Size([1776255488])
tensor.device: cpu
```
Code changes:
```
def pin_memory(self, tensor, align_bytes=1):
    print("tensor.dtype: ", tensor.dtype)
    print("tensor.shape: ", tensor.shape)
    print("tensor.device: ", tensor.device)
    return...
```
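
For scale, the size of the buffer being pinned follows directly from the dtype and shape printed above. This is only a back-of-the-envelope sketch: the helper name `tensor_nbytes` is made up for illustration, and whether the size itself triggers the error is not established in this thread.

```python
# Element count and dtype are taken from the debug output above.
NUM_ELEMENTS = 1_776_255_488
BYTES_PER_FLOAT32 = 4  # torch.float32 uses 4 bytes per element


def tensor_nbytes(num_elements: int, bytes_per_element: int) -> int:
    """Return the raw storage size of a dense 1-D tensor in bytes."""
    return num_elements * bytes_per_element


nbytes = tensor_nbytes(NUM_ELEMENTS, BYTES_PER_FLOAT32)
gib = nbytes / 2**30
print(f"{nbytes} bytes = {gib:.2f} GiB")
```

So the single pinned allocation being requested is roughly 6.6 GiB of host memory.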

[BUG] Receiving CUDA error: invalid argument using pytorch 2.7 with deepspeed 0.16.4 with Cuda 12.8

This is the result. It seems that allocating/pinning the memory is the issue:
```
python -c "import torch; from deepspeed.accelerator import get_accelerator; get_accelerator().pin_memory(torch.empty(1776255488, device='cpu'))"
[2025-03-23 11:54:42,621] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator...
```
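
One way to narrow this down further would be to try pinning progressively larger CPU tensors and note where it starts failing. The following is a rough sketch, not something taken from the thread: `candidate_sizes` and `probe_pinning` are hypothetical helper names, it calls plain `torch.Tensor.pin_memory()` rather than the DeepSpeed accelerator wrapper, and it assumes the machine has enough free RAM for the largest size.

```python
def candidate_sizes(max_elements: int, start: int = 1 << 20) -> list[int]:
    """Doubling element counts from `start`, ending exactly at `max_elements`."""
    sizes = []
    n = start
    while n < max_elements:
        sizes.append(n)
        n *= 2
    sizes.append(max_elements)
    return sizes


def probe_pinning(sizes):
    """Try to pin a CPU float32 tensor of each size.

    Returns a list of (size, ok, error_message) tuples.
    """
    import torch  # imported lazily so candidate_sizes works without torch

    results = []
    for n in sizes:
        try:
            torch.empty(n, device="cpu").pin_memory()
            results.append((n, True, None))
        except RuntimeError as exc:
            results.append((n, False, str(exc)))
    return results


# Usage (needs a CUDA-enabled torch build and several GiB of free RAM):
#   for n, ok, err in probe_pinning(candidate_sizes(1_776_255_488)):
#       print(n, "ok" if ok else f"failed: {err}")
```

If small sizes pin fine and only the largest ones fail, that points at a size-related limit rather than at pinning in general.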

[BUG] Receiving CUDA error: invalid argument using pytorch 2.7 with deepspeed 0.16.4 with Cuda 12.8

Thanks, will give it a try and let you know the results.

[BUG] Receiving CUDA error: invalid argument using pytorch 2.7 with deepspeed 0.16.4 with Cuda 12.8

> > "offload_optimizer": {
> > "device": "cpu",
> > "pin_memory": False
> >
> > One workaround is to control this particular pinning by `pin_memory` in the ds_config. So to unblock...
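
For reference, here is a sketch of where the quoted `offload_optimizer` block would sit in a ds_config. Only the `offload_optimizer` keys and the two batch settings come from this thread; the ZeRO `stage` value is an assumption (the traceback goes through `stage_1_and_2.py`), and the rest of the real config is omitted. Note that Python's `False` becomes `false` once the config is written out as JSON.

```python
import json

# Minimal illustrative config; not the full config from the issue.
ds_config = {
    "train_batch_size": 1,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 2,  # assumption: not stated in the quoted snippet
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": False,  # serialized as `false` in JSON
        },
    },
}

ds_config_json = json.dumps(ds_config, indent=2)
print(ds_config_json)
```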