Yao Matrix
@mehdiir, we tried to reproduce your work in our environment and found one odd issue: with your code, `gradient_checkpointing=True` runs much faster than `gradient_checkpointing=False`, which contradicts our intuition (2 hr...
Since PyTorch 2.5, XPU has been a built-in PyTorch device. In this PR, we extend GaLore to XPU using official PyTorch APIs.
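For illustration only, here is a minimal sketch of device selection with built-in PyTorch calls. It is not the code from the PR: `pick_device` is a hypothetical helper name, and the fallback to CPU when PyTorch is not installed is an assumption made so the snippet runs anywhere.

```python
def pick_device() -> str:
    """Return "cuda", "xpu", or "cpu" depending on what is available.

    Hypothetical helper for illustration; not taken from the PR.
    """
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch installed, report CPU.
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"

    # torch.xpu may be missing on older PyTorch builds, so look it up defensively.
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"

    return "cpu"


print(pick_device())
```

The returned string can be passed to `torch.device(...)` or `tensor.to(...)`, so the same code path serves CUDA and XPU without device-specific branches.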
@ArthurZucker, pls help review, thx very much.
1. When running `pytest -rA tests/models/unets/test_models_unet_2d_condition.py::UNet2DConditionModelTests::test_load_sharded_checkpoint_device_map_from_hub_local` on 8 devices (CUDA, XPU), a RuntimeError is raised: "RuntimeError: Expected all tensors to be on the same device, but found at least two...
1. enable XPU for launcher -> validated
2. expand CUDA-only ds uts to XPU -> all 3 passed
3. expand profiler example to XPU -> validated
@SunMarc, pls...