Garrett Goon

Results: 10 issues by Garrett Goon

## Description

Porting exercise sanity checking. Not planning to merge.

## Test Plan

## Commentary (optional)

## Checklist

- [ ] User-facing API changes need the "User-facing API Change" label....

cla-signed

## Description

The main goal of this PR was to update the search space used for our DeepSpeed Autotune (`dsat`) module. Previously the stage-3 search space searched over irrelevant fields...

cla-signed

### Describe the bug

Hi, I am trying to time `xpu` operations using `xpu.Event`, analogously to how `cuda.Event` is used, and am getting unexpected results. Is `xpu.Event` supported? I didn't see...

XPU/GPU
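A minimal sketch of the timing pattern the report describes, assuming `torch.xpu.Event` mirrors the `torch.cuda.Event` API (`enable_timing`, `record`, `synchronize`, `elapsed_time`); the helper name is hypothetical:

```python
def elapsed_ms(fn):
    # Hypothetical helper: time a callable on the current xpu device,
    # assuming torch.xpu.Event behaves like torch.cuda.Event.
    import torch
    start = torch.xpu.Event(enable_timing=True)
    end = torch.xpu.Event(enable_timing=True)
    start.record()
    fn()
    end.record()
    # Events are recorded asynchronously; synchronize before reading the timer.
    torch.xpu.synchronize()
    return start.elapsed_time(end)  # milliseconds
```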

### Describe the bug

Communication and computation do not appear to overlap when launching kernels in different `xpu.Stream`s (on Intel GPU Max 1550s). Being able to overlap communication and computation...

dGPU-Max
Functionality
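The attempted overlap can be sketched as below, assuming `torch.xpu.Stream` and the `torch.xpu.stream` context manager mirror their `torch.cuda` counterparts; both function arguments and the helper name are hypothetical:

```python
def overlapped_step(compute_fn, comm_fn):
    # Hypothetical sketch: issue communication on a side stream while
    # compute runs on the default stream, hoping the two overlap.
    import torch
    comm_stream = torch.xpu.Stream()
    with torch.xpu.stream(comm_stream):
        comm_fn()    # e.g. an async collective such as dist.all_reduce(..., async_op=True)
    compute_fn()     # default-stream compute, expected to overlap with comm_fn
    torch.xpu.synchronize()
```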

Hello, great plugin! I was hoping to have a user-defined loop with unicode characters in it. Trying something minimal like the below:

```lua
require("boole").setup({
  additions = {
    { "▽", "△"...
```

### Describe the bug

Repeated calls into `torch.dist.reduce_scatter_tensor` eventually raise a `ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY` error in multi-node setups. Similar behavior is found when using Fully Sharded Data Parallel, which calls into `reduce_scatter_tensor`...

XPU/GPU
Functionality
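The failure pattern can be sketched as a minimal loop (hypothetical helper names; assumes a process group already initialized with the `ccl` backend and an `xpu` device, neither of which is shown here):

```python
def chunk_shape(full_shape, world_size):
    # reduce_scatter_tensor splits dim 0 evenly across ranks:
    # each rank receives full_shape[0] // world_size rows of the reduced tensor.
    assert full_shape[0] % world_size == 0
    return (full_shape[0] // world_size,) + tuple(full_shape[1:])

def repro_loop(world_size, steps=1000):
    # Hypothetical repro loop for the reported leak: repeatedly
    # reduce-scatter a large tensor on an xpu device.
    import torch
    import torch.distributed as dist
    full = torch.randn(world_size * 1024, 1024, device="xpu")
    out = torch.empty(chunk_shape(tuple(full.shape), world_size), device="xpu")
    for _ in range(steps):
        # Per the report, device memory grows across iterations until
        # ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY is raised.
        dist.reduce_scatter_tensor(out, full)
```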

### Describe the bug

When using two Intel GPU 1550 nodes, Fully Sharded Data Parallel raises OOM errors (`ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY`), similar to #640. Tested by wrapping a simple linear model:...
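A minimal wrap of the kind the report describes might look like the following sketch (hypothetical helper name and layer sizes; assumes an initialized process group and an `xpu` device):

```python
def build_fsdp_linear(device="xpu"):
    # Hypothetical minimal setup: wrap a single linear layer in FSDP,
    # whose sharded gradient reduction calls reduce_scatter_tensor.
    import torch
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    model = torch.nn.Linear(4096, 4096, device=device)
    return FSDP(model)
```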

Hello, cross posting from [ipex #647](https://github.com/intel/intel-extension-for-pytorch/issues/647): torch-ccl does not support `torch.distributed.reduce_scatter`, despite the claims in the docs. For instance, in 2.1.300+xpu we have: https://github.com/intel/torch-ccl/blob/b9ce71371fdb11f980befaa9d49a36a3c2c6e82b/src/ProcessGroupCCL.cpp#L871-L877 where the `TORCH_CHECK` line raises a...
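The unsupported list-based collective can be sketched as follows (hypothetical function name and tensor sizes; assumes an initialized `ccl` process group):

```python
def try_reduce_scatter(world_size):
    # Hypothetical call matching the report: the list-based
    # dist.reduce_scatter reportedly raises via a TORCH_CHECK in
    # torch-ccl's ProcessGroupCCL, despite the docs claiming support.
    import torch
    import torch.distributed as dist
    inputs = [torch.ones(8, device="xpu") for _ in range(world_size)]
    output = torch.empty(8, device="xpu")
    dist.reduce_scatter(output, inputs)  # reported to raise on the ccl backend
    return output
```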

Cross posting from this [ipex issue](https://github.com/intel/intel-extension-for-pytorch/issues/640). Repeated calls into `torch.dist.reduce_scatter_tensor` eventually raise a `ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY` error in multi-node setups. Similar behavior is found when using Fully Sharded Data Parallel, which calls...

Cross-posting [this issue](https://github.com/intel/intel-extension-for-pytorch/issues/599) from `ipex`, in case the `torch-ccl` team is not aware of it. Key issues:

* Compute and collective communications do not overlap on Intel GPU devices
* ...