Corey adams comments

63 comments



Unexpected behavior with JIT'ing allreduce

Here you go: https://gist.github.com/coreyjadams/ad7ba4d544822d14a8e9bd1b9849e004

Unexpected behavior with JIT'ing allreduce

Your changes surprise me: I would not expect switching from `cudaMemcpy` to `cudaMemcpyAsync` to ... enforce synchronization? Either way, I can confirm it's fixed on both my laptop and...

Unexpected behavior with JIT'ing allreduce

Update: it's working multi-node now too, with rather large memory buffers copying successfully. I think the fix works!
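When validating a fix like this at larger buffer sizes, an independent expected value is handy. The sketch below is a plain-Python reference for sum-allreduce semantics: every rank ends up with the elementwise sum of all ranks' inputs. It does no communication, is not the JIT'd code from the gist, and `reference_allreduce_sum` is a hypothetical helper name.

```python
from typing import List, Sequence


def reference_allreduce_sum(per_rank: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return what each rank should hold after a sum-allreduce.

    per_rank[r] is rank r's local buffer; all buffers must have equal length.
    """
    if not per_rank:
        raise ValueError("need at least one rank")
    length = len(per_rank[0])
    if any(len(buf) != length for buf in per_rank):
        raise ValueError("all ranks must contribute buffers of the same length")
    # Elementwise sum across ranks.
    total = [sum(buf[i] for buf in per_rank) for i in range(length)]
    # Every rank receives an identical copy of the reduced buffer.
    return [list(total) for _ in per_rank]


# Usage: three simulated ranks, each filling its buffer with (rank + 1).
world = 3
buffers = [[float(r + 1)] * 4 for r in range(world)]
result = reference_allreduce_sum(buffers)
print(result[0])  # [6.0, 6.0, 6.0, 6.0]
```

In a real multi-node check, each rank would compare its own output slice against the matching entry of this reference.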

🚀[FEA]: Distributed Training/Inference: handle scatter/gather better and more consistently

On `DTensor`: It is unlikely that `DTensor` will ever be suitable for this task. The challenge is that `DTensor` explicitly assumes tensors are distributed across ranks as if you called...

🚀[FEA]: Distributed Training/Inference: handle scatter/gather better and more consistently

I believe this functionality is now handled by `ShardTensor`: https://docs.nvidia.com/physicsnemo/latest/user-guide/domain_parallelism_entry_point.html. Please open a fresh issue if more functionality is needed.
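Much of the bookkeeping behind consistent scatter/gather comes down to uneven shards: the gather side must know each rank's size to reassemble the data exactly. The following is a minimal single-process sketch of that idea, assuming a 1-D buffer split along one axis. `split_sizes`, `scatter`, and `gather` are hypothetical names, not the `ShardTensor` API, and no process group is involved.

```python
from typing import List, Sequence, Tuple


def split_sizes(n: int, world_size: int) -> List[int]:
    """Split n items over world_size ranks; the first n % world_size ranks get one extra."""
    base, extra = divmod(n, world_size)
    return [base + (1 if r < extra else 0) for r in range(world_size)]


def scatter(data: Sequence[float], world_size: int) -> Tuple[List[List[float]], List[int]]:
    """Cut data into per-rank shards and return them together with the sizes used."""
    sizes = split_sizes(len(data), world_size)
    shards: List[List[float]] = []
    start = 0
    for size in sizes:
        shards.append(list(data[start:start + size]))
        start += size
    return shards, sizes


def gather(shards: Sequence[Sequence[float]], sizes: Sequence[int]) -> List[float]:
    """Concatenate shards in rank order, checking them against the recorded sizes."""
    if [len(s) for s in shards] != list(sizes):
        raise ValueError("shard sizes do not match the recorded layout")
    out: List[float] = []
    for shard in shards:
        out.extend(shard)
    return out


data = [float(i) for i in range(10)]
shards, sizes = scatter(data, world_size=4)
print(sizes)  # [3, 3, 2, 2]
print(gather(shards, sizes) == data)  # True
```

Carrying the size list alongside the shards is what lets the round trip succeed when the length is not divisible by the world size.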

fixing member reference

/blossom-ci

reduce_scatter_tensor raises ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY in multi-node usage

Hi all, and thanks @garrett361 for another bug report! I can confirm I have reproduced this on Sunspot with the 2024.1 oneAPI release and the corresponding IPEX. oneCCL is linked to...

reduce_scatter_tensor raises ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY in multi-node usage

I also notice a dramatic timing difference between running 12 ranks on one node (very fast) and 12 ranks over 2 nodes (much slower). Yes, bandwidth is not...
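A first-order way to reason about this gap is the alpha-beta cost model for a ring reduce-scatter: each of the `p - 1` steps sends one `n / p` chunk, and a synchronized ring moves at the pace of its slowest link, so `T = (p - 1) * (alpha + (n / p) / B)`. The sketch below assumes a ring algorithm (oneCCL may choose a different one) and uses made-up latency and bandwidth values, not measurements from Sunspot.

```python
def ring_reduce_scatter_time(n_bytes: float, p: int, latency_s: float, bandwidth_Bps: float) -> float:
    """Alpha-beta estimate for a ring reduce-scatter.

    Each of the p - 1 steps sends one n_bytes / p chunk over the slowest link
    in the ring, so T = (p - 1) * (latency + (n_bytes / p) / bandwidth).
    """
    if p < 2:
        return 0.0
    return (p - 1) * (latency_s + (n_bytes / p) / bandwidth_Bps)


# Hypothetical numbers: 1 GB input per rank, 12 ranks.
n_bytes = 1e9
intra = ring_reduce_scatter_time(n_bytes, p=12, latency_s=5e-6, bandwidth_Bps=100e9)  # all links intra-node
inter = ring_reduce_scatter_time(n_bytes, p=12, latency_s=10e-6, bandwidth_Bps=25e9)  # ring crosses the network
print(f"single node: {intra:.4f} s, two nodes: {inter:.4f} s, ratio {inter / intra:.1f}x")
```

Under this model the bandwidth term dominates for large buffers, so the slowdown tracks roughly the ratio of intra-node to inter-node link bandwidth.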

🐛[BUG]: MeshGraphNet tests fail if default norm is TELayerNorm

FYI @mnabian @ktangsali: I plan to make these changes after the release candidate (RC).

Creating an error when a SLURM variable isn't found, usually because …
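The idea of failing early with a clear message when expected Slurm variables are absent can be sketched as follows. `SLURM_PROCID`, `SLURM_NTASKS`, and `SLURM_LOCALID` are standard variables that `srun` sets for each task. The helper name `read_slurm_env` and the message wording are hypothetical and do not reproduce the original commit. The optional `environ` argument exists only so the function can be exercised without a Slurm job.

```python
import os
from typing import Dict, Mapping, Optional

# Variables this hypothetical helper requires; srun exports all three per task.
REQUIRED_SLURM_VARS = ("SLURM_PROCID", "SLURM_NTASKS", "SLURM_LOCALID")


def read_slurm_env(environ: Optional[Mapping[str, str]] = None) -> Dict[str, int]:
    """Read rank information from Slurm variables, raising if any is missing."""
    env = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_SLURM_VARS if name not in env]
    if missing:
        raise RuntimeError(
            "Missing Slurm environment variable(s): "
            + ", ".join(missing)
            + ". Is this process running under Slurm?"
        )
    return {name: int(env[name]) for name in REQUIRED_SLURM_VARS}


# Usage with an explicit mapping instead of the real environment.
info = read_slurm_env({"SLURM_PROCID": "3", "SLURM_NTASKS": "8", "SLURM_LOCALID": "1"})
print(info["SLURM_PROCID"], info["SLURM_NTASKS"])  # 3 8
```

Listing every missing variable in one message saves a round of rerunning the job to discover the next one.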

/blossom-ci