Corey Adams
Here you go: https://gist.github.com/coreyjadams/ad7ba4d544822d14a8e9bd1b9849e004
Your changes surprise me: I would not expect switching from `cudaMemcpy` to `cudaMemcpyAsync` to ... enforce synchronization? Either way, though, I can confirm it's fixed on both my laptop and...
Update: It's working multi-node now too, with rather large memory buffers copying successfully. I think the fix is successful!
On `DTensor`: It is unlikely that `DTensor` will ever be suitable for this task. The challenge is that `DTensor` explicitly assumes tensors are distributed across ranks as if you called...
I believe this functionality is now handled with `ShardTensor`: https://docs.nvidia.com/physicsnemo/latest/user-guide/domain_parallelism_entry_point.html. Please open a fresh issue if more functionality is needed.
/blossom-ci
Hi all, thanks @garrett361 for another bug report! I want to confirm I have reproduced this on Sunspot, with the 2024.1 oneAPI release and corresponding ipex. oneCCL is linked to...
I also notice a dramatic timing difference between doing 12 ranks on one node (very, very fast) and 12 ranks over 2 nodes (a lot slower). Yes, bandwidth is not...
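For anyone who wants to put numbers on the one-node versus two-node gap, a minimal timing sketch could look like the following. This is not the benchmark behind the observation above: `time_call` and `fake_collective` are hypothetical names, and the placeholder would need to be replaced by the actual collective call, plus whatever device synchronization the backend requires, before the numbers mean anything.

```python
import statistics
import time


def time_call(fn, warmup=3, iters=10):
    """Call fn() `warmup` times untimed, then time `iters` calls.

    Returns the min, median, and max wall-clock time in seconds.
    """
    # Warmup calls absorb one-time costs (lazy initialization, caches).
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)

    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
        "iters": iters,
    }


def fake_collective():
    # Hypothetical stand-in for the real collective (assumption: the real
    # call blocks until the communication has actually completed).
    sum(range(10_000))


if __name__ == "__main__":
    print(time_call(fake_collective))
```

Running the same script once with all ranks on one node and once with the ranks split across two nodes, and comparing the medians, would give a rough measure of the slowdown.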
FYI @mnabian @ktangsali I plan to make these changes after the release candidate.