PiPPy

Hard-to-debug failure when passing a DTensor to spmd.distribute_tensor() (CUDA/NCCL only)

Open • aazzolini opened this issue • 0 comments

Passing a DTensor into spmd.distribute_tensor(), or more specifically into DeviceMesh, causes failures that are hard to trace back to their cause:

  • in device_mesh.broadcast, it triggers an assertion failure deep inside PyTorch internals;
  • in device_mesh.scatter, it causes an invalid free in the CUDA CachingAllocator.

This is most likely caused by corner cases in tensor subclassing (DTensor is a torch.Tensor subclass).

For now, DeviceMesh should at least check that no DTensor is passed into it.
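A minimal sketch of the proposed check, assuming the guard runs at the top of each DeviceMesh collective before any tensor reaches the process group. The helper names check_no_dtensor and _iter_leaves are hypothetical, and the DTensor import paths are assumptions about where upstream PyTorch exposes the class (this issue refers to PiPPy's spmd package). If neither import succeeds, a placeholder class is defined so the snippet still runs.

```python
from typing import Any, Iterable

try:
    # Assumption: newer PyTorch releases expose DTensor publicly here.
    from torch.distributed.tensor import DTensor
except ImportError:
    try:
        # Assumption: older PyTorch releases kept it under a private module.
        from torch.distributed._tensor import DTensor
    except ImportError:
        class DTensor:  # hypothetical stand-in so the sketch runs without torch
            """Placeholder used only when no DTensor implementation is importable."""


def _iter_leaves(obj: Any) -> Iterable[Any]:
    """Yield every non-container element of possibly nested lists/tuples."""
    if isinstance(obj, (list, tuple)):
        for item in obj:
            yield from _iter_leaves(item)
    else:
        yield obj


def check_no_dtensor(op_name: str, *args: Any) -> None:
    """Raise a clear TypeError if any argument (or nested list element) is a DTensor.

    Failing here replaces the deep assertion (broadcast) or the invalid
    free in the caching allocator (scatter) with an immediate, readable error.
    """
    for leaf in _iter_leaves(args):
        if isinstance(leaf, DTensor):
            raise TypeError(
                f"DeviceMesh.{op_name}() received a DTensor; "
                "pass a plain torch.Tensor (for example dtensor.to_local()) instead"
            )
```

A scatter implementation could call check_no_dtensor("scatter", output, scatter_list) as its first statement. Flattening nested lists matters because scatter typically receives its inputs as a list of tensors, so a DTensor hidden inside that list would otherwise slip past a check on the top-level arguments only.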

aazzolini, Oct 03 '22 04:10