TransformerEngine

create_communicator_grouped2 may trigger an uninitialized-value memory issue (random crash) when training for more iterations.

Open anderson101866 opened this issue 1 year ago • 1 comments

Container:

nvcr.io/nvidia/pytorch:24.05-py3

Machine:

x86 CPU with A100 node

Reproduce:

python -m torch.distributed.run --nproc-per-node=2 examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py --num-iters=1000

It may or may not crash: at line 102, operator= runs the destructor on the old std::function value held in _alloc_copy_allgather, and that old value is actually uninitialized. See also the definition of struct communicator:

struct communicator {
...  
std::function<void(void **, void *, size_t, ExtComm)> _alloc_copy_allgather; //will not be initialized by malloc
std::function<void(ExtComm)> _barrier; //will not be initialized by malloc
std::function<void(void *)> _free; //will not be initialized by malloc
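
To illustrate the hazard described above, here is a minimal sketch of one way to make such an assignment safe. It assumes, as the comments in the struct suggest, that the object's memory comes from malloc. DemoComm, demo_comm_create, and demo_comm_destroy are hypothetical names for this example, not TransformerEngine code, and this is not necessarily how the issue was fixed upstream.

```cpp
#include <cassert>
#include <cstdlib>
#include <functional>
#include <new>

// Hypothetical stand-in for the real struct; only the shape matters here.
struct DemoComm {
  int nranks;
  std::function<void(void *)> free_fn;  // non-trivial member: needs a constructor call
};

// Allocate with malloc, then run the constructor via placement new so that
// free_fn starts out as a valid, empty std::function.
DemoComm *demo_comm_create() {
  void *raw = std::malloc(sizeof(DemoComm));
  if (raw == nullptr) return nullptr;
  // Without this placement new, the assignment to free_fn below would run
  // operator= on garbage bytes: undefined behavior that may crash at random.
  DemoComm *c = new (raw) DemoComm();
  assert(!c->free_fn);  // constructed: empty, but in a valid state
  c->nranks = 2;
  c->free_fn = [](void *p) { std::free(p); };  // safe: old value is a real object
  return c;
}

// Mirror of the creation path: run the destructor explicitly, then free.
void demo_comm_destroy(DemoComm *c) {
  if (c == nullptr) return;
  c->~DemoComm();
  std::free(c);
}
```

Allocating with new and releasing with delete would achieve the same thing with less ceremony; placement new is shown only because it keeps the malloc-based allocation unchanged.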

Hope this hint helps.

anderson101866, Jun 24 '24 13:06

This commit on a fork fixes the issue: https://github.com/denera/TransformerEngine/commit/7a9522bdbbe28d2682567ea450f10d87cc68d03a

anderson101866, Jun 28 '24 15:06

@anderson101866 This should be fixed now in TE/main as of PR #1087. Could you check and close the issue if resolved?

denera, Aug 16 '24 20:08

Yes, it’s resolved. Thanks for your great help!

anderson101866, Aug 17 '24 05:08