TransformerEngine
create_communicator_grouped2 may trigger an uninitialized-value memory issue (random crashes) when training for more iterations.
Container:
nvcr.io/nvidia/pytorch:24.05-py3
Machine:
x86 CPU node with A100 GPUs
Reproduce:
python -m torch.distributed.run --nproc-per-node=2 examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py --num-iters=1000
It may or may not crash: at line 102, operator= triggers the destructor of the old std::function value of _alloc_copy_allgather, which is actually uninitialized memory, because the struct is allocated with malloc and its members are never constructed.
see also the definition of struct communicator
struct communicator {
  ...
  std::function<void(void **, void *, size_t, ExtComm)> _alloc_copy_allgather;  // will not be initialized by malloc
  std::function<void(ExtComm)> _barrier;                                        // will not be initialized by malloc
  std::function<void(void *)> _free;                                            // will not be initialized by malloc
  ...
};
Hope this hint helps move things forward.
The commit on this fork fixes it: https://github.com/denera/TransformerEngine/commit/7a9522bdbbe28d2682567ea450f10d87cc68d03a
@anderson101866 This should be fixed now in TE/main as of PR #1087. Could you check and close the issue if resolved?
Yes, it’s resolved. Thanks for your great help!