Does not work, even with the official script

Open hellangleZ opened this issue 7 months ago • 5 comments

(TE) root@bjdb-h20-node-118:/aml/TransformerEngine/examples/pytorch/fsdp# torchrun --standalone --nnodes=1 --nproc-per-node=$(nvidia-smi -L | wc -l) fsdp.py
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757]
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757] *****************************************
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757] *****************************************
Fatal Python error: Segmentation fault

Current thread 0x00007f2714afb740 (most recent call first):
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113 in _call_store
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 64 in __init__
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 259 in create_backend
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 36 in _create_c10d_handler
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/api.py", line 263 in create_handler
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 66 in get_rendezvous_handler
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 235 in launch_agent
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 132 in __call__
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/run.py", line 870 in run
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/run.py", line 879 in main
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347 in wrapper
  File "/root/miniconda3/envs/TE/bin/torchrun", line 8 in

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special (total: 20)
Segmentation fault (core dumped)

hellangleZ, Jul 12 '24 10:07