TransformerEngine
Does not work, even when using the official script
(TE) root@bjdb-h20-node-118:/aml/TransformerEngine/examples/pytorch/fsdp# torchrun --standalone --nnodes=1 --nproc-per-node=$(nvidia-smi -L | wc -l) fsdp.py
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757]
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757] *****************************************
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757] *****************************************
Fatal Python error: Segmentation fault
Current thread 0x00007f2714afb740 (most recent call first):
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113 in _call_store
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 64 in init
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 259 in create_backend
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 36 in _create_c10d_handler
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/api.py", line 263 in create_handler
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 66 in get_rendezvous_handler
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 235 in launch_agent
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 132 in call
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/run.py", line 870 in run
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/run.py", line 879 in main
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347 in wrapper
File "/root/miniconda3/envs/TE/bin/torchrun", line 8 in
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special (total: 20)
Segmentation fault (core dumped)
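Note that the crash is in the torchrun launcher process itself, inside _call_store in the c10d rendezvous backend, i.e. before the FSDP example or Transformer Engine ever runs. A minimal, hypothetical probe of that store layer (the host and port below are placeholders; torchrun --standalone picks its own free port at runtime) would look like:

# Hypothetical store probe, not part of the original report: create a c10d
# TCPStore and do a set/get round trip, mirroring what the rendezvous backend does.
import sys
from datetime import timedelta
from torch.distributed import TCPStore

def _print(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)

# Placeholder host/port for illustration only.
store = TCPStore("127.0.0.1", 29500, world_size=1, is_master=True,
                 timeout=timedelta(seconds=30))
_print("Created TCPStore")
store.set("probe_key", "probe_value")
_print("set() succeeded")
_print("get() returned:", store.get("probe_key"))

If this also segfaults, the failure is in the PyTorch c10d store layer of the environment rather than in the FSDP example.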
We should make sure your system is correctly configured and that the distributed job is launched correctly. It's odd that fsdp.py didn't print out the world size after initialization:
https://github.com/NVIDIA/TransformerEngine/blob/8e039fdcd98fc56582d81e373880c1509c2b8f73/examples/pytorch/fsdp/fsdp.py#L207
Can you try running the following script with python -m torch.distributed.launch --standalone --nnodes=1 --nproc-per-node=1 test.py?
import sys
def _print(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)
_print("Starting script")
import torch
_print("Imported PyTorch")
torch.distributed.init_process_group(backend="nccl")
_print("Initialized NCCL")
rank = torch.distributed.get_rank()
world_size = torch.distributed.get_world_size()
_print(f"{rank=}, {world_size=}")
Hi @timmoon10, this is the output of the script
Interesting, so we need to figure out why the toy script worked while the FSDP script failed somewhere before: https://github.com/NVIDIA/TransformerEngine/blob/8e039fdcd98fc56582d81e373880c1509c2b8f73/examples/pytorch/fsdp/fsdp.py#L205-L207
Differences I can see (a combined probe sketch follows this list):
- python -m torch.distributed.launch vs torchrun
- Multi-GPU vs single-GPU
- torch.cuda.set_device
- PyTorch and Transformer Engine imports: https://github.com/NVIDIA/TransformerEngine/blob/8e039fdcd98fc56582d81e373880c1509c2b8f73/examples/pytorch/fsdp/fsdp.py#L10-L22
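A hypothetical combined probe (the filename test_te.py is an assumption) extends the toy script with the torch.cuda.set_device call and the Transformer Engine import, so the last printed message identifies the failing step:

# Hypothetical extended probe (test_te.py): adds the pieces from the list
# above one at a time so the last printed message identifies the failing step.
import os
import sys

def _print(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)

_print("Starting script")
import torch
_print("Imported PyTorch")
import transformer_engine.pytorch as te  # noqa: F401 -- same import as fsdp.py
_print("Imported Transformer Engine")

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)
_print(f"Set device to cuda:{local_rank}")

torch.distributed.init_process_group(backend="nccl")
_print("Initialized NCCL")

rank = torch.distributed.get_rank()
world_size = torch.distributed.get_world_size()
_print(f"{rank=}, {world_size=}")

Launching it with the same command as the failing run, torchrun --standalone --nnodes=1 --nproc-per-node=$(nvidia-smi -L | wc -l) test_te.py, also reproduces the launcher and multi-GPU differences from the list.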
@hellangleZ you can also try this FSDP test from HuggingFace Accelerate that uses TE/FP8: https://github.com/huggingface/accelerate/tree/main/benchmarks/fp8. It handles the FSDP configuration.
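For reference, a minimal sketch of how TE FP8 is typically enabled through Accelerate; this is not the linked benchmark, and the FP8RecipeKwargs handler name and its backend argument are assumptions about a recent accelerate release. The FSDP wrapping itself would come from launching with accelerate launch and an FSDP config.

# Hypothetical sketch, not the linked benchmark: enable Transformer Engine FP8
# via HuggingFace Accelerate. Requires an FP8-capable GPU (e.g. H100/H20).
import torch
from accelerate import Accelerator
from accelerate.utils import FP8RecipeKwargs  # assumed kwargs handler name

accelerator = Accelerator(
    mixed_precision="fp8",
    kwargs_handlers=[FP8RecipeKwargs(backend="TE")],  # use Transformer Engine
)

# Dimensions divisible by 16 so the FP8 GEMMs are valid.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# prepare() applies the distributed strategy chosen via accelerate config
# (e.g. FSDP) and swaps in FP8 execution where supported.
model, optimizer = accelerator.prepare(model, optimizer)

x = torch.randn(16, 1024, device=accelerator.device)
loss = model(x).float().pow(2).mean()
accelerator.backward(loss)
optimizer.step()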