Could not work, even using the official script

hellangleZ opened this issue 1 year ago • 5 comments

(TE) root@bjdb-h20-node-118:/aml/TransformerEngine/examples/pytorch/fsdp# torchrun --standalone --nnodes=1 --nproc-per-node=$(nvidia-smi -L | wc -l) fsdp.py
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757]
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757] *****************************************
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757] *****************************************
Fatal Python error: Segmentation fault

Current thread 0x00007f2714afb740 (most recent call first):
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113 in _call_store
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 64 in __init__
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 259 in create_backend
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 36 in _create_c10d_handler
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/api.py", line 263 in create_handler
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 66 in get_rendezvous_handler
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 235 in launch_agent
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 132 in __call__
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/run.py", line 870 in run
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/run.py", line 879 in main
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347 in wrapper
  File "/root/miniconda3/envs/TE/bin/torchrun", line 8 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special (total: 20)
Segmentation fault (core dumped)

hellangleZ avatar Jul 12 '24 10:07 hellangleZ

We should make sure your system is correctly configured and that the distributed job is launched correctly. It's odd that fsdp.py didn't print out the world size after initialization: https://github.com/NVIDIA/TransformerEngine/blob/8e039fdcd98fc56582d81e373880c1509c2b8f73/examples/pytorch/fsdp/fsdp.py#L207

Can you try running the following script with python -m torch.distributed.launch --standalone --nnodes=1 --nproc-per-node=1 test.py?

import sys
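
# Print to stderr so progress messages are less likely to be lost to stdout buffering if the process crashes.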
def _print(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)

_print("Starting script")
import torch
_print("Imported PyTorch")
torch.distributed.init_process_group(backend="nccl")
_print("Initialized NCCL")
rank = torch.distributed.get_rank()
world_size = torch.distributed.get_world_size()
_print(f"{rank=}, {world_size=}")

timmoon10 avatar Jul 12 '24 18:07 timmoon10

Hi @timmoon10, this is the output of the script:

[screenshot of the test script's output]

hellangleZ avatar Jul 13 '24 01:07 hellangleZ

Interesting, so we need to figure out why the toy script worked while the FSDP script failed somewhere earlier than this point:

https://github.com/NVIDIA/TransformerEngine/blob/8e039fdcd98fc56582d81e373880c1509c2b8f73/examples/pytorch/fsdp/fsdp.py#L205-L207

Differences I can see (a sketch for isolating them follows this list):

  • python -m torch.distributed.launch vs torchrun
  • Multi-GPU vs single-GPU
  • torch.cuda.set_device
  • PyTorch and Transformer Engine imports: https://github.com/NVIDIA/TransformerEngine/blob/8e039fdcd98fc56582d81e373880c1509c2b8f73/examples/pytorch/fsdp/fsdp.py#L10-L22
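
One hypothetical way to pin down which of these differences matters is to extend the toy script above with each piece, one step at a time. The sketch below does that; the file name test_extended.py and the te import alias are illustrative assumptions, not taken from the thread. The idea is simply to add the Transformer Engine import and the torch.cuda.set_device call, then launch under the same torchrun conditions used for fsdp.py.

import os
import sys

def _print(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)

_print("Starting script")
import torch
_print("Imported PyTorch")
import transformer_engine.pytorch as te  # imported only to check that the import itself does not crash
_print("Imported Transformer Engine")

# torchrun exports LOCAL_RANK for each worker; pin the process to its GPU
# before initializing NCCL, mirroring the torch.cuda.set_device call in fsdp.py.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)
_print(f"Set device to cuda:{local_rank}")

torch.distributed.init_process_group(backend="nccl")
_print("Initialized NCCL")
rank = torch.distributed.get_rank()
world_size = torch.distributed.get_world_size()
_print(f"{rank=}, {world_size=}")

Running it with torchrun --standalone --nnodes=1 --nproc-per-node=$(nvidia-smi -L | wc -l) test_extended.py reproduces the multi-GPU, torchrun-launched conditions; whichever _print message is missing points at the step that crashes.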

timmoon10 avatar Jul 15 '24 18:07 timmoon10

@hellangleZ you can also try this FSDP test from HuggingFace Accelerate that uses TE/FP8: https://github.com/huggingface/accelerate/tree/main/benchmarks/fp8. It handles the FSDP configuration.
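
For orientation, here is a minimal sketch of what an FP8 + FSDP training step driven by Accelerate can look like. It is an assumption based on Accelerate's public API (Accelerator(mixed_precision="fp8"), accelerator.prepare, accelerator.backward), not the code in the linked benchmark, and it assumes the job is started with accelerate launch using a config where FSDP is enabled (e.g. created via accelerate config); the model and data are toy placeholders.

import torch
from accelerate import Accelerator

# Accelerate reads the FSDP settings from the launch config, so the script
# itself only has to request FP8 mixed precision (backed by Transformer Engine when it is installed).
accelerator = Accelerator(mixed_precision="fp8")

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024)
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)  # wraps the model in FSDP per the launch config

x = torch.randn(8, 1024, device=accelerator.device)
loss = model(x).float().pow(2).mean()
accelerator.backward(loss)
optimizer.step()
optimizer.zero_grad()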

sbhavani avatar Sep 05 '24 14:09 sbhavani