
Fused kernel compilation could get stuck

Open rhythmswing opened this issue 3 years ago • 17 comments

Hi,

I've noticed that the program can get stuck at "using torch.float16 for parameters ...". The hang turned out to be during compilation of fused_kernels, and deleting megatron/fused_kernels/build seems to fix it. I'm not sure what causes this; I'm posting it in the hope it's helpful.
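For anyone who hits this, a minimal sketch of that manual fix (not Megatron-LM code, and the build path is just where my checkout keeps the JIT cache): it simply forces torch.utils.cpp_extension.load() to rebuild from a clean directory.

```python
# Minimal sketch of the manual fix, not Megatron-LM code: wipe the possibly
# stale JIT build cache so torch.utils.cpp_extension.load() rebuilds cleanly.
import shutil
from pathlib import Path

build_dir = Path("megatron/fused_kernels/build")  # adjust to your checkout
if build_dir.exists():
    shutil.rmtree(build_dir)  # the next run recompiles the fused kernels from scratch
```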

rhythmswing avatar Mar 14 '21 05:03 rhythmswing

Same problem, stuck here.

```
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
using torch.float16 for parameters ...
^CTraceback (most recent call last):
  File "pretrain_gpt.py", line 149, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/home/superbencher/Megatron-LM/megatron/training.py", line 87, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/home/superbencher/Megatron-LM/megatron/initialize.py", line 49, in initialize_megatron
    set_global_variables(extra_args_provider=extra_args_provider,
  File "/home/superbencher/Megatron-LM/megatron/global_vars.py", line 82, in set_global_variables
    args = _parse_args(extra_args_provider=extra_args_provider,
  File "/home/superbencher/Megatron-LM/megatron/global_vars.py", line 97, in _parse_args
    _GLOBAL_ARGS = parse_args(extra_args_provider=extra_args_provider,
  File "/home/superbencher/Megatron-LM/megatron/arguments.py", line 188, in parse_args
    fused_kernels.load_scaled_upper_triang_masked_softmax_fusion_kernel()
  File "/home/superbencher/Megatron-LM/megatron/fused_kernels/__init__.py", line 60, in load_scaled_upper_triang_masked_softmax_fusion_kernel
    scaled_upper_triang_masked_softmax_cuda = cpp_extension.load(
  File "/home/superbencher/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1079, in load
    return _jit_compile(
  File "/home/superbencher/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1306, in _jit_compile
    baton.wait()
  File "/home/superbencher/.local/lib/python3.8/site-packages/torch/utils/file_baton.py", line 42, in wait
    time.sleep(self.wait_seconds)
KeyboardInterrupt
```
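For context on where this hangs: the baton.wait() frame at the bottom is PyTorch's file-based build lock around the JIT compile. Roughly (a paraphrased sketch of torch.utils.file_baton, not the exact upstream source), the mechanism looks like the code below. If a previous run died while holding the lock file inside the build directory, the file never disappears and wait() spins forever, which is why deleting megatron/fused_kernels/build unblocks it.

```python
# Paraphrased sketch of PyTorch's file-baton locking around the JIT build
# (illustration only; see torch/utils/file_baton.py for the real thing).
import os
import time

class FileBatonSketch:
    def __init__(self, lock_file_path, wait_seconds=0.1):
        self.lock_file_path = lock_file_path
        self.wait_seconds = wait_seconds

    def try_acquire(self):
        try:
            # Atomically create the lock file; fails if it already exists.
            fd = os.open(self.lock_file_path, os.O_CREAT | os.O_EXCL)
            os.close(fd)
            return True
        except FileExistsError:
            return False  # another (possibly dead) process holds the lock

    def wait(self):
        # Block until the lock file disappears. A lock file left behind by a
        # crashed build never disappears, so this loop -- the time.sleep()
        # in the traceback above -- runs forever.
        while os.path.exists(self.lock_file_path):
            time.sleep(self.wait_seconds)
```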

huangjundashuaige avatar Mar 23 '21 07:03 huangjundashuaige

You can skip it by setting --no-scaled-masked-softmax-fusion. I don't know how much that affects end-to-end performance, though.
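To give a rough idea of what disabling the fusion costs: the scaled upper-triangular (causal) masked softmax then has to be computed with plain PyTorch ops along these lines (an illustration of the computation, not Megatron-LM's exact fallback code), so the result is the same but with more kernel launches and extra memory traffic.

```python
# Illustration of the unfused path: scaled causal masked softmax with stock
# PyTorch ops instead of the fused CUDA kernel (not Megatron-LM's exact code).
import torch

def unfused_scaled_causal_masked_softmax(scores: torch.Tensor, scale: float) -> torch.Tensor:
    # scores: [batch, heads, seq, seq] raw attention logits
    seq = scores.size(-1)
    causal_mask = torch.triu(
        torch.ones(seq, seq, dtype=torch.bool, device=scores.device), diagonal=1
    )
    scores = scores * scale
    scores = scores.masked_fill(causal_mask, float("-inf"))  # hide future positions
    return torch.softmax(scores, dim=-1)
```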

huangjundashuaige avatar Mar 23 '21 07:03 huangjundashuaige

Deleting Megatron-LM/megatron/fused_kernels/build/ and restarting works for me.

bugface avatar Mar 23 '21 16:03 bugface

Same issue. I got it running by removing megatron/fused_kernels/build as suggested by @bugface, but I'm wondering whether that is the right way to fix it?

bottergpt avatar Dec 15 '22 13:12 bottergpt

Experiencing the same issue here, although the observed behaviour differed across nodes of the cluster (not sure whether that was due to different software stacks or different GPUs).

Deleting megatron/fused_kernels/build did not work for me; I only managed to solve the issue by dropping the fused kernels entirely, as suggested by @huangjundashuaige. To update that solution: the arguments that do this in the current version are --no-masked-softmax-fusion and --no-bias-dropout-fusion.

giacomocamposampiero avatar Dec 31 '22 09:12 giacomocamposampiero

Deleting megatron/fused_kernels/build is recommended if you have upgraded CUDA or moved to different hardware. Those changes are not detected automatically, so the rebuild of the kernels that may be required never gets triggered.

We will be addressing this soon by switching to the same prebuilt kernels from Apex and removing this custom kernel build step. I'll close this issue when that happens.
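In the meantime, a hypothetical guard along these lines (the stamp file and helper name are made up, not part of Megatron-LM) could record the toolchain used for the last build and wipe the cache whenever it changes:

```python
# Hypothetical helper, not part of Megatron-LM: invalidate the fused-kernels
# build cache when the PyTorch/CUDA toolchain or GPU architecture changes,
# since (as noted above) such changes are not detected automatically.
import json
import shutil
from pathlib import Path

import torch

def maybe_reset_fused_kernel_cache(build_dir: str = "megatron/fused_kernels/build") -> None:
    build_path = Path(build_dir)
    stamp_file = build_path / "toolchain_stamp.json"  # made-up marker file
    current = {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
        "capability": list(torch.cuda.get_device_capability()) if torch.cuda.is_available() else None,
    }
    previous = json.loads(stamp_file.read_text()) if stamp_file.exists() else None
    if previous != current:
        shutil.rmtree(build_path, ignore_errors=True)  # stale cache: force a full rebuild
        build_path.mkdir(parents=True, exist_ok=True)
        stamp_file.write_text(json.dumps(current))
```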

jon-barker avatar Jun 29 '23 20:06 jon-barker

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Aug 29 '23 18:08 github-actions[bot]

Got stuck compiling the fused_kernels when training on multiple nodes, but it works fine on a single node. Why?

SefaZeng avatar Sep 14 '23 04:09 SefaZeng

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Nov 13 '23 18:11 github-actions[bot]

> Got stuck compiling the fused_kernels when training on multiple nodes, but it works fine on a single node. Why?

Same here. Did you solve this problem?

MachineGunLin avatar Nov 27 '23 03:11 MachineGunLin

> Got stuck compiling the fused_kernels when training on multiple nodes, but it works fine on a single node. Why?

@SefaZeng Same problem here, have you fixed it?

ZhenYangIACAS avatar Dec 26 '23 09:12 ZhenYangIACAS

> Got stuck compiling the fused_kernels when training on multiple nodes, but it works fine on a single node. Why?

> Same here. Did you solve this problem?

+1, same issue here.

saforem2 avatar Jan 29 '24 22:01 saforem2

> Got stuck compiling the fused_kernels when training on multiple nodes, but it works fine on a single node. Why?

Same here.

mfdj2002 avatar Feb 09 '24 17:02 mfdj2002

> Got stuck compiling the fused_kernels when training on multiple nodes, but it works fine on a single node. Why?

Actually, it seems to be a problem with the PyTorch barrier; simply setting NCCL_P2P_DISABLE=1 worked for me. Credit: https://discuss.pytorch.org/t/torch-distributed-barrier-doesnt-work-with-pytorch-2-0-and-backend-nccl/190232
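In case it helps, the same workaround applied in-process (equivalent to exporting the variable in your launch script; NCCL reads it when its communicators are created, so it must be set before distributed init):

```python
# Sketch of the workaround above: disable NCCL peer-to-peer transport before
# torch.distributed creates any NCCL communicators. Assumes the usual torchrun
# environment variables (MASTER_ADDR, RANK, WORLD_SIZE, ...) are already set.
import os
os.environ["NCCL_P2P_DISABLE"] = "1"

import torch.distributed as dist
dist.init_process_group(backend="nccl")  # the barrier around the fused-kernel load reportedly no longer hangs
```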

mfdj2002 avatar Feb 09 '24 20:02 mfdj2002

awesome to hear, will try this, thanks!

saforem2 avatar Feb 09 '24 21:02 saforem2

> Got stuck compiling the fused_kernels when training on multiple nodes, but it works fine on a single node. Why?

> Actually, it seems to be a problem with the PyTorch barrier; simply setting NCCL_P2P_DISABLE=1 worked for me. Credit: https://discuss.pytorch.org/t/torch-distributed-barrier-doesnt-work-with-pytorch-2-0-and-backend-nccl/190232

I hit this problem on one of my nodes. Even running on that node alone (NNODE=1) did not work. To solve it, I applied NCCL_P2P_DISABLE=1 on that node. This seems to be a hardware-related / BIOS setting issue. Distributed training that excluded that node also worked for me.

kduxin avatar Mar 04 '24 03:03 kduxin

Marking as stale. No activity in 60 days.

github-actions[bot] avatar May 03 '24 18:05 github-actions[bot]