Megatron-LM
Fused kernel compilation could get stuck
Hi,
I've noticed that the program can get stuck at "using torch.float16 for parameters ...". The problem turned out to be that it was stuck compiling fused_kernels, and deleting megatron/fused_kernels/build seems to fix it. I'm not sure what causes this; I'm posting it in the hope that it is helpful.
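For reference, a minimal sketch of that workaround (the checkout path below is an assumption; adjust it to your setup):

```python
# Minimal sketch of the workaround described above: remove the stale JIT build
# directory so torch.utils.cpp_extension recompiles the fused kernels from scratch.
import shutil
from pathlib import Path

# Assumed location of the checkout; adjust to where your Megatron-LM clone lives.
build_dir = Path("Megatron-LM/megatron/fused_kernels/build")

if build_dir.exists():
    shutil.rmtree(build_dir)
    print(f"Removed {build_dir}; the kernels will be rebuilt on the next run.")
else:
    print(f"{build_dir} not found; nothing to clean.")
```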
Same problem, stuck here:
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
using torch.float16 for parameters ...
^CTraceback (most recent call last):
  File "pretrain_gpt.py", line 149, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/home/superbencher/Megatron-LM/megatron/training.py", line 87, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/home/superbencher/Megatron-LM/megatron/initialize.py", line 49, in initialize_megatron
    set_global_variables(extra_args_provider=extra_args_provider,
  File "/home/superbencher/Megatron-LM/megatron/global_vars.py", line 82, in set_global_variables
    args = _parse_args(extra_args_provider=extra_args_provider,
  File "/home/superbencher/Megatron-LM/megatron/global_vars.py", line 97, in _parse_args
    _GLOBAL_ARGS = parse_args(extra_args_provider=extra_args_provider,
  File "/home/superbencher/Megatron-LM/megatron/arguments.py", line 188, in parse_args
    fused_kernels.load_scaled_upper_triang_masked_softmax_fusion_kernel()
  File "/home/superbencher/Megatron-LM/megatron/fused_kernels/__init__.py", line 60, in load_scaled_upper_triang_masked_softmax_fusion_kernel
    scaled_upper_triang_masked_softmax_cuda = cpp_extension.load(
  File "/home/superbencher/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1079, in load
    return _jit_compile(
  File "/home/superbencher/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1306, in _jit_compile
    baton.wait()
  File "/home/superbencher/.local/lib/python3.8/site-packages/torch/utils/file_baton.py", line 42, in wait
    time.sleep(self.wait_seconds)
KeyboardInterrupt
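The traceback shows where the hang comes from: torch's cpp_extension JIT build guards the build directory with a file-based lock (a FileBaton), and every process that does not hold the lock sits in baton.wait(), polling until the lock file disappears. If an earlier build was killed and left the lock file behind, the wait never ends, which is why wiping megatron/fused_kernels/build unblocks things. A minimal sketch of that mechanism, using an illustrative lock path rather than Megatron's real one:

```python
# Illustrative sketch (not Megatron code) of the lock that the traceback is stuck on.
# torch.utils.cpp_extension._jit_compile wraps the build in a FileBaton; waiters
# poll until the lock file disappears, so a lock left behind by a killed build
# makes baton.wait() spin forever.
import os
from torch.utils.file_baton import FileBaton

lock_path = "/tmp/fused_kernels_demo/lock"   # illustrative; the real lock sits in the build dir
os.makedirs(os.path.dirname(lock_path), exist_ok=True)
open(lock_path, "a").close()                 # simulate a lock file left behind by a killed build

baton = FileBaton(lock_path)
if baton.try_acquire():
    try:
        print("Acquired the lock; this process would compile the kernels.")
    finally:
        baton.release()                      # deletes the lock file so waiters can proceed
else:
    print("Lock file already exists; baton.wait() would poll here indefinitely.")
    # baton.wait()  # <- the loop shown in the traceback above

os.remove(lock_path)                         # clean up the simulated stale lock
```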
You can skip it by setting --no-scaled-masked-softmax-fusion. I don't know how much that affects end-to-end performance.
Deleting Megatron-LM/megatron/fused_kernels/build/ and restarting works for me.
Same issue.
Actually, I got it to run by removing megatron/fused_kernels/build as suggested by @bugface, but I am wondering whether that is the right way to fix it?
Experiencing the same issue here, although the observed behaviour differed across nodes of the cluster (not sure whether that was caused by different software stacks or different GPUs). Deleting megatron/fused_kernels/build did not work for me, and I only managed to solve the issue by dropping the fused kernels entirely, as suggested by @huangjundashuaige. To update that solution: the arguments that do this in the current version are --no-masked-softmax-fusion and --no-bias-dropout-fusion.
Deleting megatron/fused_kernels/build is recommended if you have upgraded CUDA versions or moved to different hardware. Those changes are not detected automatically, so the rebuild of the kernels that they may require will not be triggered on its own.
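Until that detection exists, here is a hedged sketch of what the manual check could look like: record the toolchain and GPU that built the kernels next to the build directory, and wipe the directory when either changes. The paths and marker file name are illustrative assumptions, not anything Megatron-LM ships:

```python
# Hedged sketch: wipe the fused-kernel build directory whenever the torch version,
# CUDA toolkit, or GPU compute capability differs from the one recorded at build
# time. Paths and the marker file name are illustrative assumptions.
import json
import shutil
from pathlib import Path

import torch

BUILD_DIR = Path("Megatron-LM/megatron/fused_kernels/build")  # assumed checkout location
MARKER = BUILD_DIR / "build_env.json"                         # hypothetical marker file

current = {
    "torch": torch.__version__,
    "cuda": torch.version.cuda,
    "compute_capability": list(torch.cuda.get_device_capability()) if torch.cuda.is_available() else None,
}

previous = json.loads(MARKER.read_text()) if MARKER.exists() else None
if previous is not None and previous != current:
    print(f"Build environment changed ({previous} -> {current}); removing {BUILD_DIR}.")
    shutil.rmtree(BUILD_DIR)

BUILD_DIR.mkdir(parents=True, exist_ok=True)
MARKER.write_text(json.dumps(current))
```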
We will be addressing this soon by moving to the same prebuilt kernels from Apex, which will remove this custom kernel build step. I'll close this issue when that happens.
Marking as stale. No activity in 60 days.
Got stuck compiling the fused_kernels when training on multiple nodes, but it works well on a single node. Why?
Marking as stale. No activity in 60 days.
> Got stuck compiling the fused_kernels when training on multiple nodes, but it works well on a single node. Why?
Same here. Did you solve this problem?
> Got stuck compiling the fused_kernels when training on multiple nodes, but it works well on a single node. Why?
@SefaZeng Same problem here, have you fixed it?
> Got stuck compiling the fused_kernels when training on multiple nodes, but it works well on a single node. Why?
> Same here. Did you solve this problem?
+1, same issue here.
> Got stuck compiling the fused_kernels when training on multiple nodes, but it works well on a single node. Why?
Same here.
> Got stuck compiling the fused_kernels when training on multiple nodes, but it works well on a single node. Why?
Actually, it seems to be a problem with the PyTorch barrier, and simply setting NCCL_P2P_DISABLE=1 worked for me. Credit: https://discuss.pytorch.org/t/torch-distributed-barrier-doesnt-work-with-pytorch-2-0-and-backend-nccl/190232
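For anyone applying this from Python rather than the shell: NCCL reads NCCL_P2P_DISABLE from the environment when the process group is created, so it has to be set before init_process_group (or exported in the launcher's environment on every rank). A minimal sketch, assuming rank, world size, and master address come from the launcher (e.g. torchrun):

```python
# Minimal sketch of the workaround above. NCCL picks up NCCL_P2P_DISABLE when the
# process group is initialized, so set it before init_process_group (or export it
# in the environment of every rank via your launcher).
import os

os.environ["NCCL_P2P_DISABLE"] = "1"   # disable GPU peer-to-peer transfers

import torch
import torch.distributed as dist

# Assumes RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT/LOCAL_RANK come from the launcher.
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))
dist.init_process_group(backend="nccl")
dist.barrier()                         # the barrier that was hanging should now complete
dist.destroy_process_group()
```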
awesome to hear, will try this, thanks!
> Got stuck compiling the fused_kernels when training on multiple nodes, but it works well on a single node. Why?
> Actually, it seems to be a problem with the PyTorch barrier, and simply setting NCCL_P2P_DISABLE=1 worked for me. Credit: https://discuss.pytorch.org/t/torch-distributed-barrier-doesnt-work-with-pytorch-2-0-and-backend-nccl/190232
I met this problem on one of my nodes. Running on that node alone (NNODE=1) did not work either. To solve it, I applied NCCL_P2P_DISABLE=1 on that node. This seems to be a hardware-related / BIOS setting issue, but distributed training that excludes that node also worked for me.
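A rough way to spot such a node is sketched below: query whether the local GPUs report peer access to each other. This is only a coarse diagnostic, not a definitive test of the BIOS-level issue mentioned above:

```python
# Coarse diagnostic sketch: list which local GPU pairs report peer-to-peer access.
# A node whose pairs report no peer access (or where P2P transfers hang) is a
# candidate for NCCL_P2P_DISABLE=1 or for the BIOS-level investigation above.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'available' if ok else 'NOT available'}")
```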
Marking as stale. No activity in 60 days.