Megatron-LM
[BUG] ModuleNotFoundError: No module named 'scaled_softmax_cuda'
Describe the bug
When I try to run single-GPU T5 pretraining with the script examples/pretrain_t5.sh, it outputs the following error:
ModuleNotFoundError: No module named 'scaled_softmax_cuda'
It seems that the code is missing the module scaled_softmax_cuda. Do I need to install a separate Python module?
Stack trace/logs
Traceback (most recent call last):
  File "/home/ubuntu/projects/Megatron-LM/pretrain_t5.py", line 239, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, ModelType.encoder_and_decoder,
  File "/home/ubuntu/projects/Megatron-LM/megatron/training.py", line 261, in pretrain
    iteration, num_floating_point_operations_so_far = train(
  File "/home/ubuntu/projects/Megatron-LM/megatron/training.py", line 967, in train
    train_step(forward_step_func,
  File "/home/ubuntu/projects/Megatron-LM/megatron/training.py", line 532, in train_step
    losses_reduced = forward_backward_func(
  File "/home/ubuntu/projects/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 372, in forward_backward_no_pipelining
    output_tensor = forward_step(
  File "/home/ubuntu/projects/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 192, in forward_step
    output_tensor, loss_func = forward_step_func(data_iterator, model)
  File "/home/ubuntu/projects/Megatron-LM/pretrain_t5.py", line 176, in forward_step
    output_tensor = model(tokens_enc,
  File "/home/ubuntu/Venv/torch2.0.0-cu118-cp310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/projects/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py", line 179, in forward
    return self.module(*inputs, **kwargs)
  File "/home/ubuntu/Venv/torch2.0.0-cu118-cp310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/projects/Megatron-LM/megatron/model/module.py", line 190, in forward
    outputs = self.module(*inputs, **kwargs)
  File "/home/ubuntu/Venv/torch2.0.0-cu118-cp310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/projects/Megatron-LM/megatron/model/t5_model.py", line 118, in forward
    lm_output = self.language_model(encoder_input_ids,
  File "/home/ubuntu/Venv/torch2.0.0-cu118-cp310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/projects/Megatron-LM/megatron/model/language_model.py", line 527, in forward
    decoder_output = self.decoder(
  File "/home/ubuntu/Venv/torch2.0.0-cu118-cp310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/projects/Megatron-LM/megatron/model/transformer.py", line 1776, in forward
    hidden_states = layer(
  File "/home/ubuntu/Venv/torch2.0.0-cu118-cp310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/projects/Megatron-LM/megatron/model/transformer.py", line 1210, in forward
    self.default_decoder_cross_attention(
  File "/home/ubuntu/projects/Megatron-LM/megatron/model/transformer.py", line 943, in default_decoder_cross_attention
    self.inter_attention(norm_output,
  File "/home/ubuntu/Venv/torch2.0.0-cu118-cp310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/projects/Megatron-LM/megatron/model/transformer.py", line 798, in forward
    context_layer = self.core_attention(
  File "/home/ubuntu/Venv/torch2.0.0-cu118-cp310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/projects/Megatron-LM/megatron/model/transformer.py", line 384, in forward
    attention_probs = self.scale_mask_softmax(attention_scores,
  File "/home/ubuntu/Venv/torch2.0.0-cu118-cp310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/projects/Megatron-LM/megatron/model/fused_softmax.py", line 148, in forward
    return self.forward_fused_softmax(input, mask)
  File "/home/ubuntu/projects/Megatron-LM/megatron/model/fused_softmax.py", line 190, in forward_fused_softmax
    return ScaledSoftmax.apply(input, scale)
  File "/home/ubuntu/Venv/torch2.0.0-cu118-cp310/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs) # type: ignore[misc]
  File "/home/ubuntu/projects/Megatron-LM/megatron/model/fused_softmax.py", line 80, in forward
    import scaled_softmax_cuda
ModuleNotFoundError: No module named 'scaled_softmax_cuda'
Environment (please complete the following information):
- Megatron-LM cafda9529d9956578014d4cb89b69b741702b514
- PyTorch 2.0.0
- CUDA 11.8
You may use either of the following solutions:
- The library scaled_softmax_cuda is contained in apex. You may install it from https://github.com/NVIDIA/apex .
- Add --no-masked-softmax-fusion to avoid the use of the fused kernel.
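As a rough illustration of choosing between the two solutions, the sketch below checks whether scaled_softmax_cuda is importable and, if it is not, returns the --no-masked-softmax-fusion flag to append to the launch command. The helper name extra_megatron_args is hypothetical and not part of Megatron-LM; only the module name and the flag come from this thread.

```python
import importlib.util


def extra_megatron_args(module_name="scaled_softmax_cuda"):
    """Return extra CLI flags for the launch script (hypothetical helper).

    If the fused softmax extension cannot be found, fall back to the
    unfused path with the flag suggested in this thread.
    """
    if importlib.util.find_spec(module_name) is None:
        # Extension is not installed in this environment.
        return ["--no-masked-softmax-fusion"]
    # Extension is importable, so no extra flag is needed.
    return []


if __name__ == "__main__":
    # Prints the flag when the extension is missing, an empty line otherwise.
    print(" ".join(extra_megatron_args()))
```

The printed flag can then be appended to the arguments in examples/pretrain_t5.sh by hand.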
Thank you for your reply. Solution 2 fixed the problem. However, after I installed apex, the ModuleNotFoundError still occurs. The install command was:
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
Running pip list | grep apex shows apex with version 0.1, and I can find scaled_masked_softmax_cuda.cpython-310-x86_64-linux-gnu.so in torch2.0.0-cu118-cp310/lib/python3.10/site-packages. Did I fail to install scaled_softmax_cuda?
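The manual check described above can be sketched in a few lines: list the files in site-packages whose names contain "softmax" and see whether a scaled_softmax_cuda extension is among them. The helper name list_extension_files and the glob pattern are assumptions made for illustration. Note that scaled_masked_softmax_cuda and scaled_softmax_cuda are different modules, so finding the former does not mean the latter was built.

```python
import sysconfig
from pathlib import Path


def list_extension_files(directory, pattern="*softmax*"):
    """Return sorted names of files in `directory` matching `pattern` (hypothetical helper)."""
    return sorted(p.name for p in Path(directory).glob(pattern) if p.is_file())


if __name__ == "__main__":
    # "platlib" is the site-packages directory for compiled extensions.
    site_packages = sysconfig.get_paths()["platlib"]
    for name in list_extension_files(site_packages):
        print(name)
```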
@liuliuliu0605
I installed apex in the same way. scaled_softmax_cuda should also be included in apex.
@yuantailing Thanks for providing the details. I remember that I first tried to install the apex master branch, but it failed. The log is install.log. Could this be caused by an incompatible CUDA version?
So I chose to install the apex 22.04-dev branch, which does not actually include the scaled_softmax_cuda.cu file. Therefore, the module scaled_softmax_cuda cannot be found.
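One way to catch this before building is to check whether the apex checkout contains the kernel source at all. The sketch below searches a source tree recursively for scaled_softmax_cuda.cu. The helper name find_kernel_sources and the default checkout path are hypothetical, and the search is recursive so that no particular directory layout inside apex is assumed.

```python
import sys
from pathlib import Path


def find_kernel_sources(checkout_dir, filename="scaled_softmax_cuda.cu"):
    """Return all paths under `checkout_dir` named `filename` (hypothetical helper)."""
    return sorted(Path(checkout_dir).rglob(filename))


if __name__ == "__main__":
    # Pass the path of the apex checkout; "./apex" is only a placeholder default.
    checkout = sys.argv[1] if len(sys.argv) > 1 else "./apex"
    matches = find_kernel_sources(checkout)
    if matches:
        for path in matches:
            print(f"found: {path}")
    else:
        print("scaled_softmax_cuda.cu not found; this branch cannot build the module")
```

An empty result for a given branch would explain why the build produces no scaled_softmax_cuda module even when the install command itself succeeds.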
Marking as stale. No activity in 60 days.