[BUG] AttributeError: module 'transformer_engine' has no attribute 'pytorch'
Describe the bug
I am running InstructRetro, starting with data preprocessing via `bash tools/retro/examples/preprocess_data.sh db-build`.
Stack trace/logs
Due to torchrun's multiprocessing, the output stack trace is messy. I manually extracted the error message below:

```
in <module> class TELinear(te.pytorch.Linear):
AttributeError: module 'transformer_engine' has no attribute 'pytorch'
```
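A minimal sketch of the symptom (my own repro, assuming `transformer_engine` is installed in the broken environment; the attribute access is the same one Megatron's `TELinear` performs):

```python
# Repro sketch: when transformer_engine's framework extension fails to
# load, the `pytorch` submodule never becomes an attribute of the
# top-level package.
import transformer_engine as te

print(hasattr(te, "pytorch"))  # False in the broken environment
te.pytorch.Linear  # AttributeError: module 'transformer_engine' has no attribute 'pytorch'
```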
Environment (please complete the following information):
- Megatron-LM commit ID: bd6f4ead41dac8aa8d50f46253630b7eba84bcdf
- PyTorch version: 2.1.1
- CUDA version: 12.2
Same problem with PyTorch 2.1.0 and CUDA 12.0.
But the `pytorch` directory does exist in the `transformer_engine` package directory.
Same problem.
- Megatron-LM commit ID: 3709708ad233a0c2140de146ca6aaf3ecc05e66c
- PyTorch version: 2.1.0
- CUDA version: 12.0
Same problem.
- Megatron-LM commit ID: 89574689447d694bb19dd86fc8a6153b4467ba9d
- PyTorch version: 2.2.1
- CUDA version: 11.8
It seems that the issue is caused by `libtorch_cuda_cpp.so` being missing during the import of dependencies in `__init__.py`, and this error is caught and silently passed by a try/except block:

```
import flash_attn_2_cuda as flash_attn_cuda
ImportError: libtorch_cuda_cpp.so: cannot open shared object file: No such file or directory
```
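A simplified illustration of what happens (this is not the actual TransformerEngine `__init__.py`, just the general optional-import pattern); importing the submodule directly makes the real `ImportError` surface instead of the misleading `AttributeError`:

```python
# Simplified illustration of how the real failure gets hidden
# (not the actual TransformerEngine source):
try:
    import transformer_engine.pytorch  # fails while importing flash_attn_2_cuda
except ImportError:
    pass  # swallowed, so `transformer_engine.pytorch` simply never exists

# Diagnostic: re-import the submodule directly; failed imports are not
# cached, so this re-raises the underlying ImportError.
import importlib

importlib.import_module("transformer_engine.pytorch")
# ImportError: libtorch_cuda_cpp.so: cannot open shared object file ...
```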
In my environment, I can only find these two files:

```
/home/aep/miniconda3/envs/deepspeed/lib/python3.8/site-packages/torch/lib/libtorch_cuda_linalg.so
/home/aep/miniconda3/envs/deepspeed/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so
```
Discussions indicate that in newer versions of torch, `libtorch_cuda_cpp.so` is no longer generated:
https://discuss.pytorch.org/t/no-libtorch-cuda-cpp-so-available-when-build-pytorch-from-source/159864/6
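To confirm which libtorch libraries a given torch build ships, a small check (a sketch; it just lists the `lib/` directory inside the installed torch package):

```python
# Sketch: list the libtorch_cuda* shared libraries bundled with the
# installed torch wheel, to confirm libtorch_cuda_cpp.so is absent.
import os
import torch

lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
for name in sorted(os.listdir(lib_dir)):
    if name.startswith("libtorch_cuda"):
        print(name)
```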
Thus, the issue seems to be that TransformerEngine's default installation of flash-attn has not yet been adapted to the new version of PyTorch. I resolved this problem by recompiling flash-attn from source:

```bash
git clone https://github.com/Dao-AILab/flash-attention -b v2.4.2
cd flash-attention
MAX_JOBS=8 pip install -e .
```
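A quick sanity check (a minimal sketch, not from the repo) that the rebuilt extension now loads against the installed torch:

```python
# Sanity-check sketch: this is exactly the import that failed in the
# stack trace above; if it succeeds, the rebuild worked.
import flash_attn_2_cuda  # noqa: F401
import flash_attn

print("flash-attn version:", flash_attn.__version__)
```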
Same problem even after recompiling flash-attn from source, with PyTorch 2.1.1 and CUDA 12.1.
With PyTorch 2.1.0 and CUDA 12.1, it didn't work with the stable version of TransformerEngine (i.e., transformer-engine 1.5.0+6a9edc3). I don't know why, but it worked after installing the latest version of TransformerEngine from source (i.e., transformer-engine-1.7.0.dev0+9709147):
```bash
$ git clone --recursive https://github.com/NVIDIA/TransformerEngine.git
$ cd TransformerEngine
$ export NVTE_FRAMEWORK=pytorch  # Optionally set framework
$ pip install .
......
Stored in directory: /tmp/pip-ephem-wheel-cache-10dvr64i/wheels/9d/cf/7f/d14555553b5b30698dae0a4159fdd058157e7021cec565ecaa
Successfully built transformer-engine flash-attn
Installing collected packages: flash-attn, transformer-engine
Successfully installed flash-attn-2.4.2 transformer-engine-1.7.0.dev0+9709147
```
The following seems to work as well:

```bash
$ pip install git+https://github.com/NVIDIA/TransformerEngine.git@9709147
```
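Either way, a quick check (a sketch) that the source install actually fixed the original error:

```python
# Sketch: confirm the framework submodule is importable again;
# te.pytorch.Linear is the class Megatron's TELinear subclasses.
from importlib.metadata import version

import transformer_engine as te
import transformer_engine.pytorch  # raises ImportError if still broken

print("transformer-engine version:", version("transformer-engine"))
print(te.pytorch.Linear)
```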
Marking as stale. No activity in 60 days.