
[BUG] AttributeError: module 'transformer_engine' has no attribute 'pytorch'

Open zhentingqi opened this issue 1 year ago • 6 comments

Describe the bug
I am running InstructRetro, starting with data preprocessing via bash tools/retro/examples/preprocess_data.sh db-build.

Stack trace/logs
Due to torchrun's multiprocessing, the output stack trace is messy. I manually extracted the error message below:

in <module> class TELinear(te.pytorch.Linear):
AttributeError: module 'transformer_engine' has no attribute 'pytorch'

Environment (please complete the following information):

  • Megatron-LM commit ID: bd6f4ead41dac8aa8d50f46253630b7eba84bcdf
  • PyTorch version: 2.1.1
  • CUDA version: 12.2

zhentingqi avatar Feb 19 '24 20:02 zhentingqi

Same problem, PyTorch version 2.1.0 and CUDA version 12.0.

But the pytorch directory does exist in the transformer_engine package directory.

sakura-umi avatar Mar 04 '24 09:03 sakura-umi

Same problem.

changingivan avatar Mar 08 '24 09:03 changingivan

Same problem

aeeeeeep avatar Mar 09 '24 11:03 aeeeeeep

It seems the issue is caused by libtorch_cuda_cpp.so being missing when the dependencies are imported in __init__.py; the resulting ImportError is caught and silently ignored by a try/except block, which is why only the AttributeError shows up later.

import flash_attn_2_cuda as flash_attn_cuda
ImportError: libtorch_cuda_cpp.so: cannot open shared object file: No such file or directory
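
The masking pattern looks roughly like this (a hypothetical sketch of what such a guarded import does, not the real transformer_engine/__init__.py):

# Hypothetical sketch: a guarded import that swallows the real failure.
# Inside transformer_engine/__init__.py this would be a relative import.
try:
    import transformer_engine.pytorch  # fails if flash_attn_2_cuda can't load libtorch_cuda_cpp.so
except ImportError:
    pass  # swallowed, so te.pytorch is missing and an AttributeError surfaces later

Running the inner import on its own (python -c "import transformer_engine.pytorch") surfaces the underlying ImportError instead of the AttributeError.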

In my env, I can only find these two files:

/home/aep/miniconda3/envs/deepspeed/lib/python3.8/site-packages/torch/lib/libtorch_cuda_linalg.so
/home/aep/miniconda3/envs/deepspeed/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so

Discussions indicate that in newer versions of torch, libtorch_cuda_cpp.so is no longer generated: https://discuss.pytorch.org/t/no-libtorch-cuda-cpp-so-available-when-build-pytorch-from-source/159864/6
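
To check which libtorch_cuda libraries your own torch build ships, a small helper I'm adding for illustration (not part of the original report):

import os
import torch

# List the libtorch_cuda* shared objects bundled with the installed torch build.
lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
print(sorted(f for f in os.listdir(lib_dir) if f.startswith("libtorch_cuda")))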

Thus, the issue seems to be that the flash-attn build pulled in by TransformerEngine's default installation has not yet been adapted to newer PyTorch versions. I resolved the problem by recompiling flash-attn from source:

git clone https://github.com/Dao-AILab/flash-attention -b v2.4.2
cd flash-attention
MAX_JOBS=8 pip install -e .
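
After the rebuild, a quick sanity check (my own suggestion, assuming the same environment) is to import the extension module that previously failed to load:

# Verify the rebuilt flash-attn loads against the installed torch.
import flash_attn
import flash_attn_2_cuda  # the extension module from the traceback above
print(flash_attn.__version__)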

aeeeeeep avatar Mar 10 '24 03:03 aeeeeeep

Same problem even after recompiling flash-attn from source. PyTorch version 2.1.1, CUDA version 12.1.

hwang2006 avatar Apr 22 '24 23:04 hwang2006

PyTorch version 2.1.0, CUDA version 12.1. It didn't work with the stable version of TransformerEngine (transformer-engine 1.5.0+6a9edc3). I don't know exactly why, but it worked after installing the latest version of TransformerEngine from source (transformer-engine-1.7.0.dev0+9709147).

$ git clone --recursive https://github.com/NVIDIA/TransformerEngine.git

$ cd TransformerEngine
$ export NVTE_FRAMEWORK=pytorch   # Optionally set framework
$ pip install .
......
Stored in directory: /tmp/pip-ephem-wheel-cache-10dvr64i/wheels/9d/cf/7f/d14555553b5b30698dae0a4159fdd058157e7021cec565ecaa
Successfully built transformer-engine flash-attn
Installing collected packages: flash-attn, transformer-engine
Successfully installed flash-attn-2.4.2 transformer-engine-1.7.0.dev0+9709147

The following seems to work as well:

$ pip install git+https://github.com/NVIDIA/TransformerEngine.git@9709147
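
Either way, a quick check (my addition, not from the original comment) that the rebuilt package now exposes the pytorch submodule:

import transformer_engine as te
import transformer_engine.pytorch

# The attribute the original error complained about should now resolve.
print(te.pytorch.Linear)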

hwang2006 avatar Apr 28 '24 00:04 hwang2006

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Jun 27 '24 18:06 github-actions[bot]