TransformerEngine
Cannot import and use transformer_engine after successful installation with No module named 'transformer_engine_extensions'
I have installed transformer_engine for use with Accelerate and Ray. The following requirements work fine for all sorts of distributed training:
torch==2.2.1
transformers==4.39.3
accelerate==0.29.2
deepspeed==0.14.0
datasets==2.18.0
sentencepiece==0.2.0
transformer_engine @ git+https://github.com/NVIDIA/TransformerEngine.git@stable
For reference, my Dockerfile looks like this:
FROM anyscale/ray:2.11.0-py39-cu121
COPY ./training-requirements.txt training-requirements.txt
RUN pip install -r training-requirements.txt --global-option=--debug
RUN python -c "import transformer_engine.pytorch as te"
I have also tried different versions of torch, including the latest.
I can import transformer_engine just fine when this image is deployed:
>>> import transformer_engine
>>>
However, when I try to import the PyTorch module, I get:
>>> import transformer_engine.pytorch as te
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ray/anaconda3/lib/python3.9/site-packages/transformer_engine/pytorch/__init__.py", line 6, in <module>
from .module import LayerNormLinear
File "/home/ray/anaconda3/lib/python3.9/site-packages/transformer_engine/pytorch/module/__init__.py", line 6, in <module>
from .layernorm_linear import LayerNormLinear
File "/home/ray/anaconda3/lib/python3.9/site-packages/transformer_engine/pytorch/module/layernorm_linear.py", line 13, in <module>
from .. import cpp_extensions as tex
File "/home/ray/anaconda3/lib/python3.9/site-packages/transformer_engine/pytorch/cpp_extensions/__init__.py", line 6, in <module>
from transformer_engine_extensions import *
ModuleNotFoundError: No module named 'transformer_engine_extensions'
And when I try to import transformer_engine_extensions directly, I get the same error:
>>> import transformer_engine_extensions
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'transformer_engine_extensions'
I'm wondering what is going on with the versions. Everything else works fine with torch; it's just that transformer_engine seems to be installed incorrectly.
Here are some NVIDIA specs as well:
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
$ nvidia-smi
Sat May 18 16:47:56 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
Do you know if PyTorch is installed before you install the requirements.txt? TE supports DL frameworks other than PyTorch (e.g. JAX), so part of our build process involves checking what frameworks are installed. I suspect that TE doesn't find PyTorch, so it skips building the PyTorch extensions. To force it to build the PyTorch extensions, can you try setting NVTE_FRAMEWORK=pytorch in the environment?
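For anyone trying this, a minimal sketch of the suggested workaround (assuming a pip-based install; in a Dockerfile the equivalent would be an `ENV NVTE_FRAMEWORK=pytorch` line before the `RUN pip install` step):

```shell
# Hedged sketch: force TE's build to target PyTorch even if framework
# auto-detection fails. NVTE_FRAMEWORK is read at build time, so it must
# be set in the environment of the pip install itself.
export NVTE_FRAMEWORK=pytorch
pip install "transformer_engine @ git+https://github.com/NVIDIA/TransformerEngine.git@stable"
```

Note that torch needs to be installed before this step so that TE's build can link against it.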
@timmoon10 Thanks. If anyone wants to try importing transformer_engine_extensions after successfully building the PyTorch extensions, this is how you do it:
import torch
import transformer_engine
import transformer_engine_extensions
I'm using CUDA 12.5 with the nightly CUDA 12.4 PyTorch build on Artix Linux, and this is not working for the current main branch of TE.
Solved. The transformer_engine_extensions module has been renamed to transformer_engine_torch on the current main branch.
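For reference, a quick smoke test after installing a recent build (assuming the rename described above; older wheels still expose the transformer_engine_extensions name instead):

```shell
# torch must be imported before the TE extension module,
# as noted earlier in the thread
python -c "import torch, transformer_engine, transformer_engine_torch"
```

If this exits without a ModuleNotFoundError, the PyTorch extensions were built and installed correctly.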