TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.

Results: 414 TransformerEngine issues

This PR modifies `te.distributed.checkpoint(...)` to preserve the `torch.amp.autocast(...)` context from the forward pass during the recompute phase. Reported in #787.

bug
1.7.0
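For context, a minimal sketch of the behavior this PR addresses, written in generic PyTorch rather than the TE implementation (all names below are illustrative): the recompute pass of a checkpointed region must re-enter the autocast context that was active during the original forward, or the recomputed activations come back in a different dtype than the saved ones.

```python
# Generic illustration, not the te.distributed.checkpoint code path.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

layer = nn.Linear(64, 64).cuda()

def block(x):
    # Inside autocast this matmul runs in fp16; a recompute that did not
    # restore the autocast context would instead produce fp32 activations.
    return layer(x)

x = torch.randn(8, 64, device="cuda", requires_grad=True)
with torch.amp.autocast("cuda", dtype=torch.float16):
    # torch.utils.checkpoint records the surrounding autocast state and
    # re-applies it during recompute -- the behavior the PR brings to
    # te.distributed.checkpoint.
    y = checkpoint(block, x, use_reentrant=False)
y.float().sum().backward()
```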

# Description The default `find_package(MPI)` searches for all components. In the CMakeLists.txt I modified, you link only against the `MPI::MPI_CXX` target, so it is unnecessary to find the other components...

Hi, we are looking into training some transformer models with FP8, and we see a lot of overhead on the CPU side when `te.Linear` layers are scheduled in the forward pass (a minimal reproduction sketch follows this entry)...

performance
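A minimal sketch of the setup being described, assuming stacked `te.Linear` layers under `fp8_autocast` with a default `DelayedScaling` recipe (layer sizes and iteration counts are arbitrary): because kernel launches are asynchronous, the time measured before the final synchronize approximates the CPU-side scheduling cost, while the post-synchronize time includes GPU execution.

```python
import time
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling

layers = torch.nn.ModuleList(
    [te.Linear(4096, 4096, bias=True) for _ in range(8)]
).cuda()
x = torch.randn(16, 4096, device="cuda")
recipe = DelayedScaling()

def forward():
    h = x
    with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
        for layer in layers:
            h = layer(h)
    return h

forward()  # warm-up: workspace allocation, first-call setup
torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(100):
    forward()
cpu_time = time.perf_counter() - t0   # time spent launching work
torch.cuda.synchronize()
total_time = time.perf_counter() - t0  # launch + GPU execution
print(f"CPU launch: {cpu_time:.3f}s, CPU+GPU: {total_time:.3f}s")
```

If the workload turns out to be launch-bound, CUDA Graphs (e.g. via `te.make_graphed_callables`) may amortize the per-launch cost, though whether that applies here depends on the model.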

I am trying to install TransformerEngine using the following: `pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable` and am facing the following error:

```
RuntimeError: Error when running CMake: Command '['/tmp/pip-req-build-wpw9pxi1/.eggs/cmake-3.28.3-py3.11-linux-x86_64.egg/cmake/data/bin/cmake', '-S', '/tmp/pip-req-build-wpw9pxi1/transformer_engine', '-B', '/tmp/tmps_krasnv', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX=/tmp/pip-req-build-wpw9pxi1/build/lib.linux-x86_64-cpython-311', '-Dpybind11_DIR=/home/shabs/anaconda3/envs/NeMo/lib/python3.11/site-packages/pybind11/share/cmake/pybind11']'...
```

build

Hi, we are testing our new Hopper machines (H800/H100) and trying to use FP8 for training for the first time, but are having trouble installing `TransformerEngine`. It reports `RuntimeError:...`

**Summary:** When using context parallelism, we've observed that adopting fp32 accumulation for attention operations in both the forward and backward passes significantly improves numerical accuracy. This approach aligns with practices...

community-contribution
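A generic sketch of the practice described above, in plain PyTorch rather than the TE context-parallel kernels: keep the score matmul, softmax, and value accumulation in fp32 even when the inputs are half precision, then cast back for the next layer.

```python
import torch

def attention_fp32_accum(q, k, v, scale):
    # q, k, v: [batch, heads, seq, head_dim] in fp16/bf16
    scores = torch.matmul(q.float(), k.float().transpose(-2, -1)) * scale
    probs = torch.softmax(scores, dim=-1)  # softmax in fp32
    out = torch.matmul(probs, v.float())   # accumulate output in fp32
    return out.to(q.dtype)                 # cast back to the input dtype

q = torch.randn(2, 8, 128, 64, dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = attention_fp32_accum(q, k, v, scale=64 ** -0.5)
```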

Is there currently a way to use MLP without applying the LayerNorm? What would be the best way to implement this? Thanks!

enhancement
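One possibility, sketched below on the assumption that what is wanted is the MLP half of `te.LayerNormMLP` without the normalization: compose two `te.Linear` layers with an activation in between (hidden sizes are illustrative).

```python
import torch
import transformer_engine.pytorch as te

class MLP(torch.nn.Module):
    """An MLP built from te.Linear, with no LayerNorm applied."""

    def __init__(self, hidden_size, ffn_size):
        super().__init__()
        self.fc1 = te.Linear(hidden_size, ffn_size, bias=True)
        self.fc2 = te.Linear(ffn_size, hidden_size, bias=True)

    def forward(self, x):
        return self.fc2(torch.nn.functional.gelu(self.fc1(x)))

mlp = MLP(1024, 4096).cuda()
y = mlp(torch.randn(16, 1024, device="cuda"))
```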

This adds support for multi-node NVLink architectures. In addition, it includes changes to make the CE deadlock checker configurable at runtime.

Is the `interval` attribute of [DelayedScaling](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/api/common.html#transformer_engine.common.recipe.DelayedScaling) unused in the PyTorch implementation of the current TransformerEngine? In other words, does the value of `DelayedScaling.interval` affect how often the scaling factor is recomputed...
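For reference, a minimal sketch of where `interval` sits in a recipe, with illustrative field values; the question above is whether this field still changes how often the scaling factor is recomputed.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

recipe = DelayedScaling(
    margin=0,
    interval=1,                 # the attribute in question
    fp8_format=Format.HYBRID,   # E4M3 forward, E5M2 backward
    amax_history_len=1024,
    amax_compute_algo="max",
)

linear = te.Linear(1024, 1024).cuda()
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    out = linear(torch.randn(16, 1024, device="cuda"))
```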