TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
This PR modifies `te.distributed.checkpoint(...)` to preserve the `torch.amp.autocast(...)` context from the forward pass during the recompute phase. Reported in #787.
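A minimal sketch of the behavior this change targets, assuming a `te.Linear` layer as the checkpointed module and bf16 autocast (module and shapes are placeholders, not from the PR itself): the recompute pass triggered during backward should observe the same `torch.amp.autocast` context that was active during the original forward.

```python
import torch
import transformer_engine.pytorch as te

# Stand-in module and shapes for illustration only.
layer = te.Linear(1024, 1024).cuda()
inp = torch.randn(8, 1024, device="cuda", requires_grad=True)

# Forward runs under bf16 autocast; with this change the recompute
# performed during backward should see the same autocast context.
with torch.amp.autocast("cuda", dtype=torch.bfloat16):
    out = te.distributed.checkpoint(layer, inp)

out.float().sum().backward()
```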
# Description
The default `find_package` call searches for all MPI components. In the CMakeLists.txt I modified, you are linking only against the `MPI::MPI_CXX` target, so it is unnecessary to find the other C...
Hi, we are looking into training some transformer models with FP8 and we see a lot of overhead on the CPU side when te.Linear layers are scheduled in the forward...
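For context, a minimal sketch of the kind of setup where this shows up, assuming a stack of `te.Linear` layers run under `fp8_autocast` (layer count and sizes are placeholders): each layer's forward is scheduled from Python, so with many small layers the CPU-side launch cost can outweigh the GPU kernel time.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)
layers = torch.nn.ModuleList(
    [te.Linear(1024, 1024, bias=True) for _ in range(24)]
).cuda()
x = torch.randn(2048, 1024, device="cuda")

# Every layer launch goes through Python and the TE dispatch path,
# which is where the CPU-side scheduling overhead accumulates.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = x
    for layer in layers:
        out = layer(out)
```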
I am trying to install TransformerEngine using the following:
`pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable`
and am facing the following error:
```
RuntimeError: Error when running CMake: Command '['/tmp/pip-req-build-wpw9pxi1/.eggs/cmake-3.28.3-py3.11-linux-x86_64.egg/cmake/data/bin/cmake', '-S', '/tmp/pip-req-build-wpw9pxi1/transformer_engine', '-B', '/tmp/tmps_krasnv', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX=/tmp/pip-req-build-wpw9pxi1/build/lib.linux-x86_64-cpython-311', '-Dpybind11_DIR=/home/shabs/anaconda3/envs/NeMo/lib/python3.11/site-packages/pybind11/share/cmake/pybind11']'...
```
Hi, we are testing our new Hopper machines (H800/H100) and trying to use FP8 for training for the first time, but we are having trouble installing `TransformerEngine`. It reports `RuntimeError:...
**Summary:** When using context parallelism, we've observed that adopting fp32 accumulation for attention operations in both the forward and backward passes significantly improves numerical accuracy. This approach aligns with practices...
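To illustrate the numeric idea only (this is not the PR's implementation), a plain PyTorch sketch of attention where the score and value matmuls are accumulated in fp32 and the result is cast back to the input dtype:

```python
import torch

def attention_fp32_accum(q, k, v):
    """q, k, v: [batch, heads, seq, head_dim] in fp16/bf16."""
    scale = q.shape[-1] ** -0.5
    # Upcast before the matmuls so the scores, the softmax, and the
    # probability-value product all accumulate in fp32.
    scores = torch.matmul(q.float(), k.float().transpose(-2, -1)) * scale
    probs = torch.softmax(scores, dim=-1)
    out = torch.matmul(probs, v.float())
    return out.to(q.dtype)
```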
Is there currently a way to use MLP without applying the LayerNorm? What would be the best way to implement this? Thanks!
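One way to get this today, as a sketch: compose two `te.Linear` layers with an activation in between instead of using the fused `te.LayerNormMLP`, accepting the loss of the LayerNorm+GEMM fusion. The sizes and class name below are placeholders.

```python
import torch
import transformer_engine.pytorch as te

class MLPWithoutLayerNorm(torch.nn.Module):
    """Two te.Linear layers with a GELU in between, no leading LayerNorm."""

    def __init__(self, hidden_size: int, ffn_hidden_size: int):
        super().__init__()
        self.fc1 = te.Linear(hidden_size, ffn_hidden_size, bias=True)
        self.fc2 = te.Linear(ffn_hidden_size, hidden_size, bias=True)

    def forward(self, x):
        return self.fc2(torch.nn.functional.gelu(self.fc1(x)))

# Example: mlp = MLPWithoutLayerNorm(1024, 4096).cuda()
```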
This adds support for multi-node NVLink architectures. In addition, it includes changes to make the CE deadlock checker configurable at runtime.
Is the `interval` attribute of [DelayedScaling](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/api/common.html#transformer_engine.common.recipe.DelayedScaling) not used in PyTorch within the current TransformerEngine? In other words, does the value of `DelayedScaling.interval` affect the computation frequency of the scaling factor...
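For reference, a construction sketch assuming a TE version whose `DelayedScaling` constructor still accepts `interval` (sizes are placeholders); the open question is whether this value changes how often the scaling factors are recomputed during training:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(
    margin=0,
    interval=16,               # the attribute in question
    fp8_format=recipe.Format.HYBRID,
    amax_history_len=16,
    amax_compute_algo="max",
)

layer = te.Linear(256, 256).cuda()
x = torch.randn(32, 256, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
```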