TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
This PR modifies `te.distributed.checkpoint(...)` to preserve the `torch.amp.autocast(...)` context from the forward pass during the recompute phase. Reported in #787.
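A minimal sketch of the behavior this change targets, assuming a `te.Linear` layer as the checkpointed module and bf16 autocast (module and shapes are placeholders, not from the PR itself): the recompute pass triggered during backward should observe the same `torch.amp.autocast` context that was active during the original forward.

```python
import torch
import transformer_engine.pytorch as te

# Stand-in module and shapes for illustration only.
layer = te.Linear(1024, 1024).cuda()
inp = torch.randn(8, 1024, device="cuda", requires_grad=True)

# Forward runs under bf16 autocast; with this change the recompute
# performed during backward should see the same autocast context.
with torch.amp.autocast("cuda", dtype=torch.bfloat16):
    out = te.distributed.checkpoint(layer, inp)

out.float().sum().backward()
```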
# Description
The default `find_package` call searches for all MPI components. In the CMakeLists.txt I modified, you are linking only against the `MPI::MPI_CXX` target, so it is unnecessary to find the other C...
Hi, we are looking into training some transformer models with FP8 and we see a lot of overhead on the CPU side when te.Linear layers are scheduled in the forward...
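For context, a minimal sketch of the kind of setup where this shows up, assuming a stack of `te.Linear` layers run under `fp8_autocast` (layer count and sizes are placeholders): each layer's forward is scheduled from Python, so with many small layers the CPU-side launch cost can outweigh the GPU kernel time.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)
layers = torch.nn.ModuleList(
    [te.Linear(1024, 1024, bias=True) for _ in range(24)]
).cuda()
x = torch.randn(2048, 1024, device="cuda")

# Every layer launch goes through Python and the TE dispatch path,
# which is where the CPU-side scheduling overhead accumulates.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = x
    for layer in layers:
        out = layer(out)
```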
I am trying to install TransformerEngine using the following:
`pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable`
and am facing the following error:
```
RuntimeError: Error when running CMake: Command '['/tmp/pip-req-build-wpw9pxi1/.eggs/cmake-3.28.3-py3.11-linux-x86_64.egg/cmake/data/bin/cmake', '-S', '/tmp/pip-req-build-wpw9pxi1/transformer_engine', '-B', '/tmp/tmps_krasnv', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX=/tmp/pip-req-build-wpw9pxi1/build/lib.linux-x86_64-cpython-311', '-Dpybind11_DIR=/home/shabs/anaconda3/envs/NeMo/lib/python3.11/site-packages/pybind11/share/cmake/pybind11']'...
```
Hi, we are testing our new Hopper machines (H800/H100) and trying to use FP8 for training for the first time, but we are having trouble installing `TransformerEngine`. It reports `RuntimeError:...
**Summary:** When using context parallelism, we've observed that adopting fp32 accumulation for attention operations in both the forward and backward passes significantly improves numerical accuracy. This approach aligns with practices...
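To illustrate the numeric idea only (this is not the PR's implementation), a plain PyTorch sketch of attention where the score and value matmuls are accumulated in fp32 and the result is cast back to the input dtype:

```python
import torch

def attention_fp32_accum(q, k, v):
    """q, k, v: [batch, heads, seq, head_dim] in fp16/bf16."""
    scale = q.shape[-1] ** -0.5
    # Upcast before the matmuls so the scores, the softmax, and the
    # probability-value product all accumulate in fp32.
    scores = torch.matmul(q.float(), k.float().transpose(-2, -1)) * scale
    probs = torch.softmax(scores, dim=-1)
    out = torch.matmul(probs, v.float())
    return out.to(q.dtype)
```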
Is there currently a way to use MLP without applying the LayerNorm? What would be the best way to implement this? Thanks!
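One way to get this today, as a sketch: compose two `te.Linear` layers with an activation in between instead of using the fused `te.LayerNormMLP`, accepting the loss of the LayerNorm+GEMM fusion. The sizes and class name below are placeholders.

```python
import torch
import transformer_engine.pytorch as te

class MLPWithoutLayerNorm(torch.nn.Module):
    """Two te.Linear layers with a GELU in between, no leading LayerNorm."""

    def __init__(self, hidden_size: int, ffn_hidden_size: int):
        super().__init__()
        self.fc1 = te.Linear(hidden_size, ffn_hidden_size, bias=True)
        self.fc2 = te.Linear(ffn_hidden_size, hidden_size, bias=True)

    def forward(self, x):
        return self.fc2(torch.nn.functional.gelu(self.fc1(x)))

# Example: mlp = MLPWithoutLayerNorm(1024, 4096).cuda()
```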
This adds support for multi-node NVLink architectures. In addition, it includes changes to make the CE deadlock checker configurable at runtime.
Is the `interval` attribute of [DelayedScaling](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/api/common.html#transformer_engine.common.recipe.DelayedScaling) not used in PyTorch within the current TransformerEngine? In other words, does the value of `DelayedScaling.interval` affect the computation frequency of the scaling factor...
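For reference, a construction sketch assuming a TE version whose `DelayedScaling` constructor still accepts `interval` (sizes are placeholders); the open question is whether this value changes how often the scaling factors are recomputed during training:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(
    margin=0,
    interval=16,               # the attribute in question
    fp8_format=recipe.Format.HYBRID,
    amax_history_len=16,
    amax_compute_algo="max",
)

layer = te.Linear(256, 256).cuda()
x = torch.randn(32, 256, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
```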