TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization i...

Results 414 TransformerEngine issues

**Describe the bug** When running examples/llama/train_llama3_8b_fp8.sh and building the transformer layers of GPTModel: RuntimeError: /TransformerEngine/transformer_engine/common/util/cuda_driver.cpp:42 in function get_symbol: Assertion failed: driver_result == cudaDriverEntryPointSuccess. Could not find CUDA driver entry point for...

bug

# Description Currently we have the following: 1. We have some small e2e model training tests for a small model on a few epochs as a sanity integration test. 2....

# Description Added UBnext fast allreduce kernels into the linear layer. Falls under symmetric_ar_type, with the new types being 'ubnext' and 'ubnext_add_rms'. # Details Added NVLS simple and low-latency (Lamport) allreduce kernels...

# Description Follow-up of PR #2219 to apply warmup initialization to the rest of the TE/JAX primitives. Motivation, identical to previous PR: > If CUDA modules are loaded during any...

# Description Pipeline-aware CPU offload ## Type of change - [x] New feature (non-breaking change which adds functionality) ## Changes Please list the changes introduced in this PR: *...

community-contribution

# Description This is a more memory efficient version for using symmetric memory all reduces. We use a pool of symmetric memory that we grow if we need it to...

community-contribution
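The grow-on-demand pool described in this PR can be illustrated generically. The sketch below is not Transformer Engine's implementation; `GrowablePool` and its geometric-growth policy are hypothetical stand-ins for any backing allocator (such as a symmetric-memory allocator) supplied as `alloc`:

```python
class GrowablePool:
    """Hypothetical sketch of a grow-on-demand buffer pool.

    A single buffer is reused across requests; when a request exceeds
    the current capacity, the pool reallocates with geometric growth
    so that total reallocation cost stays amortized.
    """

    def __init__(self, alloc, initial=1 << 20):
        self.alloc = alloc          # e.g. bytearray, or a device allocator
        self.size = initial
        self.buf = alloc(initial)

    def request(self, nbytes):
        if nbytes > self.size:
            # Grow geometrically, but at least to the requested size.
            new_size = max(nbytes, 2 * self.size)
            self.buf = self.alloc(new_size)
            self.size = new_size
        return self.buf


# Usage with a plain host-memory allocator:
pool = GrowablePool(bytearray, initial=16)
pool.request(8)      # fits; no growth
pool.request(100)    # triggers reallocation to >= 100 bytes
```

The trade-off mirrored here is the one the PR mentions: a pool avoids allocating fresh symmetric memory for every collective, at the cost of occasionally holding more memory than the current operation needs.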

This adds a short delay kernel to the split_overlap_rs function, which ensures that the GEMMs are properly ordered when run with CUDA graphs. # Description In some situations, such as...

community-contribution

# Issue When testing the Linear API provided by NVIDIA's Transformer Engine (with FP8 precision) on an L20 device, I found that it is significantly slower than PyTorch's built-in...

Hi all, if I wanted to do inference using the E5M2 format instead of E4M3, what recipe do I have to use? Or can I even use E5M2 for inference,...

question
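The trade-off behind this question is fixed by the formats themselves: E4M3 spends bits on mantissa (more precision, max finite value 448), while E5M2 spends them on exponent (more dynamic range, max finite value 57344), which is why E5M2 is typically reserved for gradients in training. The helper below (`max_normal` is an illustrative name, not a library function) computes these limits from the bit layout:

```python
def max_normal(exp_bits: int, man_bits: int, ieee_special: bool) -> float:
    """Largest finite value of a floating-point format.

    ieee_special=True: the top exponent code is reserved for inf/NaN
    (IEEE-style, as in E5M2). False: only the all-ones bit pattern is
    NaN, so the top exponent code still carries finite values
    (as in the E4M3 "FN" variant).
    """
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_special:
        max_exp = (2 ** exp_bits - 2) - bias       # top code is inf/NaN
        max_mant = 2.0 - 2.0 ** (-man_bits)        # all mantissa bits set
    else:
        max_exp = (2 ** exp_bits - 1) - bias       # top code still finite
        max_mant = 2.0 - 2.0 ** (1 - man_bits)     # all-ones pattern is NaN
    return max_mant * 2.0 ** max_exp


print(max_normal(4, 3, ieee_special=False))  # E4M3: 448.0
print(max_normal(5, 2, ieee_special=True))   # E5M2: 57344.0
```

Because inference activations and weights rarely need the extra range but do benefit from the extra mantissa bit, E4M3 is the usual choice for forward/inference passes.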

**Describe the bug** transformer-engine currently searches for system CUDA binaries (https://github.com/NVIDIA/TransformerEngine/blob/67fcc15255248a26be124de3854a47f84102f285/transformer_engine/common/__init__.py#L237). This conflicts with PyTorch, which uses the pip-installed CUDA Python packages (https://pypi.org/project/nvidia-cudnn-cu12/). **Steps/Code to reproduce bug**...

bug
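One direction the report suggests is preferring the pip-installed CUDA components (which wheels ship under `site-packages/nvidia/<pkg>/lib`) over system binaries. The sketch below is a hypothetical discovery helper, not Transformer Engine's actual lookup logic; `find_pip_cuda_roots` is an invented name:

```python
import pathlib
import site


def find_pip_cuda_roots():
    """Return candidate package roots for pip-installed CUDA components.

    Wheels such as nvidia-cudnn-cu12 install shared libraries under
    site-packages/nvidia/<package>/lib; this scans the known
    site-packages directories for such layouts.
    """
    roots = []
    search_dirs = list(site.getsitepackages())
    search_dirs.append(site.getusersitepackages())
    for sp in search_dirs:
        nv = pathlib.Path(sp) / "nvidia"
        if nv.is_dir():
            roots.extend(p for p in nv.iterdir() if (p / "lib").is_dir())
    return roots


# Returns an empty list when no nvidia-* wheels are installed,
# so a build script could fall back to a system CUDA search.
print(find_pip_cuda_roots())
```

A resolution along these lines would keep the library consistent with PyTorch's packaging, which the report notes already relies on the pip CUDA wheels.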