
PyTorch 2.2.0 NVFuser deprecation is incompatible with TransformerEngine.

Open sirutBuasai opened this issue 1 year ago • 3 comments

In the recent PyTorch 2.2.0 release, NVFuser was deprecated in TorchScript, and PyTorch now emits this warning. See this commit.
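To check whether a given environment is on PyTorch 2.2.0 or later, a version comparison along these lines may help. This is only a sketch: `version_ge` is a hypothetical helper (not part of PyTorch or TransformerEngine), and it assumes a `sort` that supports `-V`, as GNU coreutils does.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when version $1 >= version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
    # Drop any local build suffix such as "+cu121" before comparing.
    local have="${1%%+*}" want="${2%%+*}"
    # The smaller version sorts first; if that is $want, then $have >= $want.
    [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Query the installed PyTorch version if torch is importable; otherwise do nothing.
if torch_version="$(python -c 'import torch; print(torch.__version__)' 2>/dev/null)"; then
    if version_ge "$torch_version" "2.2.0"; then
        echo "PyTorch $torch_version: at or after 2.2.0, NVFuser deprecation applies"
    else
        echo "PyTorch $torch_version: predates the 2.2.0 deprecation"
    fi
fi
```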

We are running into test failures in TransformerEngine when running the following commands:

TE_VERSION="<DECLARE TE VERSION HERE>"
git clone --branch release_v$TE_VERSION https://github.com/NVIDIA/TransformerEngine.git
cd TransformerEngine/tests/pytorch

pip install pytest==6.2.5 onnxruntime==1.13.1 onnx
pytest -v -s test_sanity.py
PYTORCH_JIT=0 NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 pytest -v -s test_numerics.py
NVTE_TORCH_COMPILE=0 pytest -v -s test_onnx_export.py
pytest -v -s test_jit.py

The errors we're seeing:

E           Exit code: 1
E           
E           Stdout:
E           
E           FAILED test_onnx_export.py::test_export_transformer_layer[swiglu-True-False-precision1-False-True-padding-False]
E           FAILED test_onnx_export.py::test_export_transformer_layer[swiglu-True-False-precision2-False-False-no_mask-False]
E           FAILED test_onnx_export.py::test_export_transformer_layer[swiglu-True-False-precision2-False-True-padding-False]
E           FAILED test_onnx_export.py::test_export_transformer_layer[swiglu-True-True-precision0-False-False-no_mask-False]
E           FAILED test_onnx_export.py::test_export_transformer_layer[swiglu-True-True-precision0-False-True-padding-False]
E           FAILED test_onnx_export.py::test_export_transformer_layer[swiglu-True-True-precision1-False-False-no_mask-False]
E           FAILED test_onnx_export.py::test_export_transformer_layer[swiglu-True-True-precision1-False-True-padding-False]
E           FAILED test_onnx_export.py::test_export_transformer_layer[swiglu-True-True-precision2-False-False-no_mask-False]
E           FAILED test_onnx_export.py::test_export_transformer_layer[swiglu-True-True-precision2-False-True-padding-False]
E           ===== 306 failed, 11 passed, 428 skipped, 128 warnings in 88.03s (0:01:28) =====
E           
E           Stderr:
E           
E           [W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
E           [W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
E           [W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
E           [W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
E           [W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
E           [W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
E           [W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
E           [W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
E           [W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
E           [W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())

sirutBuasai avatar Feb 13 '24 23:02 sirutBuasai

Hi team, is there any plan to fix this? Without TransformerEngine working, it's hard to justify the price of H100s.

roywei avatar Apr 17 '24 17:04 roywei

@timmoon10 Could you take a look at this?

I don't believe the warning is the reason for the test failures (but we should fix it nonetheless). Deprecation of TorchScript in general is a bigger problem for the ONNX export, but I believe we included a workaround (WAR) for that a few months ago.
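One way to keep the failures and the warnings apart when reading a captured log is to summarize each separately. The following is a rough sketch, not an official TransformerEngine tool: `summarize_log` is a hypothetical helper, and it assumes the pytest output was saved to a file in the same format as the log in this issue.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize a captured pytest log like the one in this issue.
summarize_log() {
    local log="$1"
    # Count summary lines of the form "FAILED <test id>", with or without
    # the leading "E" that appears when the output is nested in another report.
    echo "failed: $(grep -Ec '^(E +)?FAILED ' "$log")"
    # Collapse repeated warnings into one line each, prefixed with a count.
    grep -o 'Warning: .*' "$log" | sort | uniq -c | sort -rn
}

# Usage (assuming the output was saved to pytest_output.log):
#   summarize_log pytest_output.log
```

The summary only shows that the warning text is identical on every line; it does not show why a test failed. For that, rerunning a single failing test ID with pytest's `--tb=long` option should print the full traceback.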

ptrendx avatar May 16 '24 18:05 ptrendx

PyTorch has decided to drop NVFuser support; see this PR #105789, which was later reverted by @DanilBaibak. I'm not sure whether they still plan to move forward, but it is still up for further discussion.