TransformerEngine
PyTorch 2.2.0 NVFuser deprecation is incompatible with TransformerEngine.
The recent PyTorch 2.2.0 release deprecated NVFuser in TorchScript, and it now emits a deprecation warning. See this commit.
We are running into test failures in TransformerEngine when running the following commands:
TE_VERSION="<DECLARE TE VERSION HERE>"
git clone --branch release_v$TE_VERSION https://github.com/NVIDIA/TransformerEngine.git
cd TransformerEngine/tests/pytorch
pip install pytest==6.2.5 onnxruntime==1.13.1 onnx
pytest -v -s test_sanity.py
PYTORCH_JIT=0 NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 pytest -v -s test_numerics.py
NVTE_TORCH_COMPILE=0 pytest -v -s test_onnx_export.py
pytest -v -s test_jit.py
The errors we're seeing:
E Exit code: 1
E
E Stdout:
E
E FAILED test_onnx_export.py::test_export_transformer_layer[swiglu-True-False-precision1-False-True-padding-False]
E FAILED test_onnx_export.py::test_export_transformer_layer[swiglu-True-False-precision2-False-False-no_mask-False]
E FAILED test_onnx_export.py::test_export_transformer_layer[swiglu-True-False-precision2-False-True-padding-False]
E FAILED test_onnx_export.py::test_export_transformer_layer[swiglu-True-True-precision0-False-False-no_mask-False]
E FAILED test_onnx_export.py::test_export_transformer_layer[swiglu-True-True-precision0-False-True-padding-False]
E FAILED test_onnx_export.py::test_export_transformer_layer[swiglu-True-True-precision1-False-False-no_mask-False]
E FAILED test_onnx_export.py::test_export_transformer_layer[swiglu-True-True-precision1-False-True-padding-False]
E FAILED test_onnx_export.py::test_export_transformer_layer[swiglu-True-True-precision2-False-False-no_mask-False]
E FAILED test_onnx_export.py::test_export_transformer_layer[swiglu-True-True-precision2-False-True-padding-False]
E ===== 306 failed, 11 passed, 428 skipped, 128 warnings in 88.03s (0:01:28) =====
E
E Stderr:
E
E [W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
E [W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
E [W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
E [W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
E [W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
E [W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
E [W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
E [W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
E [W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
E [W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
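When triaging a log like the one above, it can help to collapse the repeated stderr warnings and tally the failed test IDs separately, so that the warning noise does not hide the actual failures. The following is only a minimal sketch, assuming the captured output is available as a string; the helper name `summarize_log` is hypothetical and not part of TransformerEngine or pytest.

```python
import re
from collections import Counter


def summarize_log(log_text):
    """Split captured pytest output into failed test IDs, warning counts,
    and the totals from the final summary line.

    Hypothetical helper for illustration only.
    """
    failed = []
    warnings = Counter()
    summary = {}
    for raw in log_text.splitlines():
        # Drop the leading "E" that the outer harness prepends to each line.
        line = re.sub(r"^E(?:\s|$)", "", raw).strip()
        if line.startswith("FAILED "):
            failed.append(line[len("FAILED "):])
        elif line.startswith("[W "):
            # Identical warning lines are counted rather than repeated.
            warnings[line] += 1
        elif line.startswith("=") and " in " in line:
            # Final summary line, e.g. "===== 306 failed, 11 passed, ... ====="
            for count, kind in re.findall(
                r"(\d+) (failed|passed|skipped|warnings)", line
            ):
                summary[kind] = int(count)
    return failed, warnings, summary
```

Applied to the output above, this would report a single distinct warning message alongside the failure totals, which suggests looking at the individual test tracebacks rather than at the warning itself.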
Hi team, is there any plan to fix this? Without TransformerEngine working, it's hard to justify the price of H100s.
@timmoon10 Could you take a look at this?
I don't believe the warning is the reason for the test failures (but we should fix it nonetheless). The deprecation of TorchScript in general is a bigger problem for the ONNX export, but I believe we included a workaround (WAR) for that a few months ago.
PyTorch has decided to drop NVFuser support; see PR #105789, which was later reverted by @DanilBaibak. I'm not sure whether they still plan to move forward, but it is still on the table for further discussion.