🐛 [Bug] [NGC] L0 Dynamo Test on Thor

Open apbose opened this issue 3 months ago • 1 comments

Bug:

FAILED conversion/test_scalar_tensor_aten.py::TestScalarTensorConverter::test_scalar_tensor_float_1
FAILED conversion/test_index_aten.py::TestIndexConverter::test_index_zero_two_dim_ITensor_mask

TensorRT 10.13.3.9, PyTorch 2.10.0a0+b558c986e8

Error:

2025-10-11T19:58:31.844970Z 01O ------------------------------ Captured log call -------------------------------
2025-10-11T19:58:31.844990Z 01O WARNING  torch_tensorrt [TensorRT Conversion Context]:logging.py:24 Environment variable NVIDIA_TF32_OVERRIDE=0 but BuilderFlag::kTF32 is set. Disabling TF32.
2025-10-11T19:58:31.845010Z 01O WARNING  torch_tensorrt [TensorRT Conversion Context]:logging.py:24 Environment variable NVIDIA_TF32_OVERRIDE=0 but BuilderFlag::kTF32 is set. Disabling TF32.
2025-10-11T19:58:31.845030Z 01O ERROR    torch_tensorrt [TensorRT Conversion Context]:logging.py:22 Error Code: 9: Skipping tactic 0x00000000000003e8 due to exception cudaEventElapsedTime In executeAndTimeIters at optimizer/common/builderUtils.cpp:1026
2025-10-11T19:58:31.845060Z 01O ERROR    torch_tensorrt [TensorRT Conversion Context]:logging.py:22 Error Code: 9: Skipping tactic 0x0000000000000000 due to exception cudaEventElapsedTime In executeAndTimeIters at optimizer/common/builderUtils.cpp:1026
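
For triage, the captured log can be grouped by TensorRT error code. The snippet below is a minimal sketch, not part of torch_tensorrt: `summarize_trt_errors` and `SAMPLE_LOG` are hypothetical names, and the regex only assumes the two spellings seen in this issue (`Error Code: 9:` and `Error Code 10:`).

```python
import re
from collections import Counter

# Accepts both "Error Code: 9: ..." and "Error Code 10: ..." (assumed from the logs in this issue).
ERROR_RE = re.compile(r"Error Code:? (\d+): (.*)")

# Three lines copied from the captured log above: one WARNING and two ERROR entries.
SAMPLE_LOG = """\
2025-10-11T19:58:31.845010Z 01O WARNING  torch_tensorrt [TensorRT Conversion Context]:logging.py:24 Environment variable NVIDIA_TF32_OVERRIDE=0 but BuilderFlag::kTF32 is set. Disabling TF32.
2025-10-11T19:58:31.845030Z 01O ERROR    torch_tensorrt [TensorRT Conversion Context]:logging.py:22 Error Code: 9: Skipping tactic 0x00000000000003e8 due to exception cudaEventElapsedTime In executeAndTimeIters at optimizer/common/builderUtils.cpp:1026
2025-10-11T19:58:31.845060Z 01O ERROR    torch_tensorrt [TensorRT Conversion Context]:logging.py:22 Error Code: 9: Skipping tactic 0x0000000000000000 due to exception cudaEventElapsedTime In executeAndTimeIters at optimizer/common/builderUtils.cpp:1026
"""


def summarize_trt_errors(log_text):
    """Count ERROR lines by (error code, message with hex tactic IDs masked)."""
    counts = Counter()
    for line in log_text.splitlines():
        if " ERROR " not in line:
            continue  # skip WARNING and other non-error lines
        match = ERROR_RE.search(line)
        if match is None:
            continue  # ERROR line without a recognizable error code
        code = int(match.group(1))
        # Mask tactic IDs so the same failure on different tactics is grouped together.
        message = re.sub(r"0x[0-9a-fA-F]+", "0x<tactic>", match.group(2))
        counts[(code, message)] += 1
    return counts


if __name__ == "__main__":
    for (code, message), n in summarize_trt_errors(SAMPLE_LOG).items():
        print(f"code={code} count={n} {message}")
```

On the sample lines, both tactic skips collapse into a single Error Code 9 group, which makes it easier to see that every tactic fails at the same `cudaEventElapsedTime` call.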

Oct 14 '25 18:10 apbose

Some of the test failures above are due to the unsupported nonzero case on Thor. Others fail with the following error:

2025-10-30T04:16:27.150332Z 01O ERROR torch_tensorrt [TensorRT Conversion Context]:logging.py:22 IBuilder::buildSerializedNetwork: Error Code 10: Internal Error (Could not find any implementation for node [ShapeHostToDeviceCopy 0]. In computeCosts at optimizer/common/tactic/optimizer.cpp:4115)

For example, test_full_aten.py fails in the static case, even though the graph does not contain a full operation:

graph():
    %x : [num_users=0] = placeholder[target=x]
    %_tensor_constant0 : [num_users=1] = get_attr[target=_tensor_constant0]
    return _tensor_constant0

This looks like incomplete CUDA context initialization during TRT tactic selection. Following up with the TRT team.

Nov 07 '25 21:11 apbose