Internal Error (/lightglue/ArgMax)

Open chenscottus opened this issue 1 year ago • 11 comments

When I run the model with TensorRT 8.6.2 and CUDA 12.2 on a Jetson device, I get a lot of errors (Ubuntu 22.04, JetPack 6.0 R36.2, https://developer.nvidia.com/embedded/jetpack#collapseAllJetson):

  [E] [TRT] 7: [shapeMachine.cpp::nvinfer1::rt::ShapeMachineRoutine::executeContinuation::864] Error Code 7: Internal Error (/lightglue/ArgMax: length of reduction axis ((MIN 2048 (# 0 (VALUE /extractor_1/NonZero_1[size])))) is smaller than K (1) Condition '==' violated: 0 != 1. Instruction: CHECK_EQUAL 0 1.)

There is no such issue when I run the model with TensorRT 8.6.1.6 on Windows.

The model is https://github.com/fabio-sim/LightGlue-ONNX/releases/download/v0.1.3/superpoint_2048_lightglue_end2end.onnx from https://github.com/fabio-sim/LightGlue-ONNX/releases/tag/v0.1.3

Thanks!

-Scott

chenscottus avatar Feb 29 '24 18:02 chenscottus

I can see it has dynamic-shape inputs. What are the intended input shapes? Could you please provide a trtexec/polygraphy command that reproduces the error? Thanks!

zerollzeng avatar Mar 04 '24 14:03 zerollzeng

/usr/bin/trtexec --workspace=40960 --onnx=/home/dev/projects/models/superpoint_2048_lightglue_end2end_tensorrt.onnx --saveEngine=/home/dev/projects/models/superpoint_2048_lightglue_end2end_tensorrt_1x1x3840x640_1x1x3840x128.fp32.trt8.v8.6.2.3.sm87.engine --minShapes=image0:1x1x3840x640,image1:1x1x3840x128 --optShapes=image0:1x1x3840x640,image1:1x1x3840x128 --maxShapes=image0:1x1x3840x640,image1:1x1x3840x128

chenscottus avatar Mar 04 '24 17:03 chenscottus

Could you please provide a log with --verbose from the Jetson? I don't have a Jetson on hand.

zerollzeng avatar Mar 08 '24 09:03 zerollzeng

BTW, does it work with a static shape, e.g. specifying only --optShapes?

zerollzeng avatar Mar 08 '24 09:03 zerollzeng

minShapes, optShapes and maxShapes are all the same.

chenscottus avatar Mar 08 '24 17:03 chenscottus

With --verbose set, this is the error message: [E] [TRT] 7: [shapeMachine.cpp::executeContinuation::864] Error Code 7: Internal Error (/lightglue/ArgMax_1: length of reduction axis ((MIN 2048 (# 0 (VALUE /extractor/NonZero_1[size])))) is smaller than K (1) Condition '==' violated: 0 != 1. Instruction: CHECK_EQUAL 0 1.)
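One plausible reading of this message, offered as an interpretation rather than something confirmed in the thread: the length of the axis that ArgMax reduces over is derived from the size of a NonZero output (capped at 2048), and a runtime shape check compared 0 against 1. The sketch below only pulls those fields out of the log line with a regular expression. The pattern is written against this one message and is an assumption about its layout, not a documented TensorRT log format.

```python
import re

ERROR = (
    "[E] [TRT] 7: [shapeMachine.cpp::executeContinuation::864] Error Code 7: "
    "Internal Error (/lightglue/ArgMax_1: length of reduction axis "
    "((MIN 2048 (# 0 (VALUE /extractor/NonZero_1[size])))) is smaller than K (1) "
    "Condition '==' violated: 0 != 1. Instruction: CHECK_EQUAL 0 1.)"
)

# Assumed layout, inferred from the single message quoted in this thread.
PATTERN = re.compile(
    r"Internal Error \((?P<layer>\S+): .*?"
    r"\(VALUE (?P<source>\S+?)\[size\]\).*?"
    r"smaller than K \((?P<k>\d+)\).*?"
    r"violated: (?P<lhs>\d+) != (?P<rhs>\d+)"
)


def parse_shape_error(line):
    """Extract the failing layer, the size source, K, and the compared values."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "layer": m.group("layer"),
        "source": m.group("source"),
        "k": int(m.group("k")),
        "compared": (int(m.group("lhs")), int(m.group("rhs"))),
    }


print(parse_shape_error(ERROR))
```

If that reading is right, an image pair in which the extractor finds no keypoints would be one way to end up with an empty reduction axis, but the thread does not establish the cause.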

chenscottus avatar Mar 08 '24 18:03 chenscottus

&&&& PASSED TensorRT.trtexec [TensorRT v100000] # trtexec --onnx=superpoint_2048_lightglue_end2end.onnx --optShapes=image0:1x1x3840x640,image1:1x1x3840x128

Looks like this is fixed in 10.0, but I don't know when the JetPack release with TRT 10 will come out. The x86 version should be released soon.

zerollzeng avatar Mar 10 '24 08:03 zerollzeng

BTW, I saw many warnings like:

[03/10/2024-08:01:36] [W] [TRT] /lightglue/self_attn.8/Reshape_14: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[03/10/2024-08:01:36] [W] [TRT] /lightglue/self_attn.8/Reshape_18: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 0 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[03/10/2024-08:01:36] [W] [TRT] /lightglue/self_attn.8/Reshape_15: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 2 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[03/10/2024-08:01:36] [W] [TRT] /lightglue/self_attn.8/Reshape_19: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 2 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.
[03/10/2024-08:01:36] [W] [TRT] /lightglue/log_assignment.8/Reshape_7: IShuffleLayer with zeroIsPlaceHolder=true has reshape dimension at position 2 that might or might not be zero. TensorRT resolves it at runtime, but this may cause excessive memory consumption and is usually a sign of a bug in the network.

zerollzeng avatar Mar 10 '24 08:03 zerollzeng

I built the engines with the static shape you mentioned, both on Windows and on Jetson Orin with JetPack 6.0 DP, using TensorRT 8.6. Both engine builds succeed. The build logs are attached. On Orin I still get the same error message, while on Windows I do not. TensorRT-8.6.1.6.windows.txt TensorRT-8.6.2.3.jetson_orin.dp6.txt

chenscottus avatar Mar 12 '24 22:03 chenscottus

The ONNX Reshape operator has an attribute “allowzero” that turns off “zero as placeholder”. Try setting it, then check whether the warning persists.
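To illustrate what the attribute changes: with `allowzero=0` (the default), a 0 in the target shape copies the corresponding input dimension, while with `allowzero=1` (available from opset 14) a 0 means a dimension of size zero. The following is a pure-Python sketch of that shape rule only. `resolve_reshape` is a hypothetical helper written for illustration, not an ONNX or TensorRT API, and the thread does not confirm that setting the attribute removes the warning.

```python
from math import prod


def resolve_reshape(input_shape, target, allowzero=0):
    """Resolve an ONNX-style Reshape target shape (illustrative helper only)."""
    out = []
    for i, d in enumerate(target):
        if d == 0 and not allowzero:
            # Zero as placeholder: copy the input dimension at this position.
            if i >= len(input_shape):
                raise ValueError("0 placeholder has no matching input dimension")
            out.append(input_shape[i])
        else:
            out.append(d)

    if out.count(-1) > 1:
        raise ValueError("at most one dimension may be -1")
    if -1 in out:
        if allowzero and 0 in out:
            raise ValueError("with allowzero=1, 0 and -1 cannot be combined")
        known = prod(d for d in out if d != -1)
        total = prod(input_shape)
        if known == 0 or total % known != 0:
            raise ValueError("cannot infer the -1 dimension")
        out[out.index(-1)] = total // known

    if prod(out) != prod(input_shape):
        raise ValueError("element count mismatch")
    return tuple(out)


# Placeholder semantics: the leading 0 copies input dimension 2.
print(resolve_reshape((2, 3, 4), (0, -1)))
# Explicit-zero semantics: the leading 0 is a real empty dimension.
print(resolve_reshape((0, 3, 4), (0, 12), allowzero=1))
```

With `allowzero=1`, reshaping a non-empty tensor to a shape containing 0 raises an error in this sketch, because the element counts no longer match. That is the ambiguity the warning points at: a 0 in the shape might be either a placeholder or a genuinely empty dimension.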

lix19937 avatar Mar 15 '24 10:03 lix19937

@chenscottus have you tried building on Jetson Nano? If so, do you have any tips on how to do it?

aiotko avatar Feb 27 '25 19:02 aiotko