djl
djl fails with dynamic input shapes
Description
Using the same ONNX model, I converted it to a TensorRT engine with trtexec. With a fixed input shape, DJL runs fine:
TensorRT-8.4.1.5/bin/trtexec --onnx=models/model.onnx --shapes=input_ids:1x4 --fp16 --saveEngine=model8415-1*4.trt
output:
DEBUG [main] 2023-01-05 00:47:05 Registering EngineProvider: TensorRT
DEBUG [main] 2023-01-05 00:47:05 Registering EngineProvider: TensorFlow
DEBUG [main] 2023-01-05 00:47:05 Registering EngineProvider: MXNet
DEBUG [main] 2023-01-05 00:47:05 Found default engine: MXNet
DEBUG [main] 2023-01-05 00:47:05 Loading TensorRT JNI library from: /root/.djl.ai/tensorrt/8.4.1-0.19.0-linux-x86_64/libdjl_trt.so
DEBUG [main] 2023-01-05 00:47:05 Scanning models in repo: class ai.djl.repository.SimpleRepository, file:/tensorrt/model8415-1*4.trt
DEBUG [main] 2023-01-05 00:47:05 Loading model with Criteria:
Application: UNDEFINED
Input: class djl.input.TensorRTInput
Output: class djl.output.TensorRTOutput
Engine: TensorRT
ModelZoo: ai.djl.localmodelzoo
DEBUG [main] 2023-01-05 00:47:05 Searching model in specified model zoo: ai.djl.localmodelzoo
WARN [main] 2023-01-05 00:47:05 Simple repository pointing to a non-archive file.
DEBUG [main] 2023-01-05 00:47:05 Checking ModelLoader: ai.djl.localmodelzoo:model8415-1*4.trt UNDEFINED [
ai.djl.localmodelzoo/model8415-1*4.trt/model8415-1*4.trt {}
]
DEBUG [main] 2023-01-05 00:47:05 Preparing artifact: file:/tensorrt/model8415-1*4.trt, ai.djl.localmodelzoo/model8415-1*4.trt/model8415-1*4.trt {}
DEBUG [main] 2023-01-05 00:47:05 Skip prepare for local repository.
Loading: 100% |████████████████████████████████████████|
DEBUG [main] 2023-01-05 00:47:05 Using cache dir: /root/.djl.ai/mxnet/1.9.1-cu114mkl-linux-x86_64
DEBUG [main] 2023-01-05 00:47:05 Loading mxnet library from: /root/.djl.ai/mxnet/1.9.1-cu114mkl-linux-x86_64/libmxnet.so
DEBUG [main] 2023-01-05 00:47:07 Using cache dir: /root/.djl.ai/tensorflow
DEBUG [main] 2023-01-05 00:47:07 Loading TensorFlow library from: /root/.djl.ai/tensorflow/2.7.4-cu114-linux-x86_64/libjnitensorflow.so
DEBUG [main] 2023-01-05 00:47:07 Loading TensorRT UFF model /tensorrt/model8415-1*4.trt with options:
[TRT] INFO: [MemUsageChange] Init CUDA: CPU +273, GPU +0, now: CPU 1200, GPU 491 (MiB)
[TRT] INFO: Loaded engine size: 681 MiB
[TRT] INFO: [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +680, now: CPU 0, GPU 680 (MiB)
[TRT] INFO: [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +2, now: CPU 0, GPU 682 (MiB)
DEBUG [main] 2023-01-05 00:47:11 Model information:
DEBUG [main] 2023-01-05 00:47:11 input_0[input_ids]: int32, (1, 4)
DEBUG [main] 2023-01-05 00:47:11 output_0[output]: float32, (1, 4, 51200)
load model success
result size: 1
result: output: (4, 51200) cpu() float32
[ Exceed max print size ]
[[F@f237ae7, [F@42edde25, [F@6fe5da76, [F@77d95e5a]
With a dynamic input shape, the run crashes with the message "Cuda failure: 2" and aborts, although no core dump file is produced:
TensorRT-8.4.1.5/bin/trtexec --onnx=models/model.onnx --minShapes=input_ids:1x1 --maxShapes=input_ids:1x16 --optShapes=input_ids:1x4 --workspace=3072 --fp16 --saveEngine=model8415-dynamic.trt
output:
DEBUG [main] 2023-01-05 02:58:57 Scanning models in repo: class ai.djl.repository.SimpleRepository, file:/tensorrt/model8415-dynamic.trt
DEBUG [main] 2023-01-05 02:58:57 Loading model with Criteria:
Application: UNDEFINED
Input: class djl.input.TensorRTInput
Output: class djl.output.TensorRTOutput
Engine: TensorRT
ModelZoo: ai.djl.localmodelzoo
DEBUG [main] 2023-01-05 02:58:57 Searching model in specified model zoo: ai.djl.localmodelzoo
DEBUG [main] 2023-01-05 02:58:57 Registering EngineProvider: TensorRT
DEBUG [main] 2023-01-05 02:58:57 Registering EngineProvider: TensorFlow
DEBUG [main] 2023-01-05 02:58:57 Registering EngineProvider: MXNet
DEBUG [main] 2023-01-05 02:58:57 Found default engine: MXNet
WARN [main] 2023-01-05 02:58:57 Simple repository pointing to a non-archive file.
DEBUG [main] 2023-01-05 02:58:57 Checking ModelLoader: ai.djl.localmodelzoo:model8415-dynamic.trt UNDEFINED [
ai.djl.localmodelzoo/model8415-dynamic.trt/model8415-dynamic.trt {}
]
DEBUG [main] 2023-01-05 02:58:57 Preparing artifact: file:/tensorrt/model8415-dynamic.trt, ai.djl.localmodelzoo/model8415-dynamic.trt/model8415-dynamic.trt {}
DEBUG [main] 2023-01-05 02:58:57 Skip prepare for local repository.
Loading: 100% |████████████████████████████████████████|
DEBUG [main] 2023-01-05 02:58:58 Loading TensorRT JNI library from: /root/.djl.ai/tensorrt/8.4.1-0.19.0-linux-x86_64/libdjl_trt.so
DEBUG [main] 2023-01-05 02:58:58 Using cache dir: /root/.djl.ai/mxnet/1.9.1-cu114mkl-linux-x86_64
DEBUG [main] 2023-01-05 02:58:58 Loading mxnet library from: /root/.djl.ai/mxnet/1.9.1-cu114mkl-linux-x86_64/libmxnet.so
DEBUG [main] 2023-01-05 02:58:59 Using cache dir: /root/.djl.ai/tensorflow
DEBUG [main] 2023-01-05 02:58:59 Loading TensorFlow library from: /root/.djl.ai/tensorflow/2.7.4-cu114-linux-x86_64/libjnitensorflow.so
DEBUG [main] 2023-01-05 02:58:59 Loading TensorRT UFF model /tensorrt/model8415-dynamic.trt with options:
[TRT] INFO: [MemUsageChange] Init CUDA: CPU +273, GPU +0, now: CPU 1538, GPU 491 (MiB)
[TRT] INFO: Loaded engine size: 1006 MiB
[TRT] INFO: [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1001, now: CPU 0, GPU 1001 (MiB)
[TRT] INFO: [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +8, now: CPU 0, GPU 1009 (MiB)
Cuda failure: 2
Aborted (core dumped)
It is worth noting that model8415-dynamic.trt works fine when used from the TensorRT Python API.
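For comparison, this is a minimal sketch of how the dynamic-shape engine is typically driven from the TensorRT Python API (the engine file name and the input/output names and shapes are taken from the logs above; the rest is an illustrative assumption, requires a GPU plus the `tensorrt` and `pycuda` packages, and is not the reporter's actual demo code):

```python
import numpy as np
import tensorrt as trt
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context
import pycuda.driver as cuda

logger = trt.Logger(trt.Logger.INFO)
with open("model8415-dynamic.trt", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# For a dynamic-shape engine, the concrete input shape must be set
# (within minShapes..maxShapes) before buffers are sized and executed.
input_ids = np.array([[1, 2, 3, 4]], dtype=np.int32)
context.set_binding_shape(0, input_ids.shape)

# Output shape is only resolved after the input shape is set.
out_shape = tuple(context.get_binding_shape(1))
output = np.empty(out_shape, dtype=np.float32)

d_in = cuda.mem_alloc(input_ids.nbytes)
d_out = cuda.mem_alloc(output.nbytes)
cuda.memcpy_htod(d_in, input_ids)
context.execute_v2([int(d_in), int(d_out)])
cuda.memcpy_dtoh(output, d_out)
print(output.shape)
```

The key difference from the fixed-shape case is the explicit `set_binding_shape` call before allocation and execution; if the Java/JNI path sizes its device buffers from the engine's (dynamic, i.e. -1) dimensions instead, a bad allocation would be consistent with the crash seen here.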
Environment Info
CUDA: 11.4, TensorRT: 8.4.1.5, OS: Ubuntu 18.04
Hi, thanks for bringing up this issue. I see that in the second case the reported error is Cuda failure: 2. This maps to the insufficient-memory error cudaErrorMemoryAllocation (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html). Have you tried increasing the workspace size with trtexec (or alternatively memPoolSize, since the workspace flag has been deprecated)? Does the GPU have sufficient memory (I am assuming yes, since you indicate this works via Python)?
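For reference, the number in "Cuda failure: N" is the raw cudaError_t value; a small lookup over the first few codes, taken from the CUDA Runtime API documentation (an illustrative helper, not part of DJL):

```python
# Subset of cudaError_t codes from the CUDA Runtime API documentation.
CUDA_ERRORS = {
    0: "cudaSuccess",
    1: "cudaErrorInvalidValue",
    2: "cudaErrorMemoryAllocation",
    3: "cudaErrorInitializationError",
}


def describe_cuda_failure(code: int) -> str:
    """Return the symbolic name for a CUDA runtime error code."""
    return CUDA_ERRORS.get(code, f"unknown error ({code})")


print(describe_cuda_failure(2))  # cudaErrorMemoryAllocation
```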
If the workspace/memPoolSize increase doesn't solve this issue, can you provide the onnx model you are using so we can work on reproducing the issue?
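For example, the dynamic engine could be rebuilt with a larger memory pool using the newer flag (the 4096 MiB value is only an illustrative guess; adjust to what the GPU allows):

```shell
# --memPoolSize=workspace:<MiB> replaces the deprecated --workspace flag in trtexec 8.4+
TensorRT-8.4.1.5/bin/trtexec --onnx=models/model.onnx \
    --minShapes=input_ids:1x1 --optShapes=input_ids:1x4 --maxShapes=input_ids:1x16 \
    --memPoolSize=workspace:4096 --fp16 --saveEngine=model8415-dynamic.trt
```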
Hi siddvenk, thanks for the help. I cannot share this ONNX model for some reason, so I'll try to find an open-source model that reproduces the problem; please give me some time.
Hi siddvenk, the ONNX model and demo code have been sent to your email ([email protected]); please take a look and try to reproduce when you are free, thanks.