
CUDA error when inference with onnxruntime-gpu

Open yanyabo111 opened this issue 3 years ago • 6 comments

When I tried to run inference on the model with onnxruntime-gpu, a CUDA error occurred.

import onnxruntime  # at module level

def __init__(self, model_name="default"):
    checkpoint_path = '/root/tensor/nudenet/checkpoint/detector_v2_default_checkpoint.onnx'
    classes_path = '/root/tensor/nudenet/checkpoint/detector_v2_default_classes'

    # works with CPUExecutionProvider; fails with CUDAExecutionProvider
    self.detection_model = onnxruntime.InferenceSession(checkpoint_path, providers=["CUDAExecutionProvider"])
    self.classes = [c.strip() for c in open(classes_path).readlines() if c.strip()]

The error is

2021-03-07 06:53:12.871020963 [E:onnxruntime:, sequential_executor.cc:339 Execute] Non-zero status code returned while running GatherND node. Name:'filtered_detections/map/while/GatherNd_28' Status Message: CUDA error cudaErrorInvalidConfiguration:invalid configuration argument
2021-03-07 06:53:12.871079083 [E:onnxruntime:, sequential_executor.cc:339 Execute] Non-zero status code returned while running Loop node. Name:'generic_loop_Loop__492' Status Message: Non-zero status code returned while running GatherND node. Name:'filtered_detections/map/while/GatherNd_28' Status Message: CUDA error cudaErrorInvalidConfiguration:invalid configuration argument
Traceback (most recent call last):
  File "detector.py", line 115, in <module>
    print(m.detect("/root/tensor/image-quality-assessment/t1.jpg"))
  File "detector.py", line 90, in detect
    outputs = self.detection_model.run(
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 188, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Loop node. Name:'generic_loop_Loop__492' Status Message: Non-zero status code returned while running GatherND node. Name:'filtered_detections/map/while/GatherNd_28' Status Message: CUDA error cudaErrorInvalidConfiguration:invalid configuration argument

Versions of the following libraries:

  1. nudenet
  2. onnxruntime-gpu: 1.7
  3. CUDA 11.0.3 and cuDNN 8.0.2.4
  4. RTX 3090

When I run the model with onnxruntime on CPU, everything is fine. I also converted the ONNX model to a .pb file, and it runs under the tfserving_gpu Docker image.
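Until the graph itself is fixed, one possible stopgap (a sketch of my own; `retry_on` and `detect_with_cpu_fallback` are not part of nudenet) is to catch the runtime failure and redo the run on a CPU session, since the same graph executes fine there:

```python
def retry_on(error_types, primary, fallback):
    """Generic helper: run primary(); on any of error_types, run fallback()."""
    try:
        return primary()
    except error_types:
        return fallback()

def detect_with_cpu_fallback(model_path, feeds):
    import onnxruntime  # lazy import so retry_on() is usable without ORT

    def run(provider):
        sess = onnxruntime.InferenceSession(model_path, providers=[provider])
        return sess.run(None, feeds)

    # The Fail above is raised at run() time, so listing both providers at
    # session creation does not help; catch it and rebuild the session on CPU.
    Fail = onnxruntime.capi.onnxruntime_pybind11_state.Fail
    return retry_on((Fail,),
                    lambda: run("CUDAExecutionProvider"),
                    lambda: run("CPUExecutionProvider"))
```

This roughly doubles session-build cost on the failing path, but keeps inference working while the CUDA issue is open.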

Does the node configuration need to be changed, or is this a problem in onnxruntime-gpu?

yanyabo111 avatar Mar 07 '21 08:03 yanyabo111

@yanyabo111 I am able to reproduce the issue. I will try to figure it out when I get some free time. Meanwhile, you can fall back to a previous version of nudenet and use the TensorFlow models (those work with GPU).

bedapudi6788 avatar Mar 09 '21 05:03 bedapudi6788

@bedapudi6788 Really appreciate your hard work. Is there anything I can help with?

yanyabo111 avatar Apr 26 '21 12:04 yanyabo111

Having the same issue here. I also tried other versions of onnxruntime-gpu (1.4.0 to 1.7.0).

SiavashCS avatar Apr 28 '21 20:04 SiavashCS

FYI - I hit the same bug with the ONNX model in the releases, and was able to resolve it by re-converting the TensorFlow model (detector_v2_default_checkpoint_tf) to ONNX at opset 11. I pulled down the TensorFlow model, converted it using tf2onnx, and got no more exceptions.

The tf2onnx command I used after I downloaded the TF model was:

python -m tf2onnx.convert --saved-model c:\saved_model_dir --opset 11 --output saved_model.onnx

Hope that helps!
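A quick sanity check after conversion (a sketch; `default_opset` and `model_opset` are my own helper names, and the `onnx` package is assumed to be installed for the loader) is to confirm the re-exported file really targets opset 11:

```python
def default_opset(opset_imports):
    """Find the version of the default ONNX domain in (domain, version) pairs."""
    for domain, version in opset_imports:
        if domain in ("", "ai.onnx"):  # both spellings denote the default domain
            return version
    return None

def model_opset(path):
    import onnx  # lazy import so default_opset() works without the package
    model = onnx.load(path)
    return default_opset([(imp.domain, imp.version) for imp in model.opset_import])

# e.g. model_opset("saved_model.onnx") should report 11 after the command above
```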

mrjarhead avatar May 09 '21 00:05 mrjarhead

Worked :) thanks a lot.

SiavashCS avatar Jun 12 '21 13:06 SiavashCS

I met a similar problem when running inference on CUDA: FAIL : Non-zero status code returned while running TopK node. Name:'/model/TopK' Status Message: CUDA error cudaErrorInvalidConfiguration:invalid configuration argument. Can you help me with my problem? Thank you! @mrjarhead

Zalways avatar Jan 23 '24 06:01 Zalways