cuDLA-samples

cudla import external semaphore FAILED

Open WangFengtu1996 opened this issue 1 year ago • 11 comments

  • I can run the demo in hybrid mode successfully.
  • In standalone mode, it fails with the error: cudla import external semaphore FAILED
(base) orin@orin-root:~/workspace/cuDLA-samples$ make run USE_DLA_STANDALONE_MODE=1 -j
g++ -I /usr/local/cuda/include -I ./src/matx_reformat/ -I /usr/local/include/opencv4/ -I /usr/include/jsoncpp/ -I /usr/include --std=c++14 -Wno-deprecated-declarations -Wall -DUSE_DLA_STANDALONE_MODE -O2 -c -o build/validate_coco.o src/validate_coco.cpp
g++ -I /usr/local/cuda/include -I ./src/matx_reformat/ -I /usr/local/include/opencv4/ -I /usr/include/jsoncpp/ -I /usr/include --std=c++14 -Wno-deprecated-declarations -Wall -DUSE_DLA_STANDALONE_MODE -O2 -c -o build/cudla_context_standalone.o src/cudla_context_standalone.cpp
g++ --std=c++14 -Wno-deprecated-declarations -Wall -DUSE_DLA_STANDALONE_MODE -O2 -I /usr/local/cuda/include -I ./src/matx_reformat/ -I /usr/local/include/opencv4/ -I /usr/include/jsoncpp/ -I /usr/include  -o ./build/cudla_yolov5_app build/decode_nms.o build/validate_coco.o build/yolov5.o build/cudla_context_hybrid.o build/cudla_context_standalone.o -l cudla -L/usr/local/cuda/lib64 -l cuda -l cudart -l nvinfer -L /usr/local/lib/ -l opencv_objdetect -l opencv_highgui -l opencv_imgproc -l opencv_core -l opencv_imgcodecs -L ./src/matx_reformat/build/ -l matx_reformat -l jsoncpp -L /usr/lib/aarch64-linux-gnu/tegra -l nvscibuf -l nvscisync
././build/cudla_yolov5_app --engine ./data/loadable/yolov5.int8.int8hwc4in.fp16chw16out.standalone.bin --image ./data/images/image.jpg --backend cudla_int8
[standalone mode] create CUDLA device SUCCESS
[standalone mode] load cuDLA module from memory SUCCESS
[standalone mode] get number of input tensors SUCCESS
[standalone mode] numInputTensors = 1
[standalone mode] get number of output tensors SUCCESS
[standalone mode] numOutputTensors = 3
[standalone mode] get input tensor descriptor SUCCESS
[standalone mode] get output tensor descriptor SUCCESS
[standalone mode] Printing inputs tensor descriptor
[standalone mode] Printing output tensor descriptor
[standalone mode] open NvSci buffer module SUCCESS
[standalone mode] -------------------------------------------
[standalone mode] TENSOR NAME : images'
[standalone mode] size: 1806336
[standalone mode] dims: [1, 4, 672, 672]
[standalone mode] data fmt: 2
[standalone mode] data type: 4
[standalone mode] data category: 0
[standalone mode] pixel fmt: 12
[standalone mode] pixel mapping: 0
[standalone mode] stride[0]: 1
[standalone mode] stride[1]: 2688
[standalone mode] stride[2]: 0
[standalone mode] stride[3]: 0
[standalone mode] create NvSci buffer attr list SUCCESS
[standalone mode] set NvSci buffer attr list SUCCESS
[standalone mode] reconcile NvSciBuf attribute list SUCCESS
[standalone mode] alloc NvSciBuf Obj SUCCESS
[standalone mode] import memory to cudla SUCCESS
[standalone mode] import external memory to cuda SUCCESS
[standalone mode] map external memory to cuda buffer SUCCESS
[standalone mode] -------------------------------------------
[standalone mode] -------------------------------------------
[standalone mode] TENSOR NAME : s8'
[standalone mode] size: 3612672
[standalone mode] dims: [1, 255, 84, 84]
[standalone mode] data fmt: 3
[standalone mode] data type: 2
[standalone mode] data category: 2
[standalone mode] pixel fmt: 36
[standalone mode] pixel mapping: 0
[standalone mode] stride[0]: 2
[standalone mode] stride[1]: 2688
[standalone mode] stride[2]: 225792
[standalone mode] stride[3]: 0
[standalone mode] create NvSci buffer attr list SUCCESS
[standalone mode] set NvSci buffer attr list SUCCESS
[standalone mode] reconcile NvSciBuf attribute list SUCCESS
[standalone mode] alloc NvSciBuf Obj SUCCESS
[standalone mode] import memory to cudla SUCCESS
[standalone mode] import external memory to cuda SUCCESS
[standalone mode] map external memory to cuda buffer SUCCESS
[standalone mode] -------------------------------------------
[standalone mode] -------------------------------------------
[standalone mode] TENSOR NAME : s16'
[standalone mode] size: 903168
[standalone mode] dims: [1, 255, 42, 42]
[standalone mode] data fmt: 3
[standalone mode] data type: 2
[standalone mode] data category: 2
[standalone mode] pixel fmt: 36
[standalone mode] pixel mapping: 0
[standalone mode] stride[0]: 2
[standalone mode] stride[1]: 1344
[standalone mode] stride[2]: 56448
[standalone mode] stride[3]: 0
[standalone mode] create NvSci buffer attr list SUCCESS
[standalone mode] set NvSci buffer attr list SUCCESS
[standalone mode] reconcile NvSciBuf attribute list SUCCESS
[standalone mode] alloc NvSciBuf Obj SUCCESS
[standalone mode] import memory to cudla SUCCESS
[standalone mode] import external memory to cuda SUCCESS
[standalone mode] map external memory to cuda buffer SUCCESS
[standalone mode] -------------------------------------------
[standalone mode] -------------------------------------------
[standalone mode] TENSOR NAME : s32'
[standalone mode] size: 225792
[standalone mode] dims: [1, 255, 21, 21]
[standalone mode] data fmt: 3
[standalone mode] data type: 2
[standalone mode] data category: 2
[standalone mode] pixel fmt: 36
[standalone mode] pixel mapping: 0
[standalone mode] stride[0]: 2
[standalone mode] stride[1]: 672
[standalone mode] stride[2]: 14112
[standalone mode] stride[3]: 0
[standalone mode] create NvSci buffer attr list SUCCESS
[standalone mode] set NvSci buffer attr list SUCCESS
[standalone mode] reconcile NvSciBuf attribute list SUCCESS
[standalone mode] alloc NvSciBuf Obj SUCCESS
[standalone mode] import memory to cudla SUCCESS
[standalone mode] import external memory to cuda SUCCESS
[standalone mode] map external memory to cuda buffer SUCCESS
[standalone mode] -------------------------------------------
[standalone mode] create NvSci sync module SUCCESS
[standalone mode] create NvSci waiter attr list SUCCESS
[standalone mode] create NvSci signal attr list SUCCESS
[standalone mode] get NvSci waiter sync attributes SUCCESS
[standalone mode] cuda get NvSci signal list SUCCESS
[standalone mode] reconciled NvSci sync attr list SUCCESS
[standalone mode] allocate NvSci sync object SUCCESS
[standalone mode] cudla import external semaphore FAILED in src/cudla_context_standalone.cpp:312, CUDLA ERR: 13
make: *** [Makefile:80: run] Error 1
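
As a side check on the log above, the reported tensor sizes and strides are consistent with an INT8 HWC4 input and FP16 CHW16 outputs, which is what the loadable file name (int8hwc4in.fp16chw16out) suggests. The following Python sketch recomputes them. The helper names and the padding rule (channels rounded up to a multiple of the block size) are assumptions made for illustration and are not taken from the cuDLA sources.

```python
def round_up(value, multiple):
    """Round value up to the next multiple of `multiple`."""
    return -(-value // multiple) * multiple


def hwc4_layout(height, width, elem_bytes=1, channels_per_pixel=4):
    """Assumed HWC4 layout: every pixel stores 4 channels contiguously."""
    line_stride = width * channels_per_pixel * elem_bytes
    return {"size": height * line_stride, "line_stride": line_stride}


def chw16_layout(channels, height, width, elem_bytes=2, block=16):
    """Assumed CHW16 layout: channels are grouped into surfaces of 16."""
    line_stride = width * block * elem_bytes
    surface_stride = height * line_stride
    surfaces = round_up(channels, block) // block
    return {
        "size": surfaces * surface_stride,
        "line_stride": line_stride,
        "surface_stride": surface_stride,
    }


if __name__ == "__main__":
    # Dims are copied from the log: images, s8, s16, s32.
    print("images", hwc4_layout(672, 672))
    for name, (c, h, w) in {"s8": (255, 84, 84),
                            "s16": (255, 42, 42),
                            "s32": (255, 21, 21)}.items():
        print(name, chw16_layout(c, h, w))
```

Under these assumptions, the computed size, line stride, and surface stride correspond to the "size", "stride[1]", and "stride[2]" fields in the log, so buffer allocation looks consistent and the failure is more likely specific to the semaphore import step.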

WangFengtu1996 • Jan 22 '24 06:01

I can run both modes, but inference takes about 20 ms per image, which differs from the reported experimental results. What inference time do you get in hybrid mode? @WangFengtu1996

2yjia • Jan 22 '24 07:01

@2yjia I don't understand why standalone mode fails for me. My inference time is about 17 ms to 20 ms, and it gets shorter once warm-up has finished. My platform is an NVIDIA Jetson AGX Orin Developer Kit. Could you give me some guidance on running inference in standalone mode? Thanks.

WangFengtu1996 • Jan 22 '24 08:01

@2yjia I followed this issue: https://github.com/NVIDIA-AI-IOT/cuDLA-samples/issues/7. However, I ran into a new problem:

(py310) orin@orin-root:~/workspace/cuDLA-samples$ make validate_cudla_int8 USE_DLA_STANDALONE_MODE=1  USE_DETERMINISTIC_SEMAPHORE=1 -j
/usr/local/cuda/bin/nvcc -I /usr/local/cuda/include -I ./src/matx_reformat/ -I /usr/local/include/opencv4/ -I /usr/include/jsoncpp/ -I /usr/include -gencode arch=compute_87,code=sm_87 -c -o build/decode_nms.o src/decode_nms.cu
g++ -I /usr/local/cuda/include -I ./src/matx_reformat/ -I /usr/local/include/opencv4/ -I /usr/include/jsoncpp/ -I /usr/include --std=c++14 -Wno-deprecated-declarations -Wall -DUSE_DLA_STANDALONE_MODE -DUSE_DETERMINISTIC_SEMAPHORE -O2 -c -o build/validate_coco.o src/validate_coco.cpp
g++ -I /usr/local/cuda/include -I ./src/matx_reformat/ -I /usr/local/include/opencv4/ -I /usr/include/jsoncpp/ -I /usr/include --std=c++14 -Wno-deprecated-declarations -Wall -DUSE_DLA_STANDALONE_MODE -DUSE_DETERMINISTIC_SEMAPHORE -O2 -c -o build/yolov5.o src/yolov5.cpp
g++ -I /usr/local/cuda/include -I ./src/matx_reformat/ -I /usr/local/include/opencv4/ -I /usr/include/jsoncpp/ -I /usr/include --std=c++14 -Wno-deprecated-declarations -Wall -DUSE_DLA_STANDALONE_MODE -DUSE_DETERMINISTIC_SEMAPHORE -O2 -c -o build/cudla_context_hybrid.o src/cudla_context_hybrid.cpp
g++ -I /usr/local/cuda/include -I ./src/matx_reformat/ -I /usr/local/include/opencv4/ -I /usr/include/jsoncpp/ -I /usr/include --std=c++14 -Wno-deprecated-declarations -Wall -DUSE_DLA_STANDALONE_MODE -DUSE_DETERMINISTIC_SEMAPHORE -O2 -c -o build/cudla_context_standalone.o src/cudla_context_standalone.cpp
src/cudla_context_standalone.cpp: In member function ‘void cuDLAContextStandalone::initialize()’:
src/cudla_context_standalone.cpp:324:19: error: ‘NvSciSyncFenceUpdateFence’ was not declared in this scope; did you mean ‘NvSciSyncObjGenerateFence’?
  324 |     m_nvsci_err = NvSciSyncFenceUpdateFence(m_WaitEventContext.sync_obj, m_WaiterID, m_WaiterValue, m_WaitEventContext.nvsci_fence_ptr);
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~~
      |                   NvSciSyncObjGenerateFence
src/cudla_context_standalone.cpp:326:19: error: ‘NvSciSyncFenceExtractFence’ was not declared in this scope; did you mean ‘NvSciSyncIpcExportFence’?
  326 |     m_nvsci_err = NvSciSyncFenceExtractFence(m_WaitEventContext.nvsci_fence_ptr,&m_WaiterID,&m_WaiterValue);
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~~~
      |                   NvSciSyncIpcExportFence
src/cudla_context_standalone.cpp: In member function ‘int cuDLAContextStandalone::submitDLATask(cudaStream_t)’:
src/cudla_context_standalone.cpp:443:19: error: ‘NvSciSyncFenceExtractFence’ was not declared in this scope; did you mean ‘NvSciSyncIpcExportFence’?
  443 |     m_nvsci_err = NvSciSyncFenceExtractFence(m_WaitEventContext.nvsci_fence_ptr ,&m_WaiterID, &m_WaiterValue);
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~~~
      |                   NvSciSyncIpcExportFence
src/cudla_context_standalone.cpp:445:19: error: ‘NvSciSyncFenceUpdateFence’ was not declared in this scope; did you mean ‘NvSciSyncObjGenerateFence’?
  445 |     m_nvsci_err = NvSciSyncFenceUpdateFence(m_WaitEventContext.sync_obj,
      |                   ^~~~~~~~~~~~~~~~~~~~~~~~~
      |                   NvSciSyncObjGenerateFence
make: *** [Makefile:69: build/cudla_context_standalone.o] Error 1
make: *** Waiting for unfinished jobs....

WangFengtu1996 • Jan 22 '24 09:01

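@2yjia I tried to follow the repository README to fine-tune the model and export a new one. Did you get this workflow to work? I ran into a problem at the qat->ptq step: the scale information for the output tensors is missing.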

(py310) orin@orin-root:~/workspace/cuDLA-samples$ python export/qdq_translator/qdq_translator.py --input_onnx_models=yolov5_trimmed_qat.onnx --output_dir=data/model/ --infer_concat_scales --infer_mul_scales 
INFO:root:Parsing yolov5_trimmed_qat.onnx...
INFO:root:No tensor scales for /model.24/m.0/Conv's output tensor s8
INFO:root:No tensor scales for /model.24/m.1/Conv's output tensor s16
INFO:root:No tensor scales for /model.24/m.2/Conv's output tensor s32

WangFengtu1996 • Jan 22 '24 09:01

@2yjia Here is my device information. Are our setups the same?

(base) orin@orin-root:/usr/lib/aarch64-linux-gnu/tegra$ jetson_release
Software part of jetson-stats 4.2.4 - (c) 2024, Raffaello Bonghi
Model: Jetson AGX Orin Developer Kit - Jetpack 5.1.2 [L4T 35.4.1]
NV Power Mode[0]: MAXN
Serial Number: [XXX Show with: jetson_release -s XXX]
Hardware:
 - P-Number: p3701-0005
 - Module: NVIDIA Jetson AGX Orin (64GB ram)
Platform:
 - Distribution: Ubuntu 20.04 focal
 - Release: 5.10.120-tegra
jtop:
 - Version: 4.2.4
 - Service: Active
Libraries:
 - CUDA: 11.4.315
 - cuDNN: 8.6.0.166
 - TensorRT: 5.1.2
 - VPI: 2.3.9
 - Vulkan: 1.3.204
 - OpenCV: 4.6.0 - with CUDA: YES

WangFengtu1996 • Jan 22 '24 09:01

Software part of jetson-stats 4.2.4 - (c) 2024, Raffaello Bonghi
Model: Jetson AGX Orin - Jetpack 5.1 [L4T 35.2.1]
NV Power Mode[2]: MODE_30W
Serial Number: [XXX Show with: jetson_release -s XXX]
Hardware:
 - P-Number: p3701-0005
 - Module: NVIDIA Jetson AGX Orin (64GB ram)
Platform:
 - Distribution: Ubuntu 20.04 focal
 - Release: 5.10.104-tegra
jtop:
 - Version: 4.2.4
 - Service: Active
Libraries:
 - CUDA: 11.4.315
 - cuDNN: 8.6.0.166
 - TensorRT: 5.1
 - VPI: 2.2.4
 - Vulkan: 1.3.204
 - OpenCV: 4.5.4 - with CUDA: NO

@WangFengtu1996

2yjia • Jan 22 '24 09:01

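I have the same problem with the missing output tensor scales in qdq_translator.py. Running the program did generate a noqdq.onnx, but I run into problems when I use that ONNX file for inference deployment. I don't know how the author generated the fp16 and int8 ONNX models.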

2yjia • Jan 22 '24 10:01

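@2yjia Regarding my build problem with the undeclared NvSciSync functions, could you grep for these two functions in the directory where you extracted nvsci* and tell me the result? Many thanks.

# Enter the directory where nvsci_headers.tbz2 was extracted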
grep -nr "NvSciSyncFenceUpdateFence"

grep -nr "NvSciSyncObjGenerateFence"
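
The same check can be scripted so that it also covers NvSciSyncFenceExtractFence, the other function reported as undeclared in the build log. The following is a minimal Python sketch, assuming the headers are plain-text .h or .hpp files somewhere under the extraction directory. The function name find_symbols and the default root "." are hypothetical and chosen only for illustration.

```python
import re
import sys
from pathlib import Path


def find_symbols(root, symbols, suffixes=(".h", ".hpp")):
    """Return {symbol: [(path, line_number), ...]} for whole-word matches."""
    patterns = {s: re.compile(r"\b" + re.escape(s) + r"\b") for s in symbols}
    hits = {s: [] for s in symbols}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        # Replace undecodable bytes instead of failing on odd encodings.
        text = path.read_text(errors="replace")
        for line_no, line in enumerate(text.splitlines(), start=1):
            for symbol, pattern in patterns.items():
                if pattern.search(line):
                    hits[symbol].append((str(path), line_no))
    return hits


if __name__ == "__main__":
    # Hypothetical usage: pass the extraction directory as the first argument.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    wanted = [
        "NvSciSyncFenceUpdateFence",
        "NvSciSyncFenceExtractFence",
        "NvSciSyncObjGenerateFence",
    ]
    for symbol, where in find_symbols(root, wanted).items():
        print(symbol, "found" if where else "NOT FOUND", where[:3])
```

A symbol that is found in one header set but not in the other would suggest that the two installations ship different NvSciSync header versions. A match only shows that the name appears in a header, not that the linked library exports it.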

WangFengtu1996 • Jan 23 '24 01:01

I encountered the same problem (two images attached). Has it been solved?

ou525 • Jan 26 '24 06:01

Hi all, could you try this on JetPack 6.0 DP or later? Thanks!

mchi-zg • May 08 '24 14:05