Potential TensorRT Bug on WSL2 (8.4 GA)
Description
I tried building tkDNN on WSL2 with TensorRT 8.4 GA, cuDNN 8.4, and CUDA 11.7. The library and its tests compile without any issues, but when I run any test (test_yolo4tiny in this case), both the CUDNN vs TRT check and the TRT vs correct check fail. This doesn't seem to be an issue on native Linux (Ubuntu 20.04) or Windows 11. (I also copied the libraries from /usr/lib/wsl/lib to /usr/lib/x86_64-linux-gnu.)

If I build the library with TensorRT 8.2 GA, cuDNN 8.2, and CUDA 11.4, the issue above doesn't occur on WSL2, but I had to copy the libraries from /usr/lib/wsl/lib to /usr/lib/x86_64-linux-gnu for it to work properly; if those libraries aren't copied, I hit the same TRT vs correct and TRT vs CUDNN check failures.

Since the checks fail, there is almost no detection from any of the networks (I have attached their images below).

I have attached the logs of running test_yolo4tiny with both TensorRT 8.4 (fails the checks) and TensorRT 8.2 (passes the checks).

TL;DR: the library's checks fail on WSL2 when it is compiled with TensorRT 8.4, but the same checks pass on native Linux and Windows. On a side note, ever since TensorRT 7 I have always had to copy files from /usr/lib/wsl/lib to /usr/lib/x86_64-linux-gnu for the library to work properly.
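For reference, the copy workaround looks roughly like this (a sketch: the exact set of files that WSL2 maps into /usr/lib/wsl/lib depends on the Windows driver version):

```bash
# Make the WSL2-mapped NVIDIA libraries visible on the default linker path.
# Which files are present in /usr/lib/wsl/lib depends on the driver version.
sudo cp /usr/lib/wsl/lib/lib*.so* /usr/lib/x86_64-linux-gnu/
sudo ldconfig

# Alternative that avoids copying: put the WSL2 library directory first on
# the runtime search path instead.
export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH
```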
Environment
TensorRT Version: 8.4.1 / 8.2.5.1
NVIDIA GPU: RTX 3070 Laptop GPU
NVIDIA Driver Version: 516.40
CUDA Version: 11.7 / 11.4
CUDNN Version: 8.4 / 8.2
Operating System: Windows 11 (WSL2, Ubuntu 20.04)
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):
Relevant Files
https://github.com/ceccocats/tkDNN
Logs
TensorRT 8.4 + cuDNN 8.4 + CUDA 11.7
./test_yolo4tiny Not supported field: batch=64 Not supported field: subdivisions=1 Not supported field: momentum=0.9 Not supported field: decay=0.0005 Not supported field: angle=0 Not supported field: saturation = 1.5 Not supported field: exposure = 1.5 Not supported field: hue=.1 Not supported field: learning_rate=0.00261 Not supported field: burn_in=1000 Not supported field: max_batches = 500200 Not supported field: policy=steps Not supported field: steps=400000,450000 Not supported field: scales=.1,.1 New NETWORK (tkDNN v0.7, CUDNN v8.401) Reading weights: I=3 O=32 KERNEL=3x3x1 Reading weights: I=32 O=64 KERNEL=3x3x1 Reading weights: I=64 O=64 KERNEL=3x3x1 Reading weights: I=32 O=32 KERNEL=3x3x1 Reading weights: I=32 O=32 KERNEL=3x3x1 Reading weights: I=64 O=64 KERNEL=1x1x1 Reading weights: I=128 O=128 KERNEL=3x3x1 Reading weights: I=64 O=64 KERNEL=3x3x1 Reading weights: I=64 O=64 KERNEL=3x3x1 Reading weights: I=128 O=128 KERNEL=1x1x1 Reading weights: I=256 O=256 KERNEL=3x3x1 Reading weights: I=128 O=128 KERNEL=3x3x1 Reading weights: I=128 O=128 KERNEL=3x3x1 Reading weights: I=256 O=256 KERNEL=1x1x1 Reading weights: I=512 O=512 KERNEL=3x3x1 Reading weights: I=512 O=256 KERNEL=1x1x1 Reading weights: I=256 O=512 KERNEL=3x3x1 Reading weights: I=512 O=255 KERNEL=1x1x1 Not supported field: anchors = 10,14, 23,27, 37,58, 81,82, 135,169, 344,319 Not supported field: jitter=.3 Not supported field: cls_normalizer=1.0 Not supported field: iou_normalizer=0.07 Not supported field: iou_loss=ciou Not supported field: ignore_thresh = .7 Not supported field: truth_thresh = 1 Not supported field: random=0 Reading weights: I=256 O=128 KERNEL=1x1x1 Reading weights: I=384 O=256 KERNEL=3x3x1 Reading weights: I=256 O=255 KERNEL=1x1x1 Not supported field: anchors = 10,14, 23,27, 37,58, 81,82, 135,169, 344,319 Not supported field: jitter=.3 Not supported field: cls_normalizer=1.0 Not supported field: iou_normalizer=0.07 Not supported field: iou_loss=ciou Not supported field: ignore_thresh = .7 Not supported field: truth_thresh = 1 Not supported field: random=0
====================== NETWORK MODEL ====================== N. Layer type input (HW,CH) output (HW,CH) 0 Conv2d 416 x 416, 3 -> 208 x 208, 32 1 ActivationLeaky 208 x 208, 32 -> 208 x 208, 32 2 Conv2d 208 x 208, 32 -> 104 x 104, 64 3 ActivationLeaky 104 x 104, 64 -> 104 x 104, 64 4 Conv2d 104 x 104, 64 -> 104 x 104, 64 5 ActivationLeaky 104 x 104, 64 -> 104 x 104, 64 6 Route 104 x 104, 32 -> 104 x 104, 32 7 Conv2d 104 x 104, 32 -> 104 x 104, 32 8 ActivationLeaky 104 x 104, 32 -> 104 x 104, 32 9 Conv2d 104 x 104, 32 -> 104 x 104, 32 10 ActivationLeaky 104 x 104, 32 -> 104 x 104, 32 11 Route 104 x 104, 64 -> 104 x 104, 64 12 Conv2d 104 x 104, 64 -> 104 x 104, 64 13 ActivationLeaky 104 x 104, 64 -> 104 x 104, 64 14 Route 104 x 104, 128 -> 104 x 104, 128 15 Pooling 104 x 104, 128 -> 52 x 52, 128 16 Conv2d 52 x 52, 128 -> 52 x 52, 128 17 ActivationLeaky 52 x 52, 128 -> 52 x 52, 128 18 Route 52 x 52, 64 -> 52 x 52, 64 19 Conv2d 52 x 52, 64 -> 52 x 52, 64 20 ActivationLeaky 52 x 52, 64 -> 52 x 52, 64 21 Conv2d 52 x 52, 64 -> 52 x 52, 64 22 ActivationLeaky 52 x 52, 64 -> 52 x 52, 64 23 Route 52 x 52, 128 -> 52 x 52, 128 24 Conv2d 52 x 52, 128 -> 52 x 52, 128 25 ActivationLeaky 52 x 52, 128 -> 52 x 52, 128 26 Route 52 x 52, 256 -> 52 x 52, 256 27 Pooling 52 x 52, 256 -> 26 x 26, 256 28 Conv2d 26 x 26, 256 -> 26 x 26, 256 29 ActivationLeaky 26 x 26, 256 -> 26 x 26, 256 30 Route 26 x 26, 128 -> 26 x 26, 128 31 Conv2d 26 x 26, 128 -> 26 x 26, 128 32 ActivationLeaky 26 x 26, 128 -> 26 x 26, 128 33 Conv2d 26 x 26, 128 -> 26 x 26, 128 34 ActivationLeaky 26 x 26, 128 -> 26 x 26, 128 35 Route 26 x 26, 256 -> 26 x 26, 256 36 Conv2d 26 x 26, 256 -> 26 x 26, 256 37 ActivationLeaky 26 x 26, 256 -> 26 x 26, 256 38 Route 26 x 26, 512 -> 26 x 26, 512 39 Pooling 26 x 26, 512 -> 13 x 13, 512 40 Conv2d 13 x 13, 512 -> 13 x 13, 512 41 ActivationLeaky 13 x 13, 512 -> 13 x 13, 512 42 Conv2d 13 x 13, 512 -> 13 x 13, 256 43 ActivationLeaky 13 x 13, 256 -> 13 x 13, 256 44 Conv2d 13 x 13, 256 -> 13 x 13, 512 45 ActivationLeaky 13 x 13, 512 -> 13 x 13, 512 46 Conv2d 13 x 13, 512 -> 13 x 13, 255 47 Yolo 13 x 13, 255 -> 13 x 13, 255 48 Route 13 x 13, 256 -> 13 x 13, 256 49 Conv2d 13 x 13, 256 -> 13 x 13, 128 50 ActivationLeaky 13 x 13, 128 -> 13 x 13, 128 51 Upsample 13 x 13, 128 -> 26 x 26, 128 52 Route 26 x 26, 384 -> 26 x 26, 384 53 Conv2d 26 x 26, 384 -> 26 x 26, 256 54 ActivationLeaky 26 x 26, 256 -> 26 x 26, 256 55 Conv2d 26 x 26, 256 -> 26 x 26, 255 56 Yolo 26 x 26, 255 -> 26 x 26, 255
N params: 6049888
Max feature map size: 2768896
N MACC: 3453938176
GPU free memory: 6350.18 mb.
New NetworkRT (TensorRT v8.41)
Float16 support: 1
Int8 support: 1
DLAs: 0
Selected maxBatchSize: 1
GPU free memory: 6190.79 mb.
Building tensorRT cuda engine...
saving serialized network to file 25871156
create execution context
Input/outputs numbers: 3
input index = 0 -> output index = 2
Data dim: 1 3 416 416 1
Data dim: 1 255 26 26 1
NUMBER OF LAYERS IN NETWORK : 95
NUMBER OF LAYERS IN ENGINE : 54
RtBuffer 0 dim: Data dim: 1 3 416 416 1
RtBuffer 1 dim: Data dim: 1 255 13 13 1
RtBuffer 2 dim: Data dim: 1 255 26 26 1
====== CUDNN inference ======
Data dim: 1 3 416 416 1
Data dim: 1 255 26 26 1

===== TENSORRT inference ====
Data dim: 1 3 416 416 1
Data dim: 1 255 26 26 1
=== OUTPUT 0 CHECK RESULTS ==
CUDNN vs correct | OK ~0.02
TRT vs correct |
 [ 0 ]: 0.5 0.678479 |
 [ 1 ]: 0.5 0.445334 |
 [ 2 ]: 0.5 0.40175 |
 [ 3 ]: 0.5 0.459043 |
 [ 4 ]: 0.5 0.467688 |
 [ 5 ]: 0.5 0.523031 |
 [ 6 ]: 0.5 0.540694 |
 [ 8 ]: 0.5 0.522987 |
 [ 9 ]: 0.5 0.521413 |
 Wrongs: 42952 ~0.02
CUDNN vs TRT |
 [ 0 ]: 0.678572 0.5 |
 [ 1 ]: 0.44539 0.5 |
 [ 2 ]: 0.401747 0.5 |
 [ 3 ]: 0.459072 0.5 |
 [ 4 ]: 0.467634 0.5 |
 [ 5 ]: 0.523156 0.5 |
 [ 6 ]: 0.540672 0.5 |
 [ 8 ]: 0.522996 0.5 |
 [ 9 ]: 0.521339 0.5 |
 Wrongs: 42952 ~0.02

=== OUTPUT 1 CHECK RESULTS ==
CUDNN vs correct | OK ~0.02
TRT vs correct |
 [ 0 ]: 0.5 0.607734 |
 [ 1 ]: 0.5 0.541202 |
 [ 2 ]: 0.5 0.428249 |
 [ 4 ]: 0.5 0.477934 |
 [ 5 ]: 0.5 0.474406 |
 [ 6 ]: 0.5 0.458975 |
 [ 7 ]: 0.5 0.466028 |
 [ 8 ]: 0.5 0.459243 |
 [ 9 ]: 0.5 0.457477 |
 Wrongs: 171635 ~0.02
CUDNN vs TRT |
 [ 0 ]: 0.607805 0.5 |
 [ 1 ]: 0.541125 0.5 |
 [ 2 ]: 0.428199 0.5 |
 [ 4 ]: 0.477936 0.5 |
 [ 5 ]: 0.4743 0.5 |
 [ 6 ]: 0.459102 0.5 |
 [ 7 ]: 0.466021 0.5 |
 [ 8 ]: 0.459368 0.5 |
 [ 9 ]: 0.457369 0.5 |
 Wrongs: 171637 ~0.02
12
TensorRT 8.2 + cuDNN 8.2 + CUDA 11.4

~/Development/tkdnn_temp/build$ ./test_yolo4tiny Not supported field: batch=64 Not supported field: subdivisions=1 Not supported field: momentum=0.9 Not supported field: decay=0.0005 Not supported field: angle=0 Not supported field: saturation = 1.5 Not supported field: exposure = 1.5 Not supported field: hue=.1 Not supported field: learning_rate=0.00261 Not supported field: burn_in=1000 Not supported field: max_batches = 500200 Not supported field: policy=steps Not supported field: steps=400000,450000 Not supported field: scales=.1,.1 New NETWORK (tkDNN v0.7, CUDNN v8.204) Reading weights: I=3 O=32 KERNEL=3x3x1 Reading weights: I=32 O=64 KERNEL=3x3x1 Reading weights: I=64 O=64 KERNEL=3x3x1 Reading weights: I=32 O=32 KERNEL=3x3x1 Reading weights: I=32 O=32 KERNEL=3x3x1 Reading weights: I=64 O=64 KERNEL=1x1x1 Reading weights: I=128 O=128 KERNEL=3x3x1 Reading weights: I=64 O=64 KERNEL=3x3x1 Reading weights: I=64 O=64 KERNEL=3x3x1 Reading weights: I=128 O=128 KERNEL=1x1x1 Reading weights: I=256 O=256 KERNEL=3x3x1 Reading weights: I=128 O=128 KERNEL=3x3x1 Reading weights: I=128 O=128 KERNEL=3x3x1 Reading weights: I=256 O=256 KERNEL=1x1x1 Reading weights: I=512 O=512 KERNEL=3x3x1 Reading weights: I=512 O=256 KERNEL=1x1x1 Reading weights: I=256 O=512 KERNEL=3x3x1 Reading weights: I=512 O=255 KERNEL=1x1x1 Not supported field: anchors = 10,14, 23,27, 37,58, 81,82, 135,169, 344,319 Not supported field: jitter=.3 Not supported field: cls_normalizer=1.0 Not supported field: iou_normalizer=0.07 Not supported field: iou_loss=ciou Not supported field: ignore_thresh = .7 Not supported field: truth_thresh = 1 Not supported field: random=0 Reading weights: I=256 O=128 KERNEL=1x1x1 Reading weights: I=384 O=256 KERNEL=3x3x1 Reading weights: I=256 O=255 KERNEL=1x1x1 Not supported field: anchors = 10,14, 23,27, 37,58, 81,82, 135,169, 344,319 Not supported field: jitter=.3 Not supported field: cls_normalizer=1.0 Not supported field: iou_normalizer=0.07 Not supported field: iou_loss=ciou Not supported field: ignore_thresh = .7 Not supported field: truth_thresh = 1 Not supported field: random=0
====================== NETWORK MODEL ====================== N. Layer type input (HW,CH) output (HW,CH) 0 Conv2d 416 x 416, 3 -> 208 x 208, 32 1 ActivationLeaky 208 x 208, 32 -> 208 x 208, 32 2 Conv2d 208 x 208, 32 -> 104 x 104, 64 3 ActivationLeaky 104 x 104, 64 -> 104 x 104, 64 4 Conv2d 104 x 104, 64 -> 104 x 104, 64 5 ActivationLeaky 104 x 104, 64 -> 104 x 104, 64 6 Route 104 x 104, 32 -> 104 x 104, 32 7 Conv2d 104 x 104, 32 -> 104 x 104, 32 8 ActivationLeaky 104 x 104, 32 -> 104 x 104, 32 9 Conv2d 104 x 104, 32 -> 104 x 104, 32 10 ActivationLeaky 104 x 104, 32 -> 104 x 104, 32 11 Route 104 x 104, 64 -> 104 x 104, 64 12 Conv2d 104 x 104, 64 -> 104 x 104, 64 13 ActivationLeaky 104 x 104, 64 -> 104 x 104, 64 14 Route 104 x 104, 128 -> 104 x 104, 128 15 Pooling 104 x 104, 128 -> 52 x 52, 128 16 Conv2d 52 x 52, 128 -> 52 x 52, 128 17 ActivationLeaky 52 x 52, 128 -> 52 x 52, 128 18 Route 52 x 52, 64 -> 52 x 52, 64 19 Conv2d 52 x 52, 64 -> 52 x 52, 64 20 ActivationLeaky 52 x 52, 64 -> 52 x 52, 64 21 Conv2d 52 x 52, 64 -> 52 x 52, 64 22 ActivationLeaky 52 x 52, 64 -> 52 x 52, 64 23 Route 52 x 52, 128 -> 52 x 52, 128 24 Conv2d 52 x 52, 128 -> 52 x 52, 128 25 ActivationLeaky 52 x 52, 128 -> 52 x 52, 128 26 Route 52 x 52, 256 -> 52 x 52, 256 27 Pooling 52 x 52, 256 -> 26 x 26, 256 28 Conv2d 26 x 26, 256 -> 26 x 26, 256 29 ActivationLeaky 26 x 26, 256 -> 26 x 26, 256 30 Route 26 x 26, 128 -> 26 x 26, 128 31 Conv2d 26 x 26, 128 -> 26 x 26, 128 32 ActivationLeaky 26 x 26, 128 -> 26 x 26, 128 33 Conv2d 26 x 26, 128 -> 26 x 26, 128 34 ActivationLeaky 26 x 26, 128 -> 26 x 26, 128 35 Route 26 x 26, 256 -> 26 x 26, 256 36 Conv2d 26 x 26, 256 -> 26 x 26, 256 37 ActivationLeaky 26 x 26, 256 -> 26 x 26, 256 38 Route 26 x 26, 512 -> 26 x 26, 512 39 Pooling 26 x 26, 512 -> 13 x 13, 512 40 Conv2d 13 x 13, 512 -> 13 x 13, 512 41 ActivationLeaky 13 x 13, 512 -> 13 x 13, 512 42 Conv2d 13 x 13, 512 -> 13 x 13, 256 43 ActivationLeaky 13 x 13, 256 -> 13 x 13, 256 44 Conv2d 13 x 13, 256 -> 13 x 13, 512 45 ActivationLeaky 13 x 13, 512 -> 13 x 13, 512 46 Conv2d 13 x 13, 512 -> 13 x 13, 255 47 Yolo 13 x 13, 255 -> 13 x 13, 255 48 Route 13 x 13, 256 -> 13 x 13, 256 49 Conv2d 13 x 13, 256 -> 13 x 13, 128 50 ActivationLeaky 13 x 13, 128 -> 13 x 13, 128 51 Upsample 13 x 13, 128 -> 26 x 26, 128 52 Route 26 x 26, 384 -> 26 x 26, 384 53 Conv2d 26 x 26, 384 -> 26 x 26, 256 54 ActivationLeaky 26 x 26, 256 -> 26 x 26, 256 55 Conv2d 26 x 26, 256 -> 26 x 26, 255 56 Yolo 26 x 26, 255 -> 26 x 26, 255
N params: 6049888
Max feature map size: 2768896
N MACC: 3453938176
GPU free memory: 6136.2 mb.
New NetworkRT (TensorRT v8.25)
Float16 support: 1
Int8 support: 1
DLAs: 0
Selected maxBatchSize: 1
GPU free memory: 5865.67 mb.
Building tensorRT cuda engine...
saving serialized network to file 25041160
create execution context
Input/outputs numbers: 3
input index = 0 -> output index = 2
Data dim: 1 3 416 416 1
Data dim: 1 255 26 26 1
NUMBER OF LAYERS IN NETWORK : 95
NUMBER OF LAYERS IN ENGINE : 57
RtBuffer 0 dim: Data dim: 1 3 416 416 1
RtBuffer 1 dim: Data dim: 1 255 13 13 1
RtBuffer 2 dim: Data dim: 1 255 26 26 1
====== CUDNN inference ======
Data dim: 1 3 416 416 1
Data dim: 1 255 26 26 1

===== TENSORRT inference ====
Data dim: 1 3 416 416 1
Data dim: 1 255 26 26 1
=== OUTPUT 0 CHECK RESULTS ==
CUDNN vs correct | OK ~0.02
TRT vs correct | OK ~0.02
CUDNN vs TRT | OK ~0.02
=== OUTPUT 1 CHECK RESULTS ==
CUDNN vs correct | OK ~0.02
TRT vs correct | OK ~0.02
CUDNN vs TRT | OK ~0.02
0
Steps To Reproduce
git clone https://github.com/ceccocats/tkDNN.git && cd tkDNN && mkdir build && cd build && cmake -DCMAKE_BUILD_TYPE=Release .. && make test_yolo4tiny && ./test_yolo4tiny

(You can run any test there; the demoConfig.yaml inside the demo folder might need modification to point it to the right rt file. Running any test with TensorRT 8.4 fails the TRT vs correct and TRT vs CUDNN checks.)
Are there any workarounds for fixing this, or is this a WSL-specific issue?
Would it be possible for someone to help me out here? I can't figure out why tkDNN won't work with TensorRT 8.4 on WSL2 when there are no issues running it natively on Windows or Linux, and why this issue doesn't exist with previous versions of TensorRT on WSL2.
Could you try CUDA-11.6? TRT 8.4 was tested with CUDNN-8.4 + CUDA-11.6
> Could you try CUDA-11.6? TRT 8.4 was tested with CUDNN-8.4 + CUDA-11.6
Yup, the issue still exists. (I also tried it with CUDA 11.4 and faced the same issue with TRT 8.4.)
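For anyone debugging a similar setup, a quick way to sanity-check which CUDA/cuDNN/TensorRT libraries the test binary actually resolves at runtime (the library names below are the usual ones; adjust for your install):

```bash
# Show which nvinfer / cudnn / cuda shared objects the dynamic linker
# resolves for the test binary. Paths under /usr/lib/wsl/lib vs
# /usr/lib/x86_64-linux-gnu reveal which copies are actually loaded.
ldd ./test_yolo4tiny | grep -E 'nvinfer|cudnn|cudart|libcuda'

# Cross-check which versions the linker cache knows about.
ldconfig -p | grep -E 'nvinfer|cudnn'
```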
I see...
@kevinch-nv Do we have any internal testing for WSL?
We do not, we only test native Windows and native Ubuntu which both seem to be passing.
What's the use case of using WSL over the native OS?
Honestly, there is no use case for WSL over a native OS, at least for me. I started playing around with WSL in my free time and came across this issue, so I figured I'd open an issue to report it in case it's a bug. I can close this issue if WSL isn't a TensorRT priority.
I can confirm the same error happens to me as well. TensorRT 8.4 is not working properly on WSL2. I compiled the EfficientNet Python model on a Titan Xp machine under native Ubuntu 20.04 and it works fine there. When I copy the compiled TRT model from native Ubuntu 20.04 to WSL2, the model silently produces wrong results.
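One caveat worth checking in this scenario: serialized TensorRT engines are generally tied to the GPU and TensorRT build that produced them, so an engine compiled on the Titan Xp machine isn't expected to be portable to a different setup. Rebuilding the engine on the WSL2 side is the cleaner test. A sketch using trtexec, where model.onnx is a placeholder for the exported network:

```bash
# Rebuild the engine on the target (WSL2) machine instead of copying it
# over; serialized engines are tied to the GPU and TensorRT build.
# "model.onnx" is a placeholder for your exported network.
trtexec --onnx=model.onnx --saveEngine=model.engine

# Run the rebuilt engine and dump its outputs for comparison.
trtexec --loadEngine=model.engine --dumpOutput
```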
I have the same issue: WSL2 Ubuntu 20.04 with an RTX 3090, and neither TensorRT 8.4 nor 8.2 works.
Same issue for me. TensorRT 8.4 on WSL2 gives rubbish results on model inference. See my post:
https://forums.developer.nvidia.com/t/onnx-tensorrt-inference-gives-wrong-result/230625
Can I run TensorRT 8.4 on Ubuntu in a VM?
> Would it be possible for someone to help me out here? I can't figure out why tkDNN won't work with TensorRT 8.4 on WSL2 when there are no issues running it natively on Windows or Linux, and why this issue doesn't exist with previous versions of TensorRT on WSL2.
@perseusdg do you know which other versions of TensorRT work with WSL2?
Pretty much every version of TensorRT up to 8.2 worked on WSL2 for me, but I had to copy all the lib files from /usr/lib/wsl/lib to /usr/lib/x86_64-linux-gnu for it to work properly. As for running it in a VM with Ubuntu: if your VM has direct GPU access (passthrough), TensorRT should run properly.
> Pretty much every version of TensorRT up to 8.2 worked on WSL2 for me, but I had to copy all the lib files from /usr/lib/wsl/lib to /usr/lib/x86_64-linux-gnu for it to work properly.
When I downgrade to TensorRT 8.2 it works for me, but TensorRT 8.4 on WSL2 is still not working.
TensorRT 8.5 is released. I wonder if it would work with WSL2.
Yup, it's working now with TensorRT 8.5.2.
I upgraded to TensorRT 8.5.3, and the previous issue is gone now.
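For anyone verifying that an upgrade took effect, a quick way to confirm which TensorRT version is actually installed (assuming the Debian packages and/or Python bindings):

```bash
# List the installed TensorRT packages (Debian/Ubuntu installs).
dpkg -l | grep -i nvinfer

# If the Python bindings are installed, query the version directly.
python3 -c "import tensorrt; print(tensorrt.__version__)"
```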