TensorRT inference doesn't work after deserialisation
Inference works with cuDNN, and it works with TensorRT on the first creation of the network. However, TensorRT doesn't produce the correct output after deserialisation. The code executes fine, but the output data isn't right (and doesn't change when the input changes). The number and size of the input/output buffers are correct. I'm using a YOLOv4 network.
I don't have much experience programming GPU code directly, so I'm not sure where to take debugging efforts next. I think this is the problem highlighted in this thread: https://github.com/ceccocats/tkDNN/issues/28
I'm using Windows 10, TensorRT 7.0.0.11, CUDA 10.2, cuDNN 7.6.5. Thanks!
I'm wondering if there's an issue with the saved .rt file. Opening the file in binary mode stopped the crashing, but the file is over twice as big as the YOLOv4 weights it is built from. The weights are being imported OK, because the TensorRT version runs fine on first use (before the deserialised form is loaded).
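For reference, the way I'm saving the engine is roughly the usual serialise-and-write pattern, something like this (a sketch rather than the exact tkDNN code; the function name is mine):

```cpp
#include <fstream>
#include <iostream>
#include <string>
#include "NvInfer.h"

// Sketch: serialise a built engine to a .rt file. The stream has to be
// opened in binary mode, otherwise Windows newline translation corrupts
// the blob.
bool saveEngine(nvinfer1::ICudaEngine& engine, const std::string& path) {
    nvinfer1::IHostMemory* blob = engine.serialize();
    if (!blob) return false;

    std::ofstream file(path, std::ios::binary);
    if (!file) { blob->destroy(); return false; }

    file.write(static_cast<const char*>(blob->data()),
               static_cast<std::streamsize>(blob->size()));
    std::cout << "Serialised engine size: " << blob->size() << " bytes\n";

    blob->destroy();
    return file.good();
}
```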
I debugged as much as I could and saw the createPlugin function being called during deserialisation, and the config seems to be coming through fine. I also made sure my input and output buffers were initialised with a known value so I could see that inference was setting them to something, albeit the wrong values.
Hi Joe, unfortunately we don't use Windows, and on Ubuntu we have never had any problems at all. I can maybe check the serialisation functions, but I can't test on Windows right now.
Is there anything more I can do to debug on my end? I've done the obvious things but don't know whether TensorRT gives any support for debugging serialisation/deserialisation.
Given the file size, I'm wondering if for some reason the data is being written as doubles rather than floats, but I don't see how that could have crept in.
I'm wondering if the issue could be related to this:
https://github.com/NVIDIA/TensorRT/issues/178#issuecomment-547763304
i.e. layers being created with new rather than via the factory in the convert_layer functions? Perhaps that worked in TensorRT 6 but doesn't in version 7.
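For context, the legacy flow the factory is involved in looks roughly like this (just a sketch of the old IPluginFactory pattern, not tkDNN's actual classes; the names are illustrative):

```cpp
#include <cstring>
#include "NvInfer.h"

// Assumed helper that rebuilds the custom YOLO layer from its serialised
// bytes; it stands in for whatever the real implementation provides.
nvinfer1::IPlugin* createYoloPlugin(const void* serialData, size_t serialLength);

// Legacy plugin factory (the pre-IPluginV2 route, deprecated in TensorRT 7).
// During deserialisation the runtime calls createPlugin() once for every
// custom layer found in the engine; the factory has to recognise the layer
// by name and reconstruct the plugin from the serialised data.
class YoloPluginFactory : public nvinfer1::IPluginFactory {
public:
    nvinfer1::IPlugin* createPlugin(const char* layerName,
                                    const void* serialData,
                                    size_t serialLength) override {
        if (std::strstr(layerName, "yolo") != nullptr)
            return createYoloPlugin(serialData, serialLength);
        return nullptr; // unknown layer: deserialisation fails
    }
};
```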
You can activate the debug build with:
cmake .. -DDEBUG=True
make
But I don't think it will help you too much.
We tested on Jetson TX2, Jetson Xavier, and Jetson Nano with the latest JetPack, and on various PCs with a lot of different GPUs, but always with Ubuntu 16/18/20.
@JoeCool90 Hi
I have the same problem. Were you able to solve it?
I've started rewriting a YOLOv4 TensorRT implementation for my project. If I have success I'll report my conclusions. I'm thinking it may be an issue of compatibility with the way recent versions of TensorRT prefer things done, and that it is finally breaking with version 7. Looking at NVIDIA's sample code, some elements look quite a bit different (e.g. the use of IPluginV2Ext rather than IPlugin).
Might take me a couple of days though.
If you're in a rush, you might want to try using an older version of TensorRT. Maybe that will work.
@JoeCool90 Hi,
I've tested with an older version of TensorRT, but it didn't work.
> If you're in a rush, you might want to try using an older version of TensorRT. Maybe that will work.
As far as getting the right output is concerned, TensorRT v6.0.1 didn't do the trick.
> I've started rewriting a YOLOv4 TensorRT implementation for my project. If I have success I'll report my conclusions. I'm thinking it may be an issue of compatibility with the way recent versions of TensorRT prefer things done, and that it is finally breaking with version 7. Looking at NVIDIA's sample code, some elements look quite a bit different (e.g. the use of IPluginV2Ext rather than IPlugin).
> Might take me a couple of days though.
You can use https://github.com/CaoWGG/TensorRT-YOLOv4 as a reference; it works fine (even faster than NVIDIA's original repo) on Windows. The only thing is that batch inference isn't supported.
Thanks for the two links. I've been rewriting an implementation for myself based on tkDNN, https://github.com/wang-xinyu/tensorrtx, and the TensorRT samples.
Everything seems to be moving so fast that code you're looking at one week has changed the next. I'm working with TensorRT 7.
I'll use those two links as references too and see if I can get it all working. I'm nearly finished with an initial rewrite, but I think I'm going to pause and see if I can test one layer at a time against the tkDNN CUDA implementation, just so I know I haven't introduced bugs.
As far as I can tell, the main difference between what I'm doing (and what tensorrtx has) and tkDNN is the use of the more recent IPluginCreator and IPluginV2IOExt interfaces.
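Roughly, the newer route looks like this (a trimmed-down sketch with illustrative names, not tensorrtx's or tkDNN's actual code): once the creator is registered, the runtime can rebuild the custom layers during deserialisation without being handed a factory.

```cpp
#include <string>
#include "NvInfer.h"

// Sketch of the newer plugin route. A real YoloLayerPlugin implementing
// nvinfer1::IPluginV2IOExt is assumed to exist elsewhere; only the creator
// side is shown here.
class YoloPluginCreator : public nvinfer1::IPluginCreator {
public:
    const char* getPluginName() const override { return "YoloLayer_TRT"; }
    const char* getPluginVersion() const override { return "1"; }
    const nvinfer1::PluginFieldCollection* getFieldNames() override { return &mFC; }

    // Called when building the network from plugin fields.
    nvinfer1::IPluginV2* createPlugin(const char* /*name*/,
                                      const nvinfer1::PluginFieldCollection* /*fc*/) override {
        return nullptr; // real code would construct a YoloLayerPlugin from the fields
    }

    // Called automatically during engine deserialisation; this is what
    // replaces the old IPluginFactory::createPlugin call.
    nvinfer1::IPluginV2* deserializePlugin(const char* /*name*/,
                                           const void* /*serialData*/,
                                           size_t /*serialLength*/) override {
        return nullptr; // real code: new YoloLayerPlugin(serialData, serialLength)
    }

    void setPluginNamespace(const char* ns) override { mNamespace = ns; }
    const char* getPluginNamespace() const override { return mNamespace.c_str(); }

private:
    nvinfer1::PluginFieldCollection mFC{};
    std::string mNamespace;
};

// Registers the creator with the global plugin registry so that
// deserializeCudaEngine(blob, size, nullptr) can look it up by name/version.
REGISTER_TENSORRT_PLUGIN(YoloPluginCreator);
```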
The problem with tensorrtx is that it is hard-coded, so it doesn't adapt to your cfg file; I'm basing my parsing on tkDNN instead. The YOLOv4 sources are also being updated - e.g. the mish activation function has been changed slightly - so I'm trying to stay as up to date as possible.
A big thanks to people like ceccocats for creating these libraries.
So far so good. I've got the first convolution layer done (I've rewritten a lot of the parser, but only tested this much so far) and verified that it gives the same output as the tkDNN CUDA implementation. It works, and I can serialise/deserialise fine.
One thing I noticed when building the network is that you need to hold onto the data for the weights, etc. You can't just call network->addConvolutionNd, for example, and then release the weights data. You won't get any errors from the functions, but the network will produce a fairly blank output. Although this isn't overly surprising, I'm not sure it's documented anywhere.
The reason I say this is that when using the deserialised tkDNN network, I get junk output that reminds me of the patterns I was seeing when I wasn't holding onto the weights data. I'm wondering if this might be the problem, but I'm not sure.
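To illustrate the lifetime point (a minimal sketch with made-up names, not my actual code): because TensorRT only reads the weights when the engine is built, each layer-conversion helper has to park its buffers somewhere that outlives the whole build.

```cpp
#include <cstdint>
#include <vector>
#include "NvInfer.h"

// Sketch: Weights only stores pointers; addConvolutionNd does not copy the
// data. If each conversion helper keeps its weights in a local buffer, that
// buffer is gone by the time the engine is built, and you get a silently
// wrong (fairly blank) output instead of an error.
class NetBuilder {
public:
    nvinfer1::ILayer* addConv(nvinfer1::INetworkDefinition& net,
                              nvinfer1::ITensor& input,
                              std::vector<float> kernel,  // e.g. read from the .weights file
                              std::vector<float> bias,
                              int nbOutputs, int k) {
        // Move the buffers into long-lived storage before taking pointers.
        weightStore.push_back(std::move(kernel));
        weightStore.push_back(std::move(bias));
        const std::vector<float>& kBuf = weightStore[weightStore.size() - 2];
        const std::vector<float>& bBuf = weightStore.back();

        nvinfer1::Weights kw{nvinfer1::DataType::kFLOAT, kBuf.data(),
                             static_cast<int64_t>(kBuf.size())};
        nvinfer1::Weights bw{nvinfer1::DataType::kFLOAT, bBuf.data(),
                             static_cast<int64_t>(bBuf.size())};
        return net.addConvolutionNd(input, nbOutputs, nvinfer1::DimsHW{k, k}, kw, bw);
    }

    // Must stay alive until builder->buildEngineWithConfig(...) has returned.
    std::vector<std::vector<float>> weightStore;
};
```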
Yolo v4 now uses an updated version of the mish activation function but the difference is minimal - as it should be I guess.
@JoeCool90 two more repos for your reference (untested myself):
- https://github.com/enazoe/yolo-tensorrt
- https://github.com/opencv/opencv/issues/17795#issuecomment-656553410
Thanks @jstumpin
I got my code working: so far, 32-bit single and batched modes (batched gives about 97 fps on a 2070 Super for 416x416 YOLOv4 - and yes, I could do with retraining my network at 512x512). I'm not planning to post my code as it's a bit messy and customised to my needs, but I can post it if people really need it.
The serialised file is a similar size to the one produced by tkDNN. TensorRT must have a lot of overhead in its serialisation, because the file is almost twice the size of the original weights file.
I'm guessing the bug with tkDNN is some sort of memory error when deserialising, but I haven't spotted it yet. When coding my own version, a couple of times I had memory bugs that caused similar issues (getting the same set of wrong values out of a layer during inference). For example, when building the network, if you pass weights to an add-layer function, TensorRT doesn't copy them internally, so you have to keep that memory alive until you build the engine. tkDNN has this dual cuDNN/TensorRT thing going on, and I'd bet the memory error is happening in there somewhere.
The other difference is that tkDNN uses an older method for the custom layers. I, like tensorrtx, use IPluginV2IOExt, etc. But I haven't seen that the older methods have been deprecated, and since they work when building, I'm not sure that this is the problem.
All of the implementations are out of date compared to the YOLOv4 source. Since that's been updated at an insane pace, I'm not surprised. I don't think it's a big issue - just things like minor modifications to transfer functions for speed, or the sorting/merging methods for the detections. tensorrtx is missing the scaling function in the YOLO layers. I would say tkDNN gives a decent reference.
@JoeCool90 I'm looking to test TensorRT 7.2.1 YOLOv4 on Windows. Any help getting this repo working, or the code you have written, would be greatly appreciated.
@JoeCool90 I faced a similar issue. I tried deserializing the .rt file created by tkDNN using a sample Python API script and got:
[TensorRT] ERROR: deserializationUtils.cpp (528) - Serialization Error in load: 0 (Serialized engine contains plugin, but no plugin factory was provided. To deserialize an engine without a factory, please use IPluginV2 instead
Is there any way to solve this error?
I tried different TensorRT versions [6.0.1, 7.1.3, 7.0.0, 7.2.1]. I created .rt files with all of these versions and was unable to deserialize them using the same TensorRT versions. I am not that familiar with TensorRT, so any leads would be great. Thanks!
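If I'm reading the error right, the engine contains old-style IPlugin layers, so whatever plugin factory was used at build time has to be handed back to the runtime at load time. In C++ I believe the two variants look roughly like this (just a sketch; Logger, readFile and YoloPluginFactory are placeholders, not tkDNN's actual names):

```cpp
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
#include "NvInfer.h"

// Minimal logger and file loader so the sketch is self-contained.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) override {
        if (severity <= Severity::kWARNING) std::cout << msg << std::endl;
    }
} gLogger;

std::vector<char> readFile(const std::string& path) {
    std::ifstream f(path, std::ios::binary);
    return std::vector<char>((std::istreambuf_iterator<char>(f)),
                             std::istreambuf_iterator<char>());
}

nvinfer1::ICudaEngine* loadEngine(const std::string& path) {
    std::vector<char> blob = readFile(path);
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);

    // Old-style plugins (IPlugin): the same factory used at build time must
    // be supplied again, otherwise you get the "no plugin factory was
    // provided" error above.
    //   YoloPluginFactory factory;   // placeholder name
    //   return runtime->deserializeCudaEngine(blob.data(), blob.size(), &factory);

    // New-style plugins (IPluginV2 with a registered IPluginCreator): no
    // factory is needed; the runtime finds the creator in the plugin registry.
    return runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
}
```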
tkDNN is now supported on Windows 10 on the current master branch, at an experimental level and with some caveats.