Sanjib
Export: `yolo export model=".\best.pt" format="torchscript" imgsz=1600 dynamic=False device=0 batch=12` I will test TensorRT, but why does it perform so much worse when batching? Forward pass: ``` std::vector inputs{ tensor_imgs }; if...
Tested with TensorRT and got the following stats. I'm really curious why batching makes performance worse. GPU metrics show ~80% (avg) utilization with ~78% (avg) SM occupancy. | Batch Size | Forward Pass (ms/image) |...
Average. Dataset size: 248 images; iterate 10 times after 5 dummy warmup passes.
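A minimal sketch of how a measurement like this could be structured, assuming untimed warmup passes followed by timed passes over the whole dataset. This is not the harness used in the thread: `benchmark`, `BenchResult`, and the `forward` callable are hypothetical names, and `forward` is assumed to block until the GPU has finished (for example by calling `cudaDeviceSynchronize()` after the launch), since otherwise only the launch overhead is timed.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>

// Hypothetical result record: per-image latency, wall time, and throughput.
struct BenchResult {
    std::size_t images = 0;      // images processed in the timed passes
    double total_ms = 0.0;       // wall time of the timed passes
    double ms_per_image = 0.0;   // total_ms / images
    double images_per_sec = 0.0; // images / (total_ms / 1000)
};

// `forward(n)` is a hypothetical callable that runs one batch of n images
// and returns only after the device work has completed.
inline BenchResult benchmark(const std::function<void(std::size_t)>& forward,
                             std::size_t dataset_size, std::size_t batch_size,
                             int warmup_passes, int iterations) {
    BenchResult r;
    if (batch_size == 0 || dataset_size == 0) return r;

    // Warmup: untimed passes so lazy initialization does not skew the result.
    for (int i = 0; i < warmup_passes; ++i) forward(batch_size);

    const auto start = std::chrono::steady_clock::now();
    for (int it = 0; it < iterations; ++it) {
        for (std::size_t off = 0; off < dataset_size; off += batch_size) {
            // The last batch may be smaller than batch_size.
            const std::size_t n = std::min(batch_size, dataset_size - off);
            forward(n);
            r.images += n;
        }
    }
    const auto stop = std::chrono::steady_clock::now();

    r.total_ms = std::chrono::duration<double, std::milli>(stop - start).count();
    if (r.total_ms > 0.0) {
        r.ms_per_image = r.total_ms / static_cast<double>(r.images);
        r.images_per_sec = static_cast<double>(r.images) * 1000.0 / r.total_ms;
    }
    return r;
}
```

With the numbers quoted above, a call such as `benchmark(run_batch, 248, 12, 5, 10)` would time 2480 images in total, where `run_batch` stands in for the actual forward pass. Note that 248 is not a multiple of 12, so each pass ends with a partial batch of 8 images.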
Following is the result of running TensorRT in half precision.

| Batch Size (Images) | Forward Pass (ms/image) | Total Time (ms) | Throughput (img/s) |
|------------|----------------------------|------------------|----------------------|
| 1 |...
Thanks for your suggestions. I tested with `dynamic = false` as well and got the same performance. I will test the other suggestions too, but I need some time. I’m...
I used Nsight Compute. The following are the stats:
Batch-1:
DRAM bytes: 16.71 MB
DRAM read: 12.69 MB
DRAM write: 4.02 MB
GPU time: 0.162 ms
Bandwidth used: 16.71 MB...
Windows, CUDA 12.4, C++
I built it too, using CUDA 12.4 on Windows 10, but needed to patch a couple of files. Windows support needs some cleanup.
@mnorris11 Done!
> Hey @Sanjib-ac, can you take a look at this again? Maybe update it or close it if you're no longer working on it. I’m not working on it anymore...