
Performance report for tensorrt_yolov4

Open jstumpin opened this issue 3 years ago • 3 comments

As an extension to the preliminary benchmark for tensorrt_yolov4, batch inference performance is provided as follows:

| repo. | batch=1 | batch=2 | batch=4 | batch=8 |
| --- | --- | --- | --- | --- |
| tkDNN | N/A (N/A) 207.81 | N/A | N/A (N/A) 443.32 | N/A |
| isarsoft | 7.96 (N/A) 125.4 | N/A | 21.0 (N/A) 189.6 | 38.3 (N/A) 208.0 |
| this | 7.023 (2.61747) 120.831 | 4.393 (1.76344) 186.44 | 3.688 (1.26853) 223.68 | 3.42267 (0.888971) 239.063 |

where each cell is formatted as: wall-time in ms (standard deviation of wall-time) frames per second. Wall-time covers only pre-processing + inference + post-processing, while FPS is measured end-to-end: from image acquisition to image overlay, without display.

For fairness, AlexeyAB's repository is excluded since it does not include FP16 numbers. While all repositories use a 320x320 input size and FP16 precision, the listed repositories are not directly comparable, as each uses its own metrics. Moreover, both of them use an NVIDIA GeForce RTX 2080 Ti, whereas for this repository I am using an NVIDIA GeForce RTX 2070.

jstumpin avatar Jan 14 '21 00:01 jstumpin

@jstumpin Hi,

Are all these results for 320x320?


> 3.42267 (0.888971) 239.063

239 FPS with batch=8 means 239/8 ≈ 30 batches per second, so the latency can't be less than 1000/30 ≈ 33 ms.
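The arithmetic above can be sketched as a small helper (the function name is mine, for illustration only):

```cpp
#include <cassert>

// Lower bound on per-batch latency implied by a throughput figure:
// at F frames/sec with batch size B, only F/B batches complete per second,
// so one batch cannot take less than 1000*B/F milliseconds on average.
double min_batch_latency_ms(double fps, int batch_size) {
    double batches_per_sec = fps / batch_size;  // e.g. 239.063 / 8 ~= 30
    return 1000.0 / batches_per_sec;            // e.g. ~33 ms per batch
}
```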


> For fairness, AlexeyAB's does not include FP16 numbers hence the exclusion.

What do you mean? There are results for FP16 and FP32: https://github.com/AlexeyAB/darknet#geforce-rtx-2080-ti

AlexeyAB avatar Jan 14 '21 02:01 AlexeyAB

@AlexeyAB Hi,

> Are all these results for 320x320?
>
> 3.42267 (0.888971) 239.063


Indeed they are.

> 239 FPS with batch=8 means 239/8 ≈ 30 batches per second, so the latency can't be less than 1000/30 ≈ 33 ms.

The 3.42267 ms wall-time is the average, over the 3000-frame run, of the per-frame time of each batch of 8 frames, e.g.:

int batchSize = 8;
auto infer_start = std::chrono::steady_clock::now();
auto detections = infer(d_frames);  // pre-process + inference + post-process for the whole batch
auto infer_end = std::chrono::steady_clock::now();
// note: duration_cast<std::chrono::milliseconds> truncates to whole ms;
// std::chrono::duration<float, std::milli> would retain sub-ms precision
float infer_diff = std::chrono::duration_cast<std::chrono::milliseconds>(infer_end - infer_start).count();
avg_infer_times.push_back(infer_diff / batchSize);  // per-frame wall-time for this batch

therefore, averaging avg_infer_times over the 3000-frame run gives 3.42267.
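The final reduction over the accumulated samples could look like the following minimal sketch (the helper name is mine; it assumes avg_infer_times holds one per-frame sample per batch, as in the snippet above):

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of the per-frame wall-time samples gathered during the run;
// on the 3000-frame benchmark this yields the reported 3.42267 ms figure.
float mean_ms(const std::vector<float>& times) {
    return std::accumulate(times.begin(), times.end(), 0.0f) / times.size();
}
```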

The 239.063 FPS figure follows the same route, except that it also includes frame/input grabbing and frame/output overlaying, e.g.:

int batchSize = 8;
auto total_start = std::chrono::steady_clock::now();
d_reader->nextFrame(d_frame);            // grab next frame (GPU decode)
d_frames.push_back(d_frame.clone());
counter++;
if (counter == batchSize) 
{
    counter = 0;
    ...
    auto detections = infer(d_frames);   // batched inference
    ...
    for (int b = 0; b < batchSize; ++b) 
    {
        d_frames[b].download(frame);     // copy frame back to host for overlaying
        for (auto detection : detections[b])
            draw(detection.classId, detection.confidence, detection.left, detection.top, detection.right, detection.bottom, frame);
    }
    auto total_end = std::chrono::steady_clock::now();
    float total_diff = std::chrono::duration_cast<std::chrono::milliseconds>(total_end - total_start).count();
    avg_total_fps.push_back(1000 / (total_diff / batchSize));  // per-frame FPS sample for this batch
}

therefore, averaging avg_total_fps over the 3000-frame run gives 239.063.

Thus latency is not considered.

> For fairness, AlexeyAB's does not include FP16 numbers hence the exclusion.

> What do you mean? There are results for FP16 and FP32: https://github.com/AlexeyAB/darknet#geforce-rtx-2080-ti

The second column looks like FP32 performance numbers; the rest are third-party repos.

jstumpin avatar Jan 14 '21 03:01 jstumpin

@jstumpin I am using an RTX 3070 GPU with 8 GB of memory to run the YOLOv4 model on TensorRT with FP16 precision. I obtained 135 fps on average with the pre-processing kernel function implemented by CaoWGG/TensorRT-YOLOv4 (no batching). But with this repository's implementation (pre- and post-processing), I obtained about 40 fps per batch with batch_size=4, which gives a total of about 160 fps. In your results table, the throughput at batch_size=4 (223.68 fps) is almost 2x that of batch_size=1 (120.831 fps). I wonder why it is so much slower in my case.
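For reference, the scaling in the table can be checked with a one-line computation (the helper name is mine; the numbers are taken from the results table above):

```cpp
#include <cassert>

// Throughput gain from batching: ratio of end-to-end FPS at two batch sizes.
// From the table: 223.68 fps at batch=4 vs 120.831 fps at batch=1, i.e. ~1.85x.
double fps_speedup(double fps_small_batch, double fps_large_batch) {
    return fps_large_batch / fps_small_batch;
}
```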

spacewalk01 avatar Mar 03 '21 02:03 spacewalk01