
Discussion: Model Inference Optimization Techniques for Real-Time Streaming Pipeline

phial3 opened this issue 9 months ago · 5 comments

  1. Hardware-Accelerated Video Decoding

When the ffmpeg feature is enabled with hardware-acceleration support, the DataLoader's decoder should prioritize hardware-accelerated backends (e.g., nvdec/cuvid for NVIDIA GPUs, qsv for Intel GPUs).
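
As a rough illustration of that backend priority, here is a minimal, self-contained sketch; the DecoderBackend enum and is_available probe are hypothetical placeholders, not part of usls or rsmedia:

// Illustrative only: prefer hardware backends, fall back to software decoding.
#[derive(Debug, Clone, Copy)]
enum DecoderBackend {
    NvdecCuvid, // NVIDIA nvdec/cuvid
    Qsv,        // Intel Quick Sync Video
    Software,   // CPU fallback
}

// Hypothetical probe; a real implementation would query ffmpeg for the
// available hwaccel devices and codecs.
fn is_available(backend: DecoderBackend) -> bool {
    matches!(backend, DecoderBackend::Software)
}

fn pick_decoder_backend() -> DecoderBackend {
    [DecoderBackend::NvdecCuvid, DecoderBackend::Qsv, DecoderBackend::Software]
        .into_iter()
        .find(|b| is_available(*b))
        .unwrap_or(DecoderBackend::Software)
}

fn main() {
    println!("selected decoder backend: {:?}", pick_decoder_backend());
}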

As an example, consider using rsmedia, which provides hardware-accelerated decoding and encoding.

This is a modification I made to the DataLoader to support hardware acceleration; it only supports CUDA for now. Here is the commit:

usls Dataloader support Decoder Hardware acceleration commit

  2. Optimizing model.forward Speed with Underutilized GPU

I’ve noticed that GPU resources are significantly underutilized, but the inference speed remains very slow. What optimization strategies can I apply?

Here is my code example (`use` statements omitted):

fn main() -> Result<()> {
    let options = args::build_options()?;

    // build model
    let mut model = YOLO::try_from(options.commit()?)?;

    // build dataloader
    let dl = DataLoader::new(&args::input_source())?
        .with_batch(model.batch() as _)
        .with_device(Device::Cuda(0))
        .build()?;

    // build annotator
    let annotator = Annotator::default()
        .with_skeletons(&usls::COCO_SKELETONS_16)
        .without_masks(true)
        .with_bboxes_thickness(3)
        .with_saveout(model.spec());

    let mut position = Time::zero();
    let duration: Time = Time::from_nth_of_a_second(30);

    let mut encoder = EncoderBuilder::new(std::path::Path::new(&args::output()), 1920, 1080)
        .with_format("flv")
        .with_codec_name("h264_nvenc".to_string())
        .with_hardware_device(HWDeviceType::CUDA)
        .with_options(&Options::preset_h264_nvenc())
        .build()?;

    // run & annotate
    for (xs, _paths) in dl {

        let ys = model.forward(&xs)?;

        // extract bboxes
        for y in ys.iter() {
            if let Some(bboxes) = y.bboxes() {
                println!("[Bboxes]: Found {} objects", bboxes.len());
                for (i, bbox) in bboxes.iter().enumerate() {
                    println!("{}: {:?}", i, bbox)
                }
            }
        }

        // plot
        let frames = annotator.plot(&xs, &ys, false)?;

        // encode
        for (i, img) in frames.iter().enumerate() {
            // save image if needed
            img.save(format!("/tmp/images/{}_{}.png", string_now("-"), i))?;

            // image -> AVFrame
            let raw_frame = RawFrame::try_from_cv(&img.to_rgb8())?;
        
            // realtime streaming encoding
            encoder.encode_raw(&raw_frame)?;

            // Update the current position and add the inter-frame duration to it.
            position = position.aligned_with(duration).add();
        }
    }

    model.summary();

    encoder.finish().expect("failed to finish encoder");

    Ok(())
}
  3. End-to-End Pipeline with YOLO Detection + Hardware-Accelerated Encoding

Workflow: YOLO model detection → bounding-box rendering → real-time streaming with hardware-accelerated encoding (e.g., NVIDIA NVENC).

Consideration should be given to how to achieve resource efficiency and real-time streaming, so that the output remains smooth, stable, and clear.
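
One way to approach this, sketched below with plain std threads and bounded channels, is to decouple decoding, inference, and encoding so each stage stays busy instead of running serially in one loop; the Frame/Detections types and the three stage functions are placeholders, not the usls or rsmedia APIs:

use std::sync::mpsc;
use std::thread;

// Placeholder types standing in for decoded frames and detection results.
type Frame = Vec<u8>;
type Detections = usize;

// Placeholder stage functions; in the real pipeline these would be the
// DataLoader decode step, model.forward, and encoder.encode_raw.
fn decode_next(i: usize) -> Option<Frame> { (i < 10).then(|| vec![0u8; 4]) }
fn infer(_frame: &Frame) -> Detections { 0 }
fn encode(_frame: Frame, _dets: Detections) {}

fn main() {
    // Bounded channels apply back-pressure so no stage runs too far ahead.
    let (tx_frames, rx_frames) = mpsc::sync_channel::<Frame>(4);
    let (tx_results, rx_results) = mpsc::sync_channel::<(Frame, Detections)>(4);

    // Stage 1: decoding runs on its own thread so inference never waits on I/O.
    let decoder = thread::spawn(move || {
        let mut i = 0;
        while let Some(frame) = decode_next(i) {
            if tx_frames.send(frame).is_err() { break; }
            i += 1;
        }
    });

    // Stage 2: inference pulls decoded frames and forwards results.
    let inferer = thread::spawn(move || {
        for frame in rx_frames {
            let dets = infer(&frame);
            if tx_results.send((frame, dets)).is_err() { break; }
        }
    });

    // Stage 3: encoding/streaming on the main thread.
    for (frame, dets) in rx_results {
        encode(frame, dets);
    }

    decoder.join().unwrap();
    inferer.join().unwrap();
}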

phial3 · Mar 01 '25 12:03

Hardware acceleration sounds like a great solution! I’ll check out the suggestions and code you provided as soon as possible—thank you so much!

jamjamjon · Mar 01 '25 14:03

Timing statistics:

2025-03-03T15:17:05.761789748+08:00  INFO ThreadId(01) yolo_vision: 146: Performance stats after 1 batches:
Inference: avg=1.067834104s, min=1.039388029s, max=1.098278046s
Annotation: avg=440.746987ms, min=356.098842ms, max=487.375932ms
Encoding: avg=114.577614ms, min=86.224038ms, max=156.673274ms
Batch sender time: 372.879µs

2025-03-03T15:17:09.516312610+08:00  INFO ThreadId(01) yolo_vision: 146: Performance stats after 1 batches:
Inference: avg=1.070150794s, min=1.039388029s, max=1.098278046s
Annotation: avg=438.701176ms, min=356.098842ms, max=487.375932ms
Encoding: avg=117.700642ms, min=86.224038ms, max=156.673274ms
Batch sender time: 289.202µs

2025-03-03T15:17:13.308906499+08:00  INFO ThreadId(01) yolo_vision: 146: Performance stats after 1 batches:
Inference: avg=1.072397981s, min=1.039388029s, max=1.098278046s
Annotation: avg=433.072846ms, min=356.098842ms, max=487.375932ms
Encoding: avg=123.453407ms, min=86.224038ms, max=157.194381ms
Batch sender time: 235.147µs

2025-03-03T15:17:17.364376789+08:00  INFO ThreadId(01) yolo_vision: 146: Performance stats after 1 batches:
Inference: avg=1.066150328s, min=1.022416759s, max=1.098278046s
Annotation: avg=428.853764ms, min=356.098842ms, max=487.375932ms
Encoding: avg=122.870734ms, min=86.224038ms, max=157.194381ms
Batch sender time: 371.402µs

2025-03-03T15:17:21.137066568+08:00  INFO ThreadId(01) yolo_vision: 146: Performance stats after 1 batches:
Inference: avg=1.067140058s, min=1.022416759s, max=1.098278046s
Annotation: avg=430.064528ms, min=356.098842ms, max=487.375932ms
Encoding: avg=122.90406ms, min=86.224038ms, max=157.194381ms
Batch sender time: 309.676µs

......

The most time-consuming phases are inference and annotation, at roughly 1.07s and 430ms respectively. That is far too long for real-time push streaming. What would be a good optimization approach for this situation?

phial3 · Mar 03 '25 07:03

From the results of your code execution, it appears that model inference indeed occupies a significant amount of time, and the annotation step also takes a considerable amount of time. Here are several aspects to analyze:

    1. Hardware and Model Details: What are your machine and GPU models? What is the size or parameter count of the YOLO model used for inference, and what is the batch size? Based on experience, with an RTX 3060Ti, a batch size of 1, and an input image resolution of 640x640, the YOLOv8-m-det model's preprocessing takes approximately 1.5ms, model inference takes less than 20ms, and post-processing varies with the number of results and the machine's CPU performance, typically around 600µs. If the inference model is in ONNX format, you could try the TensorRT provider with FP16 precision for further acceleration (see the first sketch after this list).
    2. Annotator Performance: The plot() and annotate() methods of the annotator do not implement any parallel strategy and rely heavily on CPU performance (one possible mitigation is sketched after this list). The current implementation uses the imageproc crate for rendering results, which is somewhat slow. If speed is a priority, you could experiment with other crates for result rendering.
    3. DataLoader Considerations: After enabling hardware acceleration on an NVIDIA device, have you measured the time taken for video-stream decoding and encoding? These numbers would help analyze the Encoder's performance and the time spent in each iteration of the for loop (see the timing sketch after this list).
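
For point 1, this is roughly how the TensorRT provider and FP16 could be requested when building the options; with_trt and with_fp16 are assumed builder names here and may differ between usls versions, and the TensorRT execution provider also requires a TensorRT-enabled onnxruntime on the machine:

use anyhow::Result;
use usls::models::YOLO;

fn build_trt_model() -> Result<YOLO> {
    // `with_trt` / `with_fp16` are assumptions; check the Options API of your
    // usls version. `args::build_options()` is the same helper used in the
    // snippet above.
    let options = args::build_options()?
        .with_trt(0)       // TensorRT execution provider on GPU 0 (assumed)
        .with_fp16(true);  // FP16 precision (assumed)
    Ok(YOLO::try_from(options.commit()?)?)
}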
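
For point 2, one mitigation is to render the frames of a batch in parallel; a minimal sketch with rayon and placeholder frame/box types (not the usls Annotator API):

use rayon::prelude::*;

// Placeholder types standing in for decoded frames and detection results.
struct Frame { pixels: Vec<u8> }
struct BBox { x: u32, y: u32, w: u32, h: u32 }

// Placeholder drawing routine; in practice this would be imageproc (or another
// rasterization crate) drawing rectangles into the frame buffer.
fn draw_boxes(frame: &mut Frame, boxes: &[BBox]) {
    for b in boxes {
        let idx = (b.y as usize * 1920 + b.x as usize) * 3;
        if idx < frame.pixels.len() {
            frame.pixels[idx] = 255; // stand-in for real rectangle drawing
        }
        let _ = (b.w, b.h);
    }
}

fn main() {
    let mut frames: Vec<Frame> =
        (0..8).map(|_| Frame { pixels: vec![0; 1920 * 1080 * 3] }).collect();
    let detections: Vec<Vec<BBox>> =
        (0..8).map(|_| vec![BBox { x: 10, y: 10, w: 100, h: 50 }]).collect();

    // Annotate every frame of the batch in parallel instead of sequentially.
    frames
        .par_iter_mut()
        .zip(detections.par_iter())
        .for_each(|(frame, boxes)| draw_boxes(frame, boxes));
}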
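
For point 3, the per-stage timings can be collected with std::time::Instant around each phase of the loop; a minimal sketch where the four stage functions are placeholders for decoding, model.forward, annotator.plot, and encoder.encode_raw:

use std::time::{Duration, Instant};

// Placeholder stages; replace the sleeps with the real calls.
fn decode() { std::thread::sleep(Duration::from_millis(5)); }
fn infer() { std::thread::sleep(Duration::from_millis(20)); }
fn annotate() { std::thread::sleep(Duration::from_millis(8)); }
fn encode() { std::thread::sleep(Duration::from_millis(3)); }

// Run a stage and return how long it took.
fn timed<F: FnOnce()>(f: F) -> Duration {
    let start = Instant::now();
    f();
    start.elapsed()
}

fn main() {
    for batch in 0..3 {
        let d = timed(decode);
        let i = timed(infer);
        let a = timed(annotate);
        let e = timed(encode);
        println!("batch {batch}: decode={d:?} inference={i:?} annotation={a:?} encoding={e:?}");
    }
}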

I have been traveling on business recently and do not have access to a computer to test the whole pipeline and the rsmedia crate. Sorry for not being able to respond to your questions in a timely manner. You are welcome to leave further comments for discussion, and I will reply as soon as I see them.

jamjamjon · Mar 04 '25 00:03

I tried the rsmedia project, and it seems that there are some issues with the ffmpeg6 features.

Compiling rsmpeg v0.15.1+ffmpeg.7.0 (https://github.com/phial3/rsmpeg?branch=light#13f8c554)
error[E0605]: non-primitive cast: `unsafe extern "C" fn(*mut c_void, *const u8, i32) -> i32 {write_c}` as `unsafe extern "C" fn(*mut c_void, *mut u8, i32) -> i32`
   --> /home/qweasd/.cargo/git/checkouts/rsmpeg-6e0a08a626b70a61/13f8c55/src/avformat/avio.rs:148:50
    |
148 |                 write_packet.is_some().then_some(write_c as _),
    |                                                  ^^^^^^^^^^^^ invalid cast

For more information about this error, try `rustc --explain E0605`.
error: could not compile `rsmpeg` (lib) due to 1 previous error

I see that this project is under rapid development. I will keep following it and wait for further testing. @phial3

jamjamjon · Mar 11 '25 13:03

@jamjamjon The default features are ["ffmpeg7", "ndarray"]. If you use ffmpeg version 7.x, the defaults work as-is, like this:

rsmedia = { git = "https://github.com/phial3/rsmedia", branch = "rsmpeg" }

If you use ffmpeg version 6.x, the default features need to be disabled, like this:

rsmedia = { git = "https://github.com/phial3/rsmedia", branch = "rsmpeg", default-features = false, features = ["ffmpeg6", "ndarray"] }

phial3 · Mar 12 '25 00:03