Discussion: Model Inference Optimization Techniques for Real-Time Streaming Pipeline
- Hardware-Accelerated Video Decoding
When FFmpeg is built with hardware-acceleration features enabled, the DataLoader's decoder should prioritize hardware-accelerated backends (e.g., nvdec/cuvid for NVIDIA GPUs, qsv for Intel GPUs).
As an example, consider using rsmedia, which provides hardware-accelerated decoding and encoding.
Here is a modification I made to the DataLoader to support hardware acceleration; it only supports CUDA for now. The commit:
usls DataLoader decoder hardware-acceleration support commit
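For reference, here is a minimal sketch of what the hardware-decode path looks like from the caller's side, assuming the CUDA-only support from the commit above. The with_device call mirrors the pipeline code further down in this thread; the stream URL and the anyhow::Result import are placeholders/assumptions:

use usls::{DataLoader, Device};

fn main() -> anyhow::Result<()> {
    // Ask the DataLoader to decode on the first NVIDIA GPU so FFmpeg can
    // select nvdec/cuvid instead of a software decoder (CUDA only, per the commit).
    let dl = DataLoader::new("rtsp://example/stream")?
        .with_device(Device::Cuda(0))
        .build()?;

    for (_frames, _paths) in dl {
        // ... feed decoded frames to the model here ...
    }
    Ok(())
}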
- Optimizing model.forward Speed with an Underutilized GPU
I’ve noticed that GPU resources are significantly underutilized, but the inference speed remains very slow. What optimization strategies can I apply?
This is my code example:
fn main() -> Result<()> {
    let options = args::build_options()?;

    // build model
    let mut model = YOLO::try_from(options.commit()?)?;

    // build dataloader
    let dl = DataLoader::new(&args::input_source())?
        .with_batch(model.batch() as _)
        .with_device(Device::Cuda(0))
        .build()?;

    // build annotator
    let annotator = Annotator::default()
        .with_skeletons(&usls::COCO_SKELETONS_16)
        .without_masks(true)
        .with_bboxes_thickness(3)
        .with_saveout(model.spec());

    let mut position = Time::zero();
    let duration: Time = Time::from_nth_of_a_second(30);

    let mut encoder = EncoderBuilder::new(std::path::Path::new(&args::output()), 1920, 1080)
        .with_format("flv")
        .with_codec_name("h264_nvenc".to_string())
        .with_hardware_device(HWDeviceType::CUDA)
        .with_options(&Options::preset_h264_nvenc())
        .build()?;

    // run & annotate
    for (xs, _paths) in dl {
        let ys = model.forward(&xs)?;

        // extract bboxes
        for y in ys.iter() {
            if let Some(bboxes) = y.bboxes() {
                println!("[Bboxes]: Found {} objects", bboxes.len());
                for (i, bbox) in bboxes.iter().enumerate() {
                    println!("{}: {:?}", i, bbox);
                }
            }
        }

        // plot
        let frames = annotator.plot(&xs, &ys, false)?;

        // encode
        for (i, img) in frames.iter().enumerate() {
            // save image if needed
            img.save(format!("/tmp/images/{}_{}.png", string_now("-"), i))?;
            // image -> AVFrame
            let raw_frame = RawFrame::try_from_cv(&img.to_rgb8())?;
            // realtime streaming encoding
            encoder.encode_raw(&raw_frame)?;
            // Update the current position and add the inter-frame duration to it.
            position = position.aligned_with(duration).add();
        }
    }

    model.summary();
    encoder.finish().expect("failed to finish encoder");
    Ok(())
}
- End-to-End Pipeline with YOLO Detection + Hardware-Accelerated Encoding
Workflow: YOLO model detection → bounding box rendering → real-time streaming encoding (e.g., NVIDIA NVENC).
Consideration should be given to how to achieve resource efficiency, and to real-time streaming that delivers smooth, stable, and clear picture quality.
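On the resource-efficiency point, one common strategy is to decouple the stages so that decoding, inference + annotation, and encoding overlap instead of running back-to-back in a single loop; the "Batch sender time" entries in the statistics below suggest a channel-based sender is already partly in place. Here is a minimal, illustrative sketch with bounded std channels, where the fake frames stand in for the DataLoader, model.forward/annotator.plot, and encoder.encode_raw calls above:

use std::sync::mpsc;
use std::thread;

// Placeholder for a decoded frame/batch; the real pipeline would carry images.
type Frame = Vec<u8>;

fn main() {
    // Bounded channels give backpressure so no stage runs unboundedly ahead.
    let (tx_infer, rx_infer) = mpsc::sync_channel::<Frame>(4);
    let (tx_encode, rx_encode) = mpsc::sync_channel::<Frame>(4);

    // Stage 1: decode (stands in for the DataLoader iteration).
    let decode = thread::spawn(move || {
        for _ in 0..100 {
            let frame = vec![0u8; 1920 * 1080 * 3]; // fake decoded RGB frame
            if tx_infer.send(frame).is_err() {
                break; // downstream hung up
            }
        }
    });

    // Stage 2: inference + annotation (stands in for model.forward + annotator.plot).
    let infer = thread::spawn(move || {
        for frame in rx_infer {
            // ... run the model and draw results here ...
            if tx_encode.send(frame).is_err() {
                break;
            }
        }
    });

    // Stage 3: encode/stream (stands in for encoder.encode_raw).
    let encode = thread::spawn(move || {
        for _frame in rx_encode {
            // ... push the annotated frame to the NVENC encoder here ...
        }
    });

    decode.join().unwrap();
    infer.join().unwrap();
    encode.join().unwrap();
}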
Hardware acceleration sounds like a great solution! I’ll check out the suggestions and code you provided as soon as possible—thank you so much!
Timing statistics:
2025-03-03T15:17:05.761789748+08:00 INFO ThreadId(01) yolo_vision: 146: Performance stats after 1 batches:
Inference: avg=1.067834104s, min=1.039388029s, max=1.098278046s
Annotation: avg=440.746987ms, min=356.098842ms, max=487.375932ms
Encoding: avg=114.577614ms, min=86.224038ms, max=156.673274ms
Batch sender time: 372.879µs
2025-03-03T15:17:09.516312610+08:00 INFO ThreadId(01) yolo_vision: 146: Performance stats after 1 batches:
Inference: avg=1.070150794s, min=1.039388029s, max=1.098278046s
Annotation: avg=438.701176ms, min=356.098842ms, max=487.375932ms
Encoding: avg=117.700642ms, min=86.224038ms, max=156.673274ms
Batch sender time: 289.202µs
2025-03-03T15:17:13.308906499+08:00 INFO ThreadId(01) yolo_vision: 146: Performance stats after 1 batches:
Inference: avg=1.072397981s, min=1.039388029s, max=1.098278046s
Annotation: avg=433.072846ms, min=356.098842ms, max=487.375932ms
Encoding: avg=123.453407ms, min=86.224038ms, max=157.194381ms
Batch sender time: 235.147µs
2025-03-03T15:17:17.364376789+08:00 INFO ThreadId(01) yolo_vision: 146: Performance stats after 1 batches:
Inference: avg=1.066150328s, min=1.022416759s, max=1.098278046s
Annotation: avg=428.853764ms, min=356.098842ms, max=487.375932ms
Encoding: avg=122.870734ms, min=86.224038ms, max=157.194381ms
Batch sender time: 371.402µs
2025-03-03T15:17:21.137066568+08:00 INFO ThreadId(01) yolo_vision: 146: Performance stats after 1 batches:
Inference: avg=1.067140058s, min=1.022416759s, max=1.098278046s
Annotation: avg=430.064528ms, min=356.098842ms, max=487.375932ms
Encoding: avg=122.90406ms, min=86.224038ms, max=157.194381ms
Batch sender time: 309.676µs
......
The most time-consuming phases are inference and annotation, at roughly 1.07s and 430ms respectively. That is far too slow for real-time push streaming. What would be a good optimization approach for this situation?
From your execution results, model inference indeed accounts for a significant share of the time, and the annotation step also takes a considerable amount. Here are several angles for analysis:
- Hardware and Model Details: What are your machine and GPU models? What is the size or parameter count of the YOLO model used for inference, and what is the batch size? Based on experience, with an RTX 3060 Ti, a batch size of 1, and an input image resolution of 640x640, the YOLOv8-m-det model's preprocessing time is approximately 1.5ms, model inference takes less than 20ms, and post-processing varies depending on the number of results and the CPU performance of the machine, typically requiring around 600µs. If the inference model is in ONNX format, you could try using the TensorRT provider and FP16 precision for further acceleration (see the first sketch after this list).
- Annotator Performance: The plot() or annotate() methods of the annotator do not implement parallel strategies and rely heavily on CPU performance. The current implementation uses the imageproc crate for rendering results, which is somewhat slow. If speed is a priority, you could consider experimenting with other crates for result rendering.
- DataLoader Considerations: After enabling hardware acceleration with an NVIDIA device, have you tested the time taken for video stream decoding and encoding? Those measurements would help analyze the Encoder's performance and the time consumed by each iteration of the for loop (a timing sketch also follows this list).
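Two sketches to go with points 1 and 3 above. First, the provider/precision switch from point 1: a hypothetical usls Options tweak for the TensorRT route. The builder names (with_model, with_trt, with_fp16) follow older usls examples and may differ in the version you are on, so treat them as assumptions and check your Options API:

let options = Options::default()
    .with_model("yolov8m-dyn.onnx")?  // hypothetical ONNX model path
    .with_trt(0)                      // TensorRT execution provider on GPU 0
    .with_fp16(true);                 // FP16 precision

Second, for point 3, a small self-contained timing helper using only the standard library; in the real loop you would wrap model.forward, annotator.plot, and encoder.encode_raw in it to get per-stage numbers (the sleeps below merely stand in for real work):

use std::thread::sleep;
use std::time::{Duration, Instant};

// Time one pipeline stage and print how long it took.
fn timed<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    println!("{label}: {:?}", start.elapsed());
    out
}

fn main() {
    // In the real pipeline: timed("inference", || model.forward(&xs)), etc.
    timed("inference", || sleep(Duration::from_millis(5)));
    timed("annotation", || sleep(Duration::from_millis(2)));
}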
I have been traveling on business recently and do not have access to a computer to test the whole pipeline and the rsmedia crate. Sorry for not being able to respond to your questions in a timely manner. You are welcome to leave further comments for discussion, and I will reply as soon as I see them.
I tried the rsmedia project, and it seems that there are some issues with the ffmpeg6 features.
Compiling rsmpeg v0.15.1+ffmpeg.7.0 (https://github.com/phial3/rsmpeg?branch=light#13f8c554)
error[E0605]: non-primitive cast: `unsafe extern "C" fn(*mut c_void, *const u8, i32) -> i32 {write_c}` as `unsafe extern "C" fn(*mut c_void, *mut u8, i32) -> i32`
--> /home/qweasd/.cargo/git/checkouts/rsmpeg-6e0a08a626b70a61/13f8c55/src/avformat/avio.rs:148:50
|
148 | write_packet.is_some().then_some(write_c as _),
| ^^^^^^^^^^^^ invalid cast
For more information about this error, try `rustc --explain E0605`.
error: could not compile `rsmpeg` (lib) due to 1 previous error
I see that this project is under rapid development. I will keep following it and wait for further testing. @phial3
The default features are ["ffmpeg7", "ndarray"].
If you use FFmpeg version 7.x, the default is fine, like this:
rsmedia = { git = "https://github.com/phial3/rsmedia", branch = "rsmpeg" }
If you use FFmpeg version 6.x, the default features need to be disabled, like this:
rsmedia = { git = "https://github.com/phial3/rsmedia", branch = "rsmpeg", default-features = false, features = ["ffmpeg6", "ndarray"] }
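One caveat: Cargo unifies features across the whole dependency graph, so if any other crate in your tree also depends on rsmedia with default features enabled, the ffmpeg7 feature will be switched back on; the default-features = false override only takes effect if it holds everywhere rsmedia is declared.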