[Performance]: High overhead latency for ov::InferRequest::infer()
OpenVINO Version
2024.0.0
Operating System
Ubuntu 20.04 (LTS)
Device used for inference
CPU
OpenVINO installation
Build from source
Programming Language
C++
Hardware Architecture
x86 (64 bits)
Model used
ReduceSum
Model quantization
No
Target Platform
~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 25
Model: 33
Model name: AMD Ryzen 9 5950X 16-Core Processor
Stepping: 0
Frequency boost: enabled
CPU MHz: 2200.000
CPU max MHz: 3400.0000
CPU min MHz: 2200.0000
BogoMIPS: 6787.36
Virtualization: AMD-V
L1d cache: 512 KiB
L1i cache: 512 KiB
L2 cache: 8 MiB
L3 cache: 64 MiB
NUMA node0 CPU(s): 0-31
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
~$ lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ
0 0 0 0 0:0:0:0 yes 3400.0000 2200.0000
1 0 0 1 1:1:1:0 yes 3400.0000 2200.0000
2 0 0 2 2:2:2:0 yes 3400.0000 2200.0000
3 0 0 3 3:3:3:0 yes 3400.0000 2200.0000
4 0 0 4 4:4:4:0 yes 3400.0000 2200.0000
5 0 0 5 5:5:5:0 yes 3400.0000 2200.0000
6 0 0 6 6:6:6:0 yes 3400.0000 2200.0000
7 0 0 7 7:7:7:0 yes 3400.0000 2200.0000
8 0 0 8 8:8:8:1 yes 3400.0000 2200.0000
9 0 0 9 9:9:9:1 yes 3400.0000 2200.0000
10 0 0 10 10:10:10:1 yes 3400.0000 2200.0000
11 0 0 11 11:11:11:1 yes 3400.0000 2200.0000
12 0 0 12 12:12:12:1 yes 3400.0000 2200.0000
13 0 0 13 13:13:13:1 yes 3400.0000 2200.0000
14 0 0 14 14:14:14:1 yes 3400.0000 2200.0000
15 0 0 15 15:15:15:1 yes 3400.0000 2200.0000
16 0 0 0 0:0:0:0 yes 3400.0000 2200.0000
17 0 0 1 1:1:1:0 yes 3400.0000 2200.0000
18 0 0 2 2:2:2:0 yes 3400.0000 2200.0000
19 0 0 3 3:3:3:0 yes 3400.0000 2200.0000
20 0 0 4 4:4:4:0 yes 3400.0000 2200.0000
21 0 0 5 5:5:5:0 yes 3400.0000 2200.0000
22 0 0 6 6:6:6:0 yes 3400.0000 2200.0000
23 0 0 7 7:7:7:0 yes 3400.0000 2200.0000
24 0 0 8 8:8:8:1 yes 3400.0000 2200.0000
25 0 0 9 9:9:9:1 yes 3400.0000 2200.0000
26 0 0 10 10:10:10:1 yes 3400.0000 2200.0000
27 0 0 11 11:11:11:1 yes 3400.0000 2200.0000
28 0 0 12 12:12:12:1 yes 3400.0000 2200.0000
29 0 0 13 13:13:13:1 yes 3400.0000 2200.0000
30 0 0 14 14:14:14:1 yes 3400.0000 2200.0000
31 0 0 15 15:15:15:1 yes 3400.0000 2200.0000
Performance issue description
Clarification: I did not install from source. I installed with apt from https://apt.repos.intel.com/openvino/2024, but APT is not a dropdown option under "OpenVINO installation".
I tested inference speed for a 2-element ReduceSum in C++. One call to ov::InferRequest::infer() takes about 28k cycles, which is roughly 8µs on my machine, while the ReduceSum computation itself should take only a single-digit number of cycles. This level of latency overhead is not suitable for low-latency applications. Is this intended? Am I missing something? If not, please improve the latency overhead.
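For context, here is the model's entire payload written out as plain C++ (my sketch, not part of the repro); essentially everything else in the measured 28k cycles is framework overhead:

```cpp
// The whole computation the model performs: summing two floats.
// After inlining this is a single addss instruction, i.e. a few cycles,
// versus ~28k cycles (~8µs at this machine's TSC rate) per infer().
float reduce_sum_2(const float x[2]) {
    return x[0] + x[1];
}
```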
Step-by-step reproduction
- Copy make-reducesum.py and openvino-slow.cpp to a new directory and cd to it.
- Generate the ReduceSum ONNX model by running `make-reducesum.py` with the necessary pip packages installed, including onnx and openvino. (The output files get written to /tmp.)
- Compile openvino-slow.cpp with `g++ -O3 -o openvino-slow openvino-slow.cpp -lopenvino`.
- Run the resulting executable with `taskset -c 3 ./openvino-slow`. The `taskset -c 3` pins the thread to Core 3 to avoid kernel-initiated core migration that can hurt performance; this assumes Core 3 is relatively idle on the system.
- See the large number of clock cycles reported in the last output line.
make-reducesum.py
#!/bin/env python
import numpy as np
import onnx
from onnx import helper, TensorProto
import openvino as ov

def main():
    graph = helper.make_graph(
        name='SimpleModel',
        nodes=[helper.make_node("ReduceSum", inputs=['x'], name='y', outputs=['y'])],
        inputs=[
            helper.make_tensor_value_info('x', TensorProto.FLOAT, [2]),
        ],
        outputs=[
            helper.make_tensor_value_info('y', TensorProto.FLOAT, None),
        ]
    )
    onnx_model = helper.make_model(graph, producer_name='simple_model')
    onnx.save(onnx_model, '/tmp/reducesum.onnx')
    ov_model = ov.convert_model('/tmp/reducesum.onnx')
    ov.save_model(ov_model, '/tmp/reducesum.xml')

main()
openvino-slow.cpp
#include <cstdint>
#include <iostream>
#include <openvino/openvino.hpp>

using namespace std;

// Serialized timestamp read for the start of a measured region:
// CPUID acts as a barrier so earlier instructions retire before RDTSC.
inline uint64_t __attribute__((always_inline)) rdtscAcq(void) {
    unsigned cyclesLow, cyclesHigh;
    asm volatile(
        "CPUID\n\t"
        "RDTSC\n\t"
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        : "=r"(cyclesHigh), "=r"(cyclesLow)::"%rax", "%rbx", "%rcx", "%rdx");
    return ((uint64_t)cyclesHigh << 32) | cyclesLow;
}

// Serialized timestamp read for the end of a measured region:
// RDTSCP waits for prior instructions to execute; the trailing CPUID
// prevents later instructions from being reordered before the read.
inline uint64_t __attribute__((always_inline)) rdtscRel(void) {
    unsigned cyclesLow, cyclesHigh;
    asm volatile(
        "RDTSCP\n\t"
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        "CPUID\n\t"
        : "=r"(cyclesHigh), "=r"(cyclesLow)::"%rax", "%rbx", "%rcx", "%rdx");
    return ((uint64_t)cyclesHigh << 32) | cyclesLow;
}

int main() {
    ov::Core core;
    string model_path = "/tmp/reducesum.xml";
    shared_ptr<ov::Model> model = core.read_model(model_path);

    auto ops = model->get_ordered_ops();
    cout << "OPS: ";
    for (auto const& op : ops)
        cout << ' ' << op->description();
    cout << '\n';

    for (auto const& input : model->inputs()) {
        cout << "INPUT";
        for (auto const& name : input.get_names())
            cout << ' ' << name;
        cout << " (";
        bool first = true;
        for (auto dim : input.get_shape()) {
            if (first)
                cout << dim;
            else
                cout << ", " << dim;
            first = false;
        }
        cout << ")\n";
    }
    for (auto const& output : model->outputs()) {
        cout << "OUTPUT";
        for (auto const& name : output.get_names())
            cout << ' ' << name;
        cout << " (";
        bool first = true;
        for (auto dim : output.get_shape()) {
            if (first)
                cout << dim;
            else
                cout << ", " << dim;
            first = false;
        }
        cout << ")\n";
    }

    // Single stream, single thread, latency hint: the configuration that
    // should minimize per-inference overhead for synchronous use.
    ov::CompiledModel compiled_model = core.compile_model(
        model,
        ov::inference_num_threads(1),
        ov::num_streams(1),
        ov::enable_profiling(false),
        ov::hint::performance_mode(ov::hint::PerformanceMode::LATENCY)
    );
    {
        cout << "COMPILED MODEL PROPS\n";
        auto supported_properties = compiled_model.get_property(ov::supported_properties);
        for (auto&& property : supported_properties) {
            if (property != ov::supported_properties.name()) {
                cout << "\t\t" << (property.is_mutable() ? "Mutable: " : "Immutable: ") << property << " : ";
                auto props = compiled_model.get_property(property);
                props.print(cout);
                cout << '\n';
            }
        }
    }

    ov::InferRequest req = compiled_model.create_infer_request();
    // Warmup.
    for (int i = 0; i < 10; ++i)
        req.infer();

    auto startTime = rdtscAcq();
    constexpr static int kNumIters = 2000;
    for (int i = 0; i < kNumIters; ++i)
        req.infer();
    auto elapsedCyclesPerIter = (rdtscRel() - startTime) / kNumIters;
    cout << elapsedCyclesPerIter << " clock cycles per iteration\n";
    return 0;
}
Issue submission checklist
- [X] I'm reporting a performance issue. It's not a question.
- [X] I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
- [X] There is reproducer code and related data files such as images, videos, models, etc.
When I compile and link against my own source-built OpenVINO with THREADING=SEQ, the cycle count decreases from 28k to 24k:
cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_SYSTEM_OPENCL=OFF -DTHREADING=SEQ --install-prefix /opt/openvino -G Ninja ..
Using my own source-built OpenVINO with THREADING=TBB, the cycle count is still 28k:
cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_SYSTEM_OPENCL=OFF -DTHREADING=TBB --install-prefix /opt/openvino -G Ninja ..
I wonder if OpenVINO makes liberal use of thread synchronization primitives like locks and queues even in single-threaded synchronous use cases; the profiling sketch below is one way to check.
In contrast, a similar setup with onnxruntime using the same model takes 2.8k cycles per inference. That is still high, but much better.
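One way to probe where those cycles go (a sketch of mine, not part of the original report): rebuild with ov::enable_profiling(true) and compare the summed per-node times from ov::InferRequest::get_profiling_info() against the end-to-end infer() time; whatever is left over is dispatch and synchronization overhead rather than node execution.

```cpp
#include <chrono>
#include <iostream>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("/tmp/reducesum.xml");
    // Same model as the reproducer, but compiled with profiling enabled.
    auto compiled = core.compile_model(model, "CPU", ov::enable_profiling(true));
    ov::InferRequest req = compiled.create_infer_request();
    req.infer();
    std::chrono::microseconds node_total{0};
    for (const auto& pi : req.get_profiling_info()) {
        std::cout << pi.node_name << " (" << pi.node_type << ", " << pi.exec_type
                  << "): " << pi.real_time.count() << " us\n";
        node_total += pi.real_time;
    }
    // Anything beyond this in the end-to-end infer() time is framework overhead.
    std::cout << "Sum of node times: " << node_total.count() << " us\n";
}
```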
@jchia We don't have this kind of CPU to reproduce this issue. @wangleis Could you help to answer this question? Thanks!
Hi @jchia, when the app uses 1 stream, 1 thread, and the sync inference API, both THREADING=SEQ and THREADING=TBB will use the main thread for inference.
Regarding the difference between THREADING=SEQ and THREADING=TBB, I agree with your assumption. There is still room for optimization in the single-threaded synchronous use case.
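To put the suspected synchronization cost in perspective, here is a standalone micro-benchmark sketch (illustrative, not from this thread) that times an uncontended lock-plus-notify round trip, the kind of primitive a threaded runtime might touch on every infer() call:

```cpp
#include <condition_variable>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <x86intrin.h>  // __rdtsc

int main() {
    std::mutex m;
    std::condition_variable cv;
    constexpr int kIters = 100000;
    uint64_t start = __rdtsc();
    for (int i = 0; i < kIters; ++i) {
        std::unique_lock<std::mutex> lock(m);
        cv.notify_one();  // no waiter is present, so this stays in user space
    }
    std::cout << (__rdtsc() - start) / kIters << " cycles per lock+notify\n";
}
```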
@wangleis Do you also think that 24k cycles for the THREADING=SEQ case is excessive and that there is room for improvement? I don't have convenient access to a machine with an Intel processor and OpenVINO installed, but I expect the numbers there to be similar.
@wenjiew I ran the experiment on an i9-13900H on Debian 12, and the results were similar: 24k cycles for SEQ and 27k cycles for TBB. I don't think the high overhead is sensitive to the target processor, at least among fairly recent processors.
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Vendor ID: GenuineIntel
Model name: 13th Gen Intel(R) Core(TM) i9-13900H
CPU family: 6
Model: 186
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 1
Stepping: 2
CPU(s) scaling MHz: 12%
CPU max MHz: 5400.0000
CPU min MHz: 400.0000
BogoMIPS: 5990.40
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 544 KiB (14 instances)
L1i: 704 KiB (14 instances)
L2: 11.5 MiB (8 instances)
L3: 24 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-19
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Srbds: Not affected
Tsx async abort: Not affected
@jchia THREADING=SEQ is the right configuration for your case, which is 1 thread and sync inference.
May I know whether you are requesting performance optimization for THREADING=SEQ in this issue?
I'm suggesting reducing the fixed latency of the THREADING=SEQ case if it's not too complicated. This would be helpful for cases where the computation graph has relatively few operations, so the fixed overhead is a large fraction of the total time.
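As an aside, while the fixed framework cost dominates here, the application side can at least avoid per-call allocations and copies by writing directly into the buffer owned by the request. A minimal sketch (assuming the input shape stays [2]), as a drop-in change to the measured loop in openvino-slow.cpp:

```cpp
// Write inputs in place through the tensor owned by the infer request,
// so each iteration adds no allocation or copy on the application side.
ov::InferRequest req = compiled_model.create_infer_request();
ov::Tensor input = req.get_input_tensor();  // backed by the request itself
float* data = input.data<float>();          // direct pointer to its buffer
for (int i = 0; i < kNumIters; ++i) {
    data[0] = 1.0f;   // refresh inputs in place
    data[1] = 2.0f;
    req.infer();      // synchronous inference reads the same buffer
}
```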
@jchia Thanks for your suggestion. We have created internal ticket CVS-141653 to track this input.
Since the feature may not be completed soon, we will close this ticket if there are no other topics. Thanks!