[Performance]: High overhead latency for ov::InferRequest::infer()
OpenVINO Version
2024.0.0
Operating System
Ubuntu 20.04 (LTS)
Device used for inference
CPU
OpenVINO installation
Build from source
Programming Language
C++
Hardware Architecture
x86 (64 bits)
Model used
ReduceSum
Model quantization
No
Target Platform
~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 25
Model: 33
Model name: AMD Ryzen 9 5950X 16-Core Processor
Stepping: 0
Frequency boost: enabled
CPU MHz: 2200.000
CPU max MHz: 3400.0000
CPU min MHz: 2200.0000
BogoMIPS: 6787.36
Virtualization: AMD-V
L1d cache: 512 KiB
L1i cache: 512 KiB
L2 cache: 8 MiB
L3 cache: 64 MiB
NUMA node0 CPU(s): 0-31
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
~$ lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ
0 0 0 0 0:0:0:0 yes 3400.0000 2200.0000
1 0 0 1 1:1:1:0 yes 3400.0000 2200.0000
2 0 0 2 2:2:2:0 yes 3400.0000 2200.0000
3 0 0 3 3:3:3:0 yes 3400.0000 2200.0000
4 0 0 4 4:4:4:0 yes 3400.0000 2200.0000
5 0 0 5 5:5:5:0 yes 3400.0000 2200.0000
6 0 0 6 6:6:6:0 yes 3400.0000 2200.0000
7 0 0 7 7:7:7:0 yes 3400.0000 2200.0000
8 0 0 8 8:8:8:1 yes 3400.0000 2200.0000
9 0 0 9 9:9:9:1 yes 3400.0000 2200.0000
10 0 0 10 10:10:10:1 yes 3400.0000 2200.0000
11 0 0 11 11:11:11:1 yes 3400.0000 2200.0000
12 0 0 12 12:12:12:1 yes 3400.0000 2200.0000
13 0 0 13 13:13:13:1 yes 3400.0000 2200.0000
14 0 0 14 14:14:14:1 yes 3400.0000 2200.0000
15 0 0 15 15:15:15:1 yes 3400.0000 2200.0000
16 0 0 0 0:0:0:0 yes 3400.0000 2200.0000
17 0 0 1 1:1:1:0 yes 3400.0000 2200.0000
18 0 0 2 2:2:2:0 yes 3400.0000 2200.0000
19 0 0 3 3:3:3:0 yes 3400.0000 2200.0000
20 0 0 4 4:4:4:0 yes 3400.0000 2200.0000
21 0 0 5 5:5:5:0 yes 3400.0000 2200.0000
22 0 0 6 6:6:6:0 yes 3400.0000 2200.0000
23 0 0 7 7:7:7:0 yes 3400.0000 2200.0000
24 0 0 8 8:8:8:1 yes 3400.0000 2200.0000
25 0 0 9 9:9:9:1 yes 3400.0000 2200.0000
26 0 0 10 10:10:10:1 yes 3400.0000 2200.0000
27 0 0 11 11:11:11:1 yes 3400.0000 2200.0000
28 0 0 12 12:12:12:1 yes 3400.0000 2200.0000
29 0 0 13 13:13:13:1 yes 3400.0000 2200.0000
30 0 0 14 14:14:14:1 yes 3400.0000 2200.0000
31 0 0 15 15:15:15:1 yes 3400.0000 2200.0000
Performance issue description
Clarification: I did not install from source. I installed with apt from https://apt.repos.intel.com/openvino/2024, but APT is not a dropdown option under "OpenVINO installation".
I tested inference speed for a 2-element ReduceSum in C++. One call to ov::InferRequest::infer() takes about 28k cycles, which is roughly 8µs on my machine, while the ReduceSum computation itself should take only a single-digit number of cycles. This level of latency overhead is not suitable for low-latency applications. Is this intended? Am I missing something? If not, please improve the latency overhead.
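For context, here is the model's entire payload written out as plain C++ (my sketch, not part of the repro); essentially everything else in the measured 28k cycles is framework overhead:

```cpp
// The whole computation the model performs: summing two floats.
// After inlining this is a single addss instruction, i.e. a few cycles,
// versus ~28k cycles (~8µs at this machine's TSC rate) per infer().
float reduce_sum_2(const float x[2]) {
    return x[0] + x[1];
}
```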
Step-by-step reproduction
- Copy make-reducesum.py and openvino-slow.cpp to a new directory and cd to it.
- Generate the ReduceSum ONNX model by running `make-reducesum.py` with the necessary pip packages installed, including onnx and openvino. (The output files get written to /tmp.)
- Compile openvino-slow.cpp with `g++ -O3 -o openvino-slow openvino-slow.cpp -lopenvino`.
- Run the resulting executable with `taskset -c 3 ./openvino-slow`. The `taskset -c 3` pins the thread to Core 3 to avoid kernel-initiated core migration that can hurt performance; this assumes Core 3 is relatively idle on the system.
- See the large number of clock cycles reported in the last output line.
make-reducesum.py
#!/bin/env python
import numpy as np
import onnx
from onnx import helper, TensorProto
import openvino as ov

def main():
    graph = helper.make_graph(
        name='SimpleModel',
        nodes=[helper.make_node("ReduceSum", inputs=['x'], name='y', outputs=['y'])],
        inputs=[
            helper.make_tensor_value_info('x', TensorProto.FLOAT, [2]),
        ],
        outputs=[
            helper.make_tensor_value_info('y', TensorProto.FLOAT, None),
        ]
    )
    onnx_model = helper.make_model(graph, producer_name='simple_model')
    onnx.save(onnx_model, '/tmp/reducesum.onnx')
    ov_model = ov.convert_model('/tmp/reducesum.onnx')
    ov.save_model(ov_model, '/tmp/reducesum.xml')

main()
openvino-slow.cpp
#include <cstdint>
#include <iostream>
#include <openvino/openvino.hpp>

using namespace std;

// Serialized timestamp read for the start of a measured region:
// CPUID acts as a barrier so earlier instructions retire before RDTSC.
inline uint64_t __attribute__((always_inline)) rdtscAcq(void) {
    unsigned cyclesLow, cyclesHigh;
    asm volatile(
        "CPUID\n\t"
        "RDTSC\n\t"
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        : "=r"(cyclesHigh), "=r"(cyclesLow)::"%rax", "%rbx", "%rcx", "%rdx");
    return ((uint64_t)cyclesHigh << 32) | cyclesLow;
}

// Serialized timestamp read for the end of a measured region:
// RDTSCP waits for prior instructions to execute; the trailing CPUID
// prevents later instructions from being reordered before the read.
inline uint64_t __attribute__((always_inline)) rdtscRel(void) {
    unsigned cyclesLow, cyclesHigh;
    asm volatile(
        "RDTSCP\n\t"
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        "CPUID\n\t"
        : "=r"(cyclesHigh), "=r"(cyclesLow)::"%rax", "%rbx", "%rcx", "%rdx");
    return ((uint64_t)cyclesHigh << 32) | cyclesLow;
}

int main() {
    ov::Core core;
    string model_path = "/tmp/reducesum.xml";
    shared_ptr<ov::Model> model = core.read_model(model_path);

    auto ops = model->get_ordered_ops();
    cout << "OPS: ";
    for (auto const& op : ops)
        cout << ' ' << op->description();
    cout << '\n';

    for (auto const& input : model->inputs()) {
        cout << "INPUT";
        for (auto const& name : input.get_names())
            cout << ' ' << name;
        cout << " (";
        bool first = true;
        for (auto dim : input.get_shape()) {
            if (first)
                cout << dim;
            else
                cout << ", " << dim;
            first = false;
        }
        cout << ")\n";
    }
    for (auto const& output : model->outputs()) {
        cout << "OUTPUT";
        for (auto const& name : output.get_names())
            cout << ' ' << name;
        cout << " (";
        bool first = true;
        for (auto dim : output.get_shape()) {
            if (first)
                cout << dim;
            else
                cout << ", " << dim;
            first = false;
        }
        cout << ")\n";
    }

    // Single stream, single thread, latency hint: the configuration that
    // should minimize per-inference overhead for synchronous use.
    ov::CompiledModel compiled_model = core.compile_model(
        model,
        ov::inference_num_threads(1),
        ov::num_streams(1),
        ov::enable_profiling(false),
        ov::hint::performance_mode(ov::hint::PerformanceMode::LATENCY)
    );
    {
        cout << "COMPILED MODEL PROPS\n";
        auto supported_properties = compiled_model.get_property(ov::supported_properties);
        for (auto&& property : supported_properties) {
            if (property != ov::supported_properties.name()) {
                cout << "\t\t" << (property.is_mutable() ? "Mutable: " : "Immutable: ") << property << " : ";
                auto props = compiled_model.get_property(property);
                props.print(cout);
                cout << '\n';
            }
        }
    }

    ov::InferRequest req = compiled_model.create_infer_request();
    // Warmup.
    for (int i = 0; i < 10; ++i)
        req.infer();

    auto startTime = rdtscAcq();
    constexpr static int kNumIters = 2000;
    for (int i = 0; i < kNumIters; ++i)
        req.infer();
    auto elapsedCyclesPerIter = (rdtscRel() - startTime) / kNumIters;
    cout << elapsedCyclesPerIter << " clock cycles per iteration\n";
    return 0;
}
Issue submission checklist
- [X] I'm reporting a performance issue. It's not a question.
- [X] I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
- [X] There is reproducer code and related data files such as images, videos, models, etc.
When I compile and link against my own source-built OpenVINO with THREADING=SEQ, the cycle count decreases from 28k to 24k:
cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_SYSTEM_OPENCL=OFF -DTHREADING=SEQ --install-prefix /opt/openvino -G Ninja ..
Using my own source-built OpenVINO with THREADING=TBB, the cycle count is still 28k:
cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_SYSTEM_OPENCL=OFF -DTHREADING=TBB --install-prefix /opt/openvino -G Ninja ..
I wonder if OpenVINO makes liberal use of thread synchronization primitives like locks and queues even in single-threaded synchronous use cases; the profiling sketch below is one way to check.
In contrast, a similar setup with onnxruntime using the same model takes 2.8k cycles per inference. That is still high, but much better.
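One way to probe where those cycles go (a sketch of mine, not part of the original report): rebuild with ov::enable_profiling(true) and compare the summed per-node times from ov::InferRequest::get_profiling_info() against the end-to-end infer() time; whatever is left over is dispatch and synchronization overhead rather than node execution.

```cpp
#include <chrono>
#include <iostream>
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("/tmp/reducesum.xml");
    // Same model as the reproducer, but compiled with profiling enabled.
    auto compiled = core.compile_model(model, "CPU", ov::enable_profiling(true));
    ov::InferRequest req = compiled.create_infer_request();
    req.infer();
    std::chrono::microseconds node_total{0};
    for (const auto& pi : req.get_profiling_info()) {
        std::cout << pi.node_name << " (" << pi.node_type << ", " << pi.exec_type
                  << "): " << pi.real_time.count() << " us\n";
        node_total += pi.real_time;
    }
    // Anything beyond this in the end-to-end infer() time is framework overhead.
    std::cout << "Sum of node times: " << node_total.count() << " us\n";
}
```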
@jchia We don't have this kind of CPU to reproduce this issue. @wangleis Could you help to answer this question? Thanks!
Hi @jchia, when the app uses 1 stream, 1 thread, and the sync inference API, both THREADING=SEQ and THREADING=TBB will use the main thread for inference.
Regarding the difference between THREADING=SEQ and THREADING=TBB, I agree with your assumption. There is still room for optimization in the single-threaded synchronous use case.
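To put the suspected synchronization cost in perspective, here is a standalone micro-benchmark sketch (illustrative, not from this thread) that times an uncontended lock-plus-notify round trip, the kind of primitive a threaded runtime might touch on every infer() call:

```cpp
#include <condition_variable>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <x86intrin.h>  // __rdtsc

int main() {
    std::mutex m;
    std::condition_variable cv;
    constexpr int kIters = 100000;
    uint64_t start = __rdtsc();
    for (int i = 0; i < kIters; ++i) {
        std::unique_lock<std::mutex> lock(m);
        cv.notify_one();  // no waiter is present, so this stays in user space
    }
    std::cout << (__rdtsc() - start) / kIters << " cycles per lock+notify\n";
}
```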
@wangleis Do you also think that 24k cycles for the THREADING=SEQ case is excessive and that there is room for improvement? I don't have convenient access to a machine with an Intel processor and OpenVINO installed, but I expect the numbers there to be similar.
@wenjiew I ran the experiment on an i9-13900H on Debian 12, and the results were similar: 24k cycles for SEQ and 27k cycles for TBB. I don't think the high overhead is sensitive to the target processor, at least among fairly recent processors.
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Vendor ID: GenuineIntel
Model name: 13th Gen Intel(R) Core(TM) i9-13900H
CPU family: 6
Model: 186
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 1
Stepping: 2
CPU(s) scaling MHz: 12%
CPU max MHz: 5400.0000
CPU min MHz: 400.0000
BogoMIPS: 5990.40
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 544 KiB (14 instances)
L1i: 704 KiB (14 instances)
L2: 11.5 MiB (8 instances)
L3: 24 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-19
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Srbds: Not affected
Tsx async abort: Not affected
@jchia THREADING=SEQ is the right configuration for your case, which is 1 thread and sync inference.
May I know whether you are requesting performance optimization for THREADING=SEQ in this issue?
I'm suggesting reducing the fixed latency of the THREADING=SEQ case if it's not too complicated. This would be helpful for cases where the computation graph has relatively few operations, so the fixed overhead is a large fraction of the total time.
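As an aside, while the fixed framework cost dominates here, the application side can at least avoid per-call allocations and copies by writing directly into the buffer owned by the request. A minimal sketch (assuming the input shape stays [2]), as a drop-in change to the measured loop in openvino-slow.cpp:

```cpp
// Write inputs in place through the tensor owned by the infer request,
// so each iteration adds no allocation or copy on the application side.
ov::InferRequest req = compiled_model.create_infer_request();
ov::Tensor input = req.get_input_tensor();  // backed by the request itself
float* data = input.data<float>();          // direct pointer to its buffer
for (int i = 0; i < kNumIters; ++i) {
    data[0] = 1.0f;   // refresh inputs in place
    data[1] = 2.0f;
    req.infer();      // synchronous inference reads the same buffer
}
```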
@jchia Thanks for your suggestion. We have created internal ticket CVS-141653 to track this input.
Since the feature may not be completed soon, we will close this ticket if there are no other topics. Thanks!