onnxruntime
Performance is poor when running ONNX Runtime C++ on an Intel CPU
I have two ONNX Runtime sessions running on an Intel CPU: (1) at first the total time is 200 ms; (2) after many test runs, it slows to 10 s; (3) after several minutes of doing nothing, it is back to 200 ms.
Why does it change so much? Thanks! I have already tried: (1) the multithread options, and (2) `session_options.AddConfigEntry("session.set_denormal_as_zero", "1");`.
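For reference, `session.set_denormal_as_zero` enables flush-to-zero / denormals-are-zero (FTZ/DAZ) on ORT's intra-op threads; denormal floats are handled by a slow microcoded path on Intel CPUs, which can cause exactly this kind of intermittent slowdown. A minimal x86 sketch of what the flag does (the `tiny_quotient` helper is illustrative, not part of ORT):

```cpp
#include <cfloat>
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE (SSE3)
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE (SSE)

// Compute FLT_MIN / 4 (a subnormal value) with or without FTZ/DAZ enabled
// on the current thread. With FTZ on, the hardware flushes the subnormal
// result to 0 instead of taking the slow path that handles denormals.
float tiny_quotient(bool flush_to_zero) {
    _MM_SET_FLUSH_ZERO_MODE(flush_to_zero ? _MM_FLUSH_ZERO_ON
                                          : _MM_FLUSH_ZERO_OFF);
    _MM_SET_DENORMALS_ZERO_MODE(flush_to_zero ? _MM_DENORMALS_ZERO_ON
                                              : _MM_DENORMALS_ZERO_OFF);
    volatile float tiny = FLT_MIN;        // smallest normal float
    volatile float result = tiny / 4.0f;  // subnormal unless flushed
    return result;
}
```

If a model's activations decay toward zero, its matmuls/convolutions can end up full of denormals and slow down dramatically without this setting.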
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): centos 7
- ONNX Runtime installed from (source or binary): binary
- ONNX Runtime version: C++ 12.0
- Python version: --
- Visual Studio version (if applicable): ---
- GCC/Compiler version (if compiling from source): --
- CUDA/cuDNN version: --
- GPU model and memory: ----
To Reproduce
(1) At first the total time is 200 ms; (2) after many test runs, it slows to 10 s; (3) after several minutes of doing nothing, it is back to 200 ms.
Expected behavior
The first and later runs should cost the same amount of time.
An ONNX Runtime session will never have its first cold run exhibit the same performance as later runs; you always need a couple of warmup runs after the session is first created.
After you stop the activity, the CPU caches grow cold, but they recover quickly. Do you have a real-time scenario where incoming requests depend on user activity? We have work to do in this area, but ONNX Runtime was originally optimized for continuous processing, so no suggestion will fully produce the desired results at this time.
A few things to try, depending on your model:

- Since you are running on CPU, disable the memory arena; it does not help in CPU scenarios.

  ```cpp
  Ort::SessionOptions sessionOptions;
  sessionOptions.DisableCpuMemArena();
  ```

- Play with the number of intra-op threads in the session options and see what gives you the best performance, using

  ```cpp
  sessionOptions.SetIntraOpNumThreads(options.IntraThreadCount);
  ```

- Try overriding the default allocator with mimalloc. You can use `LD_PRELOAD` for a quick try.
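Putting the CPU-side suggestions together, a minimal sketch of the session setup (the app name, model path, and thread count are placeholders, assuming the standard C++ API):

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "perf-test");
    Ort::SessionOptions sessionOptions;
    sessionOptions.DisableCpuMemArena();     // arena does not help on CPU
    sessionOptions.SetIntraOpNumThreads(4);  // tune for your machine
    Ort::Session session(env, "model.onnx", sessionOptions);
    // ... run inference ...
}
```

The mimalloc experiment needs no code change: `LD_PRELOAD=/path/to/libmimalloc.so ./your_app` swaps the allocator for a quick test.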
I have tried all three of these actions:

1. `sessionOptions.DisableCpuMemArena();`
2. `sessionOptions.SetIntraOpNumThreads(options.IntraThreadCount);`
3. `LD_PRELOAD` with mimalloc

Sorry, but the performance is the same as before.
Please provide the full code to reproduce, and show how you are measuring performance. As you say you have two onnxruntime sessions, it's not clear how/when you are creating them.
In the initial cycles the time consumption is relatively small, but later it becomes very large. For example: for the first several loops (each including step1 and step2) in the main function, total3 is about 200 ms, but after 10–20 s of looping each iteration costs 10 s or even more. I found that total1 or total2 consumes most of the time.
```cpp
#include <onnxruntime/core/session/experimental_onnxruntime_cxx_api.h>

// Setup (member declarations elided):
thread_pool1 = std::make_unique<ThreadPool>(1);
thread_pool2 = std::make_unique<ThreadPool>(1);

session_options.AddConfigEntry("session.set_denormal_as_zero", "1");
session_options.DisableCpuMemArena();
session_options.SetIntraOpNumThreads(4);
session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
session1 = new Ort::Experimental::Session(env, model_path1, session_options);
session2 = new Ort::Experimental::Session(env, model_path2, session_options);

std::vector<float> step1() {
  auto task = [&, this] {
    Timer t1;
    auto ort_outputs = session1->Run(session1->GetInputNames(), input, output_names);
    Timer t2;
    total1 = t2 - t1;
    std::cout << "total1 use time: " << total1 << std::endl;
  };
  auto result = thread_pool1->enqueue(task);
  return result.get();
}

std::vector<float> step2() {
  auto task = [&, this] {
    Timer t1;
    auto ort_outputs = session2->Run(session2->GetInputNames(), input, output_names);
    Timer t2;
    total2 = t2 - t1;
    std::cout << "total2 use time: " << total2 << std::endl;
  };
  auto result = thread_pool2->enqueue(task);
  return result.get();
}

int main() {
  for (const auto &image : images) {
    Timer t1;
    auto image1 = step1(image);
    auto ret1 = step2(image1);
    Timer t2;
    total3 = t2 - t1;
    std::cout << "total3 use time: " << total3 << std::endl;
  }
}
```
@skottmckay
It would be best to measure the ORT performance separately, with no thread pools and without the inline call to `GetInputNames()`. That way you're just measuring the cost of the `Run` call and not all the other things going on.
Send one warmup query to each inference session, and measure performance for the following calls.
Also, it's not clear what `Timer` is. Is that a high-resolution timer or not? https://en.cppreference.com/w/cpp/chrono/high_resolution_clock/now would be preferable.
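A minimal warmup-and-measure harness along those lines (the `time_calls` helper and its parameters are illustrative; pass a lambda that wraps the session's `Run`):

```cpp
#include <chrono>
#include <vector>

// Run a callable a few times to warm caches/allocators, then time each
// subsequent call with std::chrono::high_resolution_clock.
// Returns per-call durations in milliseconds.
template <typename F>
std::vector<double> time_calls(F&& run_once, int warmup, int iters) {
    using clock = std::chrono::high_resolution_clock;
    for (int i = 0; i < warmup; ++i) run_once();  // discard cold-start runs
    std::vector<double> ms;
    ms.reserve(iters);
    for (int i = 0; i < iters; ++i) {
        auto t0 = clock::now();
        run_once();
        auto t1 = clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return ms;
}
```

For example, `auto ms = time_calls([&]{ session1->Run(input_names, input, output_names); }, /*warmup=*/3, /*iters=*/100);`, then look at the whole distribution rather than a single sample.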
Over a long running time, memory and CPU usage do not change much; they are basically the same as before.

Yes, I used `high_resolution_clock::now()`.

I measured the ORT performance separately with no thread pools; the time is the same as before. @skottmckay
Can you give me more suggestions? Thank you!
By gdb:

```
0x00007efd417a01a9 in onnxruntime::concurrency::ThreadPool::RunInParallel(std::function<void (unsigned int)>, unsigned int) ()
   from /usr/local/lib64/libonnxruntime.so.1.6.0
Missing separate debuginfos, use: debuginfo-install libgcc-4.8.5-44.el7.x86_64 libstdc++-4.8.5-44.el7.x86_64 libuuid-2.23.2-65.el7.x86_64
(gdb) bt
#0  0x00007efd417a01a9 in onnxruntime::concurrency::ThreadPool::RunInParallel(std::function<void (unsigned int)>, unsigned int) () from /usr/local/lib64/libonnxruntime.so.1.6.0
#1  0x00007efd417a05ce in onnxruntime::concurrency::ThreadPool::ParallelForFixedBlockSizeScheduling(long, long, std::function<void (long, long)> const&) () from /usr/local/lib64/libonnxruntime.so.1.6.0
#2  0x00007efd417a06a5 in onnxruntime::concurrency::ThreadPool::SimpleParallelFor(long, std::function<void (long)> const&) () from /usr/local/lib64/libonnxruntime.so.1.6.0
#3  0x00007efd417ef558 in MlasExecuteThreaded(void (*)(void*, int), void*, int, onnxruntime::concurrency::ThreadPool*) () from /usr/local/lib64/libonnxruntime.so.1.6.0
#4  0x00007efd417b98fc in MlasNchwcConv(long const*, long const*, long const*, long const*, long const*, long const*, unsigned long, float const*, float const*, float const*, float*, MLAS_ACTIVATION const*, bool, onnxruntime::concurrency::ThreadPool*) () from /usr/local/lib64/libonnxruntime.so.1.6.0
```