
Only 4 threads seem to be used on an 8-thread machine.

expenses opened this issue 2 years ago • 22 comments

Hi! This is a really cool piece of work, seems to run approx 2x faster than a native Torch CPU implementation. I did notice that it only uses 4 of the 8 threads on my machine though. I'm new to openvino; is there a way to configure how many threads are used?

expenses avatar Aug 30 '22 11:08 expenses

@expenses hey, did you try to set OMP_NUM_THREADS variable? Something like

export OMP_NUM_THREADS=8
python stable_diffusion.py ...

bes-dev avatar Aug 30 '22 11:08 bes-dev

@expenses hey, did you try to set OMP_NUM_THREADS variable?

Hmm, doing that doesn't help either. Perhaps there's a hardware reason why I can't use 8 cores for this? I am on a laptop (with a 11th Gen Intel i7-1165G7)

expenses avatar Aug 30 '22 11:08 expenses

@expenses also you can try CPU_THREADS_NUM or CPU_THROUGHPUT_STREAMS variables 🤔 I'm not sure that this is the hardware problem. In my opinion, this is the problem on openvino side. Also, you can try to create an issue in the OpenVINO repo: https://github.com/openvinotoolkit/openvino/issues
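For what it's worth, CPU_THREADS_NUM and CPU_THROUGHPUT_STREAMS are (as far as I can tell) legacy OpenVINO CPU-plugin config keys rather than environment variables, so they would be passed as a config dict. A minimal sketch, assuming access to the script's Core object:

```python
# Hedged sketch: CPU_THREADS_NUM / CPU_THROUGHPUT_STREAMS are legacy
# OpenVINO CPU-plugin config keys (string-valued), not environment variables.
def legacy_cpu_config(num_threads, num_streams=1):
    return {
        "CPU_THREADS_NUM": str(num_threads),
        "CPU_THROUGHPUT_STREAMS": str(num_streams),
    }

try:
    from openvino.runtime import Core
    core = Core()
    core.set_property("CPU", legacy_cpu_config(8))
except Exception:
    # openvino may be missing, or a newer release may reject the legacy keys
    pass
```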

bes-dev avatar Aug 30 '22 11:08 bes-dev

Modify the .py file, add this after self.core = Core()

self.core.set_property("CPU", {"INFERENCE_NUM_THREADS": 8})

You can change 8 to match the number of cores in your system.
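If you'd rather not hard-code the count, you can derive it at runtime; a small sketch (note that os.cpu_count() counts logical CPUs, i.e. hyperthreads included):

```python
import os

# os.cpu_count() reports logical CPUs, so on an 8C/16T machine it returns 16;
# halve it if you only want one thread per physical core.
num_threads = os.cpu_count() or 4

try:
    from openvino.runtime import Core
    core = Core()
    core.set_property("CPU", {"INFERENCE_NUM_THREADS": num_threads})
except Exception:
    pass  # openvino not installed; the thread-count logic above still applies
```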

LouDou avatar Aug 31 '22 13:08 LouDou

You can change 8 to match the number of cores in your system.

This works in that all eight of my CPU cores go to 100% rather than just four of them, but it doesn't reduce my seconds per iteration at all.

benplumley avatar Aug 31 '22 14:08 benplumley

Yeah, you're right, this maybe doesn't do exactly what I thought it did, but it looked like the most likely parameter in the OpenVINO docs I could find (docs which, I must add, are pretty hard to read and which I'm not at all familiar with).

On my Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz (8C8T), using the same prompt and seed, and taking a reading after 3 iterations:

INFERENCE_NUM_THREADS: 1 = 14.50 s/it
INFERENCE_NUM_THREADS: 2 = 7.43 s/it
INFERENCE_NUM_THREADS: 3 = 5.28 s/it
INFERENCE_NUM_THREADS: 4 = 4.34 s/it
INFERENCE_NUM_THREADS: 5 = 3.89 s/it
INFERENCE_NUM_THREADS: 6 = 3.58 s/it
INFERENCE_NUM_THREADS: 7 = 3.42 s/it
INFERENCE_NUM_THREADS: 8 = 3.31 s/it

I can see with each increment that an additional core is being used, but this is clearly not scaling linearly.

If I omit the config completely, it uses all 8 cores at 3.36 s/it.

LouDou avatar Aug 31 '22 16:08 LouDou

Same test on my laptop, an Intel(R) Core(TM) i7-11800H @ 2.30GHz (8C16T):

INFERENCE_NUM_THREADS: 1 = 15.11 s/it
INFERENCE_NUM_THREADS: 2 = 7.99 s/it
INFERENCE_NUM_THREADS: 3 = 5.84 s/it
INFERENCE_NUM_THREADS: 4 = 4.43 s/it
INFERENCE_NUM_THREADS: 5 = 3.79 s/it
INFERENCE_NUM_THREADS: 6 = 3.40 s/it
INFERENCE_NUM_THREADS: 7 = 3.12 s/it
INFERENCE_NUM_THREADS: 8 = 2.84 s/it
INFERENCE_NUM_THREADS: 9 = 4.13 s/it
INFERENCE_NUM_THREADS: 10 = 3.87 s/it
INFERENCE_NUM_THREADS: 11 = 3.68 s/it
INFERENCE_NUM_THREADS: 12 = 3.29 s/it
INFERENCE_NUM_THREADS: 13 = 3.20 s/it
INFERENCE_NUM_THREADS: 14 = 3.12 s/it
INFERENCE_NUM_THREADS: 15 = 3.07 s/it
INFERENCE_NUM_THREADS: 16 = 2.89 s/it

LouDou avatar Aug 31 '22 20:08 LouDou

Google Colab:

default = 32 s/it
INFERENCE_NUM_THREADS: 0 = 30 s/it
INFERENCE_NUM_THREADS: 1 = 37 s/it
INFERENCE_NUM_THREADS: 20 = 52 s/it

breadbrowser avatar Aug 31 '22 21:08 breadbrowser

{"INFERENCE_NUM_THREADS": 16} gave me half a second of speed-up per iteration on an AMD Ryzen 7 3700X, running at 4 s/it @ 4 GHz. That's quite a lot slower than Intel, apparently, but I guess it's to be expected...

panki27 avatar Aug 31 '22 21:08 panki27

https://www.kaggle.com/code/lostgoldplayer/cpu-stable-diffusion — it takes 2 minutes for one image.

breadbrowser avatar Aug 31 '22 21:08 breadbrowser

Modify the .py file, add this after self.core = Core()

self.core.set_property("CPU", {"INFERENCE_NUM_THREADS": 8})

You can change 8 to match the number of cores in your system.

How do I do this within demo.py? Can you post an example of how to set this up?

Neowam avatar Sep 01 '22 17:09 Neowam

@Neowam no need to post the entire script...

It goes at line 29 in stable_diffusion_engine.py, as of the latest version.
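For context, here's a sketch of where the added line sits; the surrounding __init__ body is paraphrased and the real stable_diffusion_engine.py will differ:

```python
try:
    from openvino.runtime import Core
except ImportError:
    Core = None  # openvino not installed; this stays a sketch

class StableDiffusionEngine:
    # Paraphrased skeleton -- only the two Core lines matter here.
    def __init__(self, num_threads=8):
        if Core is None:
            raise RuntimeError("openvino is required to run the engine")
        self.core = Core()
        # The suggested addition, immediately after Core() is created:
        self.core.set_property("CPU", {"INFERENCE_NUM_THREADS": num_threads})
        # ... model loading etc. continues in the real file ...
```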

panki27 avatar Sep 01 '22 18:09 panki27

Around 3-3.5s/it on a 3800X. Almost as fast as running on a 5700XT with DirectML!

jdluzen avatar Sep 03 '22 02:09 jdluzen

Ryzen 5600X: default = 4.33 s/it

INFERENCE_NUM_THREADS: 12 = 3.7 s/it
INFERENCE_NUM_THREADS: 10 = 3.8 s/it
INFERENCE_NUM_THREADS: 8 = 4.22 s/it
INFERENCE_NUM_THREADS: 6 = 4.16 s/it

Love being able to run this on CPU!

trash-cant avatar Sep 03 '22 02:09 trash-cant

Intel i5-12600 (linux)

I am using: self.core.set_property("CPU", {"CPU_BIND_THREAD": "NUMA"})

It works with all compatible CPUs.
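A hedged sketch of applying that binding. As far as I can tell, CPU_BIND_THREAD is a legacy config key; newer OpenVINO releases expose thread binding through the affinity / hint properties instead (an assumption — check your installed version's docs):

```python
# "NUMA" binding pins one inference thread per physical core and skips
# hyperthread siblings. Legacy key; may be rejected by newer releases.
numa_binding = {"CPU_BIND_THREAD": "NUMA"}

try:
    from openvino.runtime import Core
    core = Core()
    core.set_property("CPU", numa_binding)
except Exception:
    pass  # openvino missing, or the legacy key is no longer accepted
```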

rncar avatar Sep 05 '22 16:09 rncar

I kind of assume that OpenVINO uses CPU features / instructions that are only available once per core.

Also, keep in mind that half of the threads are "just" hyperthreads, which leverage the fact that CPUs spend much of their time waiting for IO. NN inference, in contrast, is mostly CPU-bound: it maxes a core out, leaving no idle gaps to squeeze extra instructions into while the core waits.

fhaust avatar Sep 06 '22 13:09 fhaust

Exactly. Using self.core.set_property("CPU", {"CPU_BIND_THREAD": "NUMA"}) uses all the physical cores but not the hyperthreaded ones, and after some tests with the other threading options it gives me the maximum speed, around 3.30 s/it.

For some tasks hyperthreading is not useful.

Maybe it should be added to the code.

rncar avatar Sep 07 '22 15:09 rncar

AMD Ryzen 5 2400GE (4C8T, 3200 MHz):

4 threads: 21.16 s/it
8 threads: 14.66 s/it

dcz-self avatar Sep 11 '22 14:09 dcz-self

Intel Xeon 2670(3):

1: 23.75 s/it
2: 7.22 s/it
4: 6.57 s/it
...
10: 5.02 s/it

but

12: 5.66 s/it
14: 5.76 s/it
...
24: 6.95 s/it
NUMA: 6.1 s/it

CPU utilization stayed under 50% in every test. Why?

Sogvehz avatar Sep 11 '22 16:09 Sogvehz

Btw, I bought one more RAM stick, so now there are 2 of them, and dual-channel is great (3.2 s/it). So the problem was memory I/O.

Sogvehz avatar Oct 09 '22 16:10 Sogvehz

Hmm, it seems that using only 12 threads on a Ryzen 5900X already maxes out performance? Going from 12 to 24 threads does push my CPU usage up to 100%, but the speed barely increases at all. I guess hyperthreading on AMD isn't useful for this workload?

Ryzen 5900X:

INFERENCE_NUM_THREADS: 12 = 2.42 s/it
INFERENCE_NUM_THREADS: 24 = 2.39 s/it

Anyone know what the bottleneck is here? Running on an NVMe SSD and 2666 MHz RAM.

Seegee avatar Dec 14 '22 20:12 Seegee

It seems that hyperthreading isn't enabled by default. You have to enable it using the property:

from openvino.runtime import Core, properties  # OpenVINO 2022.3+ API

core = Core()
compiled_model = core.compile_model(
    model=model,
    device_name=device_name,
    config={properties.hint.enable_hyper_threading(): True},
)

https://docs.openvino.ai/2023.1/openvino_docs_OV_UG_supported_plugins_CPU.html#multi-threading-optimization

dbalabka avatar Oct 07 '23 23:10 dbalabka