OpenBLAS: For multithreaded GEMM in OpenBLAS, do I need to call an API in my program to ensure thread affinity?

Feb 15 '22 13:02 AnonymousYWL

openblas_set_num_threads()

Does this API ensure thread affinity?

Feb 15 '22 13:02 AnonymousYWL

Nope, it is not guaranteed. Once the affinity code sets the main thread's mask to a single CPU, nothing can leave that core, and that includes child processes. Makefile.rule describes the same issue from another point of view. Please check https://www.postgresql.org/message-id/[email protected] for some options you can tweak to get (10x) closer to affined threading without the dangers mentioned above. By default, threads with heavy CPU usage are supposed to stay on different CPUs. When the number of threads equals the number of CPUs, they more or less do. Please measure the effect of every tweak you apply. If you actually suspect a performance problem, measure the wall time spent in each of the following:

A single-threaded call
A call limited to one NUMA node (if you have a server CPU or multiple CPUs)
An all-threads call
An all-threads call with the tweaks applied

They should get faster in that order.
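The comparison above can be scripted. What follows is only a minimal sketch in Python, not anything OpenBLAS ships: `best_wall_time`, `compare` and `gets_faster` are names made up for illustration, and each callable you pass in is assumed to run your own GEMM call under one of the four configurations.

```python
import time


def best_wall_time(fn, repeats=5):
    """Return the smallest wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def compare(configs, repeats=5):
    """configs is a list of (label, callable); returns [(label, seconds)] in the same order."""
    return [(label, best_wall_time(fn, repeats)) for label, fn in configs]


def gets_faster(results):
    """True if every configuration is no slower than the one listed before it."""
    times = [seconds for _, seconds in results]
    return all(later <= earlier for earlier, later in zip(times, times[1:]))
```

Pass the four configurations in the order listed above, for example `compare([("one thread", run_single), ("one NUMA node", run_node), ...])`, where `run_single` and `run_node` stand for whatever functions wrap your own GEMM call. Taking the best of several repeats reduces noise from other processes, but it is a choice, not a requirement.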

If not, you can drill down into the heaviest OpenBLAS parts of your code with perf record ; perf report, a profiler that does not need compiled-in instrumentation. Then rinse and repeat with a "reduced sample": if you find a regression, it is very easy to understand there.
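To make the earlier warning about the affinity mask concrete, here is a small Linux-only sketch in Python that pins the calling thread to one CPU and then asks a child process which CPUs it may run on. It does not use OpenBLAS at all. It relies only on the standard `os.sched_setaffinity` and `os.sched_getaffinity` calls, and `child_mask_after_pinning` is a name invented for this illustration.

```python
import os
import subprocess
import sys


def child_mask_after_pinning():
    """Pin this thread to one CPU, spawn a child, return (pinned_cpu, child_mask)."""
    original = os.sched_getaffinity(0)   # CPUs the calling thread may use now
    cpu = min(original)                  # arbitrary pick: lowest allowed CPU
    try:
        os.sched_setaffinity(0, {cpu})   # restrict the calling thread to that CPU
        out = subprocess.run(
            [sys.executable, "-c",
             "import os; print(*sorted(os.sched_getaffinity(0)))"],
            capture_output=True, text=True, check=True,
        ).stdout
    finally:
        os.sched_setaffinity(0, original)  # always restore the original mask
    return cpu, {int(token) for token in out.split()}


if __name__ == "__main__":
    pinned, child_mask = child_mask_after_pinning()
    print("pinned to CPU", pinned, "- child may run on", sorted(child_mask))
```

The child reports only the CPU the parent was pinned to, because the mask is inherited when the process is created. This sketch restores the parent's mask afterwards, but a child that was started while the mask was narrowed stays confined unless it widens its own mask.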

Feb 15 '22 14:02 brada4

@AnonymousYWL any update on your side?

Feb 18 '22 15:02 brada4

No, thanks for your reply.

Feb 19 '22 03:02 AnonymousYWL