For multithreaded GEMM in OpenBLAS, do I need to call an API in my program to ensure thread affinity?
openblas_set_num_threads()
Does this API ensure thread affinity?
No, it is not guaranteed. Once the affinity code sets the main thread's mask to a single CPU, there is no way off that core, and that applies to child processes as well. Makefile.rule describes the same thing from another point of view. Please check https://www.postgresql.org/message-id/[email protected] for some options you can tweak to get (10x) closer to affined threading without the dangers mentioned above. By default, threads with heavy CPU usage are supposed to stay on different CPUs, and when the thread count and the CPU count are identical they more or less do. Please measure the effect of any tweak you try. If you actually suspect a performance problem, measure the wall time spent in each of the following:
- a single-threaded call
- a call limited to one NUMA node (if you have a server CPU or multiple CPUs)
- an all-threads call
- an all-threads call with the tweaks applied
They should get faster in that order.
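For the wall-time measurements listed above, a minimal sketch of a timing harness could look like this. The helper names (`wall_seconds`, `best_wall_time`, `gemm_job`, `naive_gemm`) are made up for illustration, and the naive matrix multiply is only a placeholder so the sketch builds without OpenBLAS. It assumes a POSIX system with `clock_gettime`.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime / CLOCK_MONOTONIC */
#endif
#include <time.h>

/* Wall-clock seconds from a monotonic clock. CPU time (e.g. clock())
   would add up the time of all worker threads and hide any speedup. */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical helper: run fn(arg) `repeats` times and return the
   smallest wall time seen, or -1.0 if repeats <= 0. */
double best_wall_time(void (*fn)(void *), void *arg, int repeats)
{
    double best = -1.0;
    for (int i = 0; i < repeats; i++) {
        double t0 = wall_seconds();
        fn(arg);
        double dt = wall_seconds() - t0;
        if (best < 0.0 || dt < best)
            best = dt;
    }
    return best;
}

/* Placeholder workload (an assumption, not OpenBLAS code):
   row-major n x n product C = A * B. */
struct gemm_job {
    int n;
    const double *a;
    const double *b;
    double *c;
};

void naive_gemm(void *arg)
{
    struct gemm_job *j = arg;
    int n = j->n;
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            double sum = 0.0;
            for (int p = 0; p < n; p++)
                sum += j->a[i * n + p] * j->b[p * n + k];
            j->c[i * n + k] = sum;  /* overwrite, so repeats do not accumulate */
        }
    }
}
```

For a real measurement you would swap the body of `naive_gemm` for a `cblas_dgemm` call, call `openblas_set_num_threads()` with the thread count you want before each `best_wall_time(...)` run, and compare the numbers for the cases in the list.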
If they do not, you can drill down into the heaviest OpenBLAS parts of your code with perf record ; perf report, a profiler that does not need compiled-in instrumentation. Then rinse and repeat with a "reduced sample": if you find a regression, it is very easy to understand there.
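Coming back to the affinity warning at the start of this reply: the sketch below shows, on Linux, how a one-CPU mask set on the parent is inherited by a forked child. This is not OpenBLAS code. The helper name `child_cpu_count_after_pinning` is hypothetical, and the sketch assumes glibc with `sched_setaffinity` available.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* for sched_setaffinity and the CPU_* macros */
#endif
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: pin the caller to a single CPU, fork a child,
   and return how many CPUs the child is allowed to run on.
   Returns -1 on error. The original mask is restored before returning. */
int child_cpu_count_after_pinning(void)
{
    cpu_set_t original, pinned;
    if (sched_getaffinity(0, sizeof(original), &original) != 0)
        return -1;

    /* Pick the first CPU the caller is currently allowed to use. */
    int cpu = -1;
    for (int i = 0; i < CPU_SETSIZE; i++) {
        if (CPU_ISSET(i, &original)) {
            cpu = i;
            break;
        }
    }
    if (cpu < 0)
        return -1;

    CPU_ZERO(&pinned);
    CPU_SET(cpu, &pinned);
    if (sched_setaffinity(0, sizeof(pinned), &pinned) != 0)
        return -1;

    int result = -1;
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: report the size of the inherited mask as exit status. */
        cpu_set_t inherited;
        if (sched_getaffinity(0, sizeof(inherited), &inherited) != 0)
            _exit(255);
        _exit(CPU_COUNT(&inherited));
    } else if (pid > 0) {
        int status = 0;
        if (waitpid(pid, &status, 0) == pid && WIFEXITED(status)
            && WEXITSTATUS(status) != 255)
            result = WEXITSTATUS(status);
    }

    /* Undo the pinning so the caller is not left on one core. */
    sched_setaffinity(0, sizeof(original), &original);
    return result;
}
```

The child never sets a mask of its own, yet it ends up restricted to the same single CPU as the parent. That is why a main thread pinned by affinity code is a problem for anything the program spawns afterwards.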
@AnonymousYWL any update on your side?
No, thanks for your reply.