OpenBLAS hangs when testing multithreaded affinity

Open MacChen02 opened this issue 5 years ago • 13 comments

OpenBLAS hangs when testing multithreaded affinity.

Environment: ARMV8, CentOS 7.6, OpenBLAS-0.3.7

Compile cmd: make TARGET=ARMV8 CC=gcc FC=gfortran DEBUG=1 NO_AFFINITY=0 -j96

Execute cmd: export OMP_NUM_THREADS=32 && ./dgemm.goto 6000 6000

The problem can be reproduced by simulating an abnormal situation: manually stop the process while it is executing the initialization code in OpenBLAS-0.3.7/driver/others/init.c.

If the process exits abnormally before blas_unlock(&common -> lock), the value of common->lock stays at 1. The shared memory segment still exists on the next run, with common->lock=1 and common->magic=SH_MAGIC, so blas_lock enters an infinite loop and the program hangs.
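To illustrate the failure mode described above, here is a minimal sketch of a spin lock living in a shared struct. It is not the OpenBLAS code: the struct shm_demo_t, the functions demo_lock_bounded and demo_unlock, and the SH_MAGIC_DEMO value are all hypothetical names made up for this example, and the spin is bounded only so that the demonstration terminates. An unbounded version of the same loop would spin forever once a previous owner died with the lock set to 1.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder marker value; an assumption for this sketch only. */
#define SH_MAGIC_DEMO 0x510510UL

/* Hypothetical stand-in for the structure kept in shared memory. */
typedef struct {
    atomic_ulong  lock;   /* 0 = free, 1 = held */
    unsigned long magic;  /* set once initialization has completed */
} shm_demo_t;

/* Try to take the lock, giving up after max_spins attempts.
 * Returns true if the lock was acquired, false if it stayed held. */
static bool demo_lock_bounded(shm_demo_t *c, unsigned long max_spins)
{
    assert(c != NULL);
    for (unsigned long i = 0; i < max_spins; i++) {
        unsigned long expected = 0UL;
        /* Atomically change 0 -> 1; fails while someone else holds it. */
        if (atomic_compare_exchange_strong(&c->lock, &expected, 1UL))
            return true;
    }
    return false; /* lock never became free: possibly a stale owner */
}

/* Release the lock. A process killed before this call leaves lock == 1. */
static void demo_unlock(shm_demo_t *c)
{
    assert(c != NULL);
    atomic_store(&c->lock, 0UL);
}
```

If the struct is left with lock == 1 and the magic value already set, demo_lock_bounded returns false no matter how many spins are allowed, which corresponds to the hang seen on the next run.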

Once this has happened, OpenBLAS becomes unusable.

I have attached a patch that checks the value of common->lock first: 0001-hang-multithread-affinity.patch.txt

@brada4 @martin-frbg

MacChen02 avatar Dec 17 '19 09:12 MacChen02

Looking at it

  • you must start checking thread magic so we manipulate our thread (current logic is exactly reverse)
  • then you zap things only in places you need to if()

No time during xmas, you show the root cause of a long hidden problem that openblas messes with other threads....

brada4 avatar Dec 19 '19 09:12 brada4

Not sure if the existing code (which dates back to GotoBLAS) is actually incorrect for normal operations. Perhaps there needs to be a separate pass to pick up the bits from threads that met an unexpected fate ?

martin-frbg avatar Dec 19 '19 13:12 martin-frbg

Distinguish "ours" from "main" and "others"

brada4 avatar Dec 19 '19 13:12 brada4

@brada4 @martin-frbg How about the patch file? It can solve the problem of abnormal interruption.

Another thread may encounter one of the following two situations:

  1. common->lock is held
    1. common->shmid is alive: the thread executes nop instructions and waits.
    2. common->shmid is dead: this indicates that other threads exited abnormally, so the thread should clear the stale value at this point.
  2. common->lock is free: everything is fine.
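The cases listed above can be sketched as a small decision function. This is only an illustration of the proposed logic, not the contents of the patch: lock_action_t and handle_lock are hypothetical names, and how to determine reliably whether the lock owner is still alive is deliberately left as an input (owner_alive), since that is the difficult part.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of inspecting the shared lock. */
typedef enum {
    LOCK_PROCEED,        /* lock is free, continue normally */
    LOCK_WAIT_FOR_OWNER, /* lock is held by a live owner, keep spinning */
    LOCK_RESET_STALE     /* lock was left behind by a dead owner */
} lock_action_t;

/* Inspect *lock and decide what to do. owner_alive is an assumption
 * supplied by the caller; this sketch does not show how to obtain it. */
static lock_action_t handle_lock(unsigned long *lock, bool owner_alive)
{
    assert(lock != NULL);
    if (*lock == 0UL)
        return LOCK_PROCEED;        /* case 2: lock is free */
    if (owner_alive)
        return LOCK_WAIT_FOR_OWNER; /* case 1.1: wait for the owner */
    *lock = 0UL;                    /* case 1.2: clear the stale value */
    return LOCK_RESET_STALE;
}
```

The lock value is only cleared in the branch where the owner is reported dead, so a wrong liveness answer would release a lock that is legitimately held.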

I think the "common -> magic != SH_MAGIC" check is used to wait for another thread that is handling the NUMA mapping and so on.

MacChen02 avatar Jan 03 '20 03:01 MacChen02

Sorry, I still do not see how this could occur in a real-life situation, rather than willfully knocking down threads during the early initialization phase of OpenBLAS ?

martin-frbg avatar Jan 03 '20 20:01 martin-frbg

@martin-frbg

The problem has appeared on my device several times, and it has also appeared several times on an x86 platform (Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz). It occurs with a small probability in some abnormal situations.

I only reproduced the problem by simulating an abnormal situation; it does actually occur during the early initialization phase of OpenBLAS.

MacChen02 avatar Jan 04 '20 01:01 MacChen02

Does it affect the default build, which does not play with affinity and lets an operating system scheduler that is 10 years newer place processes on processors?

What you call NUMA is actually just placing threads on CPUs in order, nothing modern there. Improved robustness will help there too.

More modern approach (given absence of good NUMA awareness, like memory-to-cpu binding) would be here: https://www.postgresql.org/message-id/[email protected]

brada4 avatar Jan 04 '20 09:01 brada4

BTW which compiler are you using ? Seems CentOS 7.6 comes with a very old version of gcc (4.8.5) by default. (I'd still like to understand, and if possible fix, the underlying issue of the "abnormal situations" leading to unexpected, unhandled thread death)

martin-frbg avatar Jan 08 '20 15:01 martin-frbg

@brada4 The problem doesn't affect the default build without affinity; it only happens when affinity is enabled. It has nothing to do with NUMA features.

@martin-frbg The version of gcc is 4.8.5. The root cause of the problem is something like manually aborting the program. The probability is small, but I ran into it, hence this issue.

I have not thought of other possible causes yet.

MacChen02 avatar Jan 09 '20 11:01 MacChen02

There is a newer compiler on softwarecollections.org named devtoolset-?-gcc

brada4 avatar Jan 09 '20 12:01 brada4

@brada4 I don't use devtoolset-?-gcc as the system compiler. What difference from the system gcc would it make for this problem?

MacChen02 avatar Jan 10 '20 01:01 MacChen02

It is selectable, you don't have to change the system compiler. Slightly dated instructions are here: https://github.com/xianyi/OpenBLAS/wiki/faq#binutils

brada4 avatar Jan 10 '20 07:01 brada4

If the scenario is "only" about killing a thread at an inopportune moment where it holds a lock, changing the compiler is unlikely to improve anything. I am still worried that that simple patch could create an equally undesirable and much less obvious problem where a thread could be wrongly pronounced dead during normal operation.

martin-frbg avatar Jan 10 '20 10:01 martin-frbg