[Bug]: SIGSEGV in using dnn_face_recognition_ex on AMD Zen5 Arch 9950X + Ubuntu24

Open lambdahuang opened this issue 7 months ago • 4 comments

What Operating System(s) are you seeing this problem on?

Linux (x86-64)

dlib version

19.24

Python version

3.12

Compiler

gcc 12

Expected Behavior

Operating System: Ubuntu 24.04 LTS. CPU: AMD 9950X (Zen 5).

  • Tried the official Ubuntu OpenBLAS packages, both the OpenMP and pthread variants.
  • Tried OpenBLAS compiled locally with:
    • make DYNAMIC_ARCH=1 TARGET=ZEN USE_OPENMP=0 NO_AFFINITY=1
    • make DYNAMIC_ARCH=1 TARGET=HASWELL USE_OPENMP=0 NO_AFFINITY=1

Expected: compilation and execution succeed.
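
With several OpenBLAS builds in play, it may help to confirm which one a process actually loads. The sketch below is a hypothetical helper (the name `openblas_build_info` does not come from dlib or OpenBLAS) that uses `ctypes` to call OpenBLAS's `openblas_get_config()` and `openblas_get_corename()`. It assumes the shared library can be located under the name `openblas` and returns `None` otherwise.

```python
import ctypes
import ctypes.util


def openblas_build_info():
    """Report the build config and selected core of the system OpenBLAS.

    Returns a dict, or None when OpenBLAS (or these symbols) is unavailable.
    """
    path = ctypes.util.find_library("openblas")
    if path is None:
        return None
    try:
        lib = ctypes.CDLL(path)
        # Both functions return a C string describing the loaded library.
        lib.openblas_get_config.restype = ctypes.c_char_p
        lib.openblas_get_corename.restype = ctypes.c_char_p
        return {
            "library": path,
            "config": lib.openblas_get_config().decode(),
            "core": lib.openblas_get_corename().decode(),
        }
    except (OSError, AttributeError):
        return None


if __name__ == "__main__":
    print(openblas_build_info())
```

On a `DYNAMIC_ARCH` build, the reported core is the kernel set chosen at run time for the detected CPU, so it can differ from the `TARGET` passed to `make`.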

Current Behavior

Crash backtrace:

(gdb) bt
#0  0x00007fffe6842d64 in sgemm_beta_COOPERLAKE () at /lib/x86_64-linux-gnu/libopenblas.so.0
#1  0x00007fffe46c34c5 in ??? () at /lib/x86_64-linux-gnu/libopenblas.so.0
#2  0x00007fffe484b23d in ??? () at /lib/x86_64-linux-gnu/libopenblas.so.0
#3  0x00007fffe484b498 in ??? () at /lib/x86_64-linux-gnu/libopenblas.so.0
#4  0x00007fffdf69caa4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#5  0x00007fffdf729c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

The crash is similar under every configuration I tried.
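
Frames #1 to #3 sit in `libopenblas.so.0` directly above `start_thread`, which suggests the fault happens on an OpenBLAS worker thread. One way to check whether threading plays a role is to rerun the example with BLAS limited to a single thread. This is a minimal sketch, assuming the installed build honors the `OPENBLAS_NUM_THREADS` and `OMP_NUM_THREADS` environment variables; the helper name and the binary path in the usage comment are hypothetical.

```python
import os


def single_thread_blas_env(base=None):
    """Return a copy of the environment with BLAS threading limited to 1."""
    env = dict(os.environ if base is None else base)
    env["OPENBLAS_NUM_THREADS"] = "1"  # read by OpenBLAS
    env["OMP_NUM_THREADS"] = "1"       # read by OpenMP runtimes
    return env


# Usage (hypothetical binary path and argument):
# import subprocess
# subprocess.run(["./dnn_face_recognition_ex", "image.jpg"],
#                env=single_thread_blas_env())
```

If the crash disappears with one thread, that points toward the threaded code path rather than the kernel itself, though it would not by itself identify the cause.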

Steps to Reproduce

[Reproduce] https://dlib.net/dnn_face_recognition_ex.cpp.html

Anything else?

The issue occurs when dlib is built with CUDA. There is no issue when dlib is built without CUDA.

No response

lambdahuang avatar Apr 23 '25 09:04 lambdahuang

Btw, this seems related to how dlib uses OpenBLAS, since the OpenBLAS folks verified that there was no issue in their local test on the same environment: https://github.com/OpenMathLib/OpenBLAS/issues/5243#issuecomment-2823673449

lambdahuang avatar Apr 26 '25 19:04 lambdahuang

Dlib is just calling BLAS functions. It's not doing anything OpenBLAS-specific. And given that it's all worked for over a decade with many BLAS libraries, I'm doubtful it's somehow dlib. They are also just function calls. There isn't any special magic or anything.

So maybe your install of OpenBLAS is built wrong? I can't say.
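
One way to test the installed OpenBLAS in isolation is to call a BLAS routine directly, with dlib out of the process. The sketch below is a hypothetical smoke test (the name `sgemm_smoke_test` is illustrative) that calls `cblas_sgemm` through `ctypes`; SGEMM is presumably how the `sgemm_beta_*` kernels in the backtraces are reached. It assumes the library exposes the standard CBLAS interface with 32-bit integers, and it returns `None` if the library or symbol cannot be found.

```python
import ctypes
import ctypes.util


def sgemm_smoke_test(n=4):
    """Compute C = I * B with cblas_sgemm and return C as a flat list.

    Returns None if OpenBLAS (or the cblas_sgemm symbol) is unavailable.
    """
    path = ctypes.util.find_library("openblas")
    if path is None:
        return None
    try:
        sgemm = ctypes.CDLL(path).cblas_sgemm
    except (OSError, AttributeError):
        return None

    c_int, c_float = ctypes.c_int, ctypes.c_float
    float_p = ctypes.POINTER(c_float)
    sgemm.restype = None
    sgemm.argtypes = [c_int, c_int, c_int,      # layout, transA, transB
                      c_int, c_int, c_int,      # M, N, K
                      c_float, float_p, c_int,  # alpha, A, lda
                      float_p, c_int,           # B, ldb
                      c_float, float_p, c_int]  # beta, C, ldc

    Matrix = c_float * (n * n)
    # A is the n x n identity, B holds 0..n*n-1, so C should equal B.
    a = Matrix(*[1.0 if i // n == i % n else 0.0 for i in range(n * n)])
    b = Matrix(*[float(i) for i in range(n * n)])
    c = Matrix()

    ROW_MAJOR, NO_TRANS = 101, 111  # CblasRowMajor, CblasNoTrans
    sgemm(ROW_MAJOR, NO_TRANS, NO_TRANS, n, n, n,
          1.0, a, n, b, n, 0.0, c, n)
    return list(c)


if __name__ == "__main__":
    print(sgemm_smoke_test())
```

If this runs cleanly on the affected machine, the library at least handles a basic SGEMM call, which would shift attention to how the failing process is built or linked. A small matrix like this may not exercise the multithreaded path, so a clean run is not conclusive.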

davisking avatar Apr 27 '25 23:04 davisking

Hi Davis, I also ran into a similar issue on ARMv8:

Thread 3 "sentry_robot" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xffff99fcf840 (LWP 7071)]
0x0000ffffe5da9ac0 in sgemm_beta_ARMV8 () from /lib/aarch64-linux-gnu/libopenblas.so.0
(gdb) bt
#0  0x0000ffffe5da9ac0 in sgemm_beta_ARMV8 () at /lib/aarch64-linux-gnu/libopenblas.so.0
#1  0x0000ffffe5cc136c in  () at /lib/aarch64-linux-gnu/libopenblas.so.0

I've tried all of these OpenBLAS builds:

  • OpenBLAS 0.3.29 binaries from the official Ubuntu packages (Ubuntu 24.1)
  • OpenBLAS 0.3.26 binaries from the official Ubuntu packages (the Ubuntu 24 LTS default version)
  • OpenBLAS compiled from source

And different threading backends:

  • OpenBLAS with OpenMP
  • OpenBLAS with pthread

I also tested dlib + CUDA on ARMv8 (Jetson Orin AGX 64GB), and it seems to have a similar issue:

Thread 3 "sentry_robot" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xffff99fcf840 (LWP 7071)]
0x0000ffffe5da9ac0 in sgemm_beta_ARMV8 () from /lib/aarch64-linux-gnu/libopenblas.so.0
(gdb) bt
#0  0x0000ffffe5da9ac0 in sgemm_beta_ARMV8 () at /lib/aarch64-linux-gnu/libopenblas.so.0
#1  0x0000ffffe5cc136c in  () at /lib/aarch64-linux-gnu/libopenblas.so.0

The reason I suspect it might be related to dlib is that if I run the same product logic with dlib built without CUDA, it works fine, but with CUDA enabled, it crashes.

lambdahuang avatar Apr 27 '25 23:04 lambdahuang

Warning: this issue has been inactive for 35 days and will be automatically closed on 2025-06-11 if there is no further activity.

If you are waiting for a response but haven't received one it's possible your question is somehow inappropriate. E.g. it is off topic, you didn't follow the issue submission instructions, or your question is easily answerable by reading the FAQ, dlib's official compilation instructions, dlib's API documentation, or a Google search.

dlib-issue-bot avatar Jun 02 '25 08:06 dlib-issue-bot

Warning: this issue has been inactive for 42 days and will be automatically closed on 2025-06-11 if there is no further activity.

dlib-issue-bot avatar Jun 09 '25 08:06 dlib-issue-bot

Notice: this issue has been closed because it has been inactive for 45 days. You may reopen this issue if it has been closed in error.

dlib-issue-bot avatar Jun 12 '25 08:06 dlib-issue-bot