
signal 11 (core dumped) 3 mins after making a call

Open Shaun-3adesign opened this issue 7 months ago • 42 comments

Bug Overview

When we make a call to a Flask app, we get a core dump around 3 minutes later. This is present in versions 1.32.2, 1.33.0, 1.34.0 and 1.34.2. The call we are making performs an insert or a delete on a MySQL database using mysql-python-connector.

Expected Behavior

Unit should not core dump.

Steps to Reproduce the Bug

Run the application and make a POST or DELETE call.

Environment Details

  • Target deployment platform: local server in a docker container
  • Target OS: Ubuntu Desktop 22.04
  • Version of this project or specific commit: 1.34.2
  • Version of any relevant project languages: Python 3.11.2, Docker version 24.0.7, build afdd53b

listener.json config.json Dockerfile.txt docker-entrypoint.txt

Additional Context

debug log server.txt

Shaun-3adesign avatar May 01 '25 15:05 Shaun-3adesign

Hi, thanks for the report.

when we make call to a flask app we get a core dump around 3 mins after, this is present in versions 1.32.2, 1.33.0, 1.34.0, 1.34.2

Interesting, so IIUC, you can make a single request to the application, wait ~3 mins and then the application process will crash?

Is this 100% the case and is it always after around 3 minutes?

The good news is you're getting a core dump (as I don't expect you have a handy reproducer).

Worked with coredumps before? I would really love a backtrace...

I'm going to assume you know where the coredumps are going, so next time you get one can you simply do

$ gdb /path/to/unitd /path/to/coredump
...
(gdb) bt

And paste the output. You may or may not get symbols displayed and you may need to install debuginfo packages for unit/python, but let's see what we get first.
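The interactive session above can also be run non-interactively. The following is a minimal sketch, assuming gdb is installed in the same environment as the core file. The two paths are the same placeholders used above, and the helper names `gdb_backtrace_command` and `collect_backtrace` are made up for this illustration. It relies only on gdb's `-batch` and `-ex` options.

```python
import shlex
import subprocess


def gdb_backtrace_command(unitd_path, core_path):
    """Build a gdb invocation that prints `bt` and `bt full`, then exits."""
    return [
        "gdb", "-batch",
        "-ex", "bt",
        "-ex", "bt full",
        unitd_path, core_path,
    ]


def collect_backtrace(unitd_path, core_path):
    """Run gdb and return its combined stdout/stderr as text."""
    result = subprocess.run(
        gdb_backtrace_command(unitd_path, core_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # Placeholder paths, as in the comment above; substitute real ones.
    print(shlex.join(gdb_backtrace_command("/path/to/unitd", "/path/to/coredump")))
```

Calling `collect_backtrace(...)` with real paths should return the same text that the interactive `bt` and `bt full` commands print, which can then be pasted into the issue.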

ac000 avatar May 01 '25 17:05 ac000

Another quick check would be to see if the problem happens without threads, i.e. comment out "threads": 4, (or change it to 1) in the config.

ac000 avatar May 01 '25 17:05 ac000

Hi, thanks for the report.

when we make call to a flask app we get a core dump around 3 mins after, this is present in versions 1.32.2, 1.33.0, 1.34.0, 1.34.2

Interesting, so IIUC, you can make a single request to the application, wait ~3 mins and then the application process will crash?

Is this 100% the case and is it always after around 3 minutes?

The good news is you're getting a core dump (as I don't expect you have a handy reproducer).

Worked with coredumps before? I would really love a backtrace...

I'm going to assume you know where the coredumps are going, so next time you get one can you simply do

$ gdb /path/to/unitd /path/to/coredump
...
(gdb) bt

And paste the output. You may or may not get symbols displayed and you may need to install debuginfo packages for unit/python, but let's see what we get first.

We can reproduce the issue very easily, though the core dump files don't seem to be getting saved anywhere.
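When core files do not show up, two settings are usually worth checking: the core size limit and the kernel's core_pattern. The following is a rough sketch, assuming a Linux system. The function name `core_dump_settings` is hypothetical, and the limit it reports is that of the Python process running it, which may differ from the limit the unitd processes run with.

```python
import resource
from pathlib import Path


def core_dump_settings():
    """Report the core size limits of this process and the kernel core_pattern."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    try:
        pattern = Path("/proc/sys/kernel/core_pattern").read_text().strip()
    except OSError:
        # Not Linux, or /proc is not readable in this environment.
        pattern = None
    return {"soft_limit": soft, "hard_limit": hard, "core_pattern": pattern}


if __name__ == "__main__":
    for key, value in core_dump_settings().items():
        print(f"{key}: {value}")
```

A soft limit of 0 means no core file is written. A pattern starting with `|` means the kernel pipes the dump to a helper program rather than writing a file. Inside a Docker container the pattern is typically the host's, so the dump may end up wherever the host's handler puts it.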

Shaun-3adesign avatar May 02 '25 07:05 Shaun-3adesign

Let me know if you need more, as I've not done this before.

(gdb) bt
#0  0x00006356a3f4a093 in ?? ()
#1  0x00006356a3f4a1c9 in ?? ()
#2  0x00006356a3f3200e in nxt_event_engine_start ()
#3  0x00006356a3f2fcc6 in ?? ()
#4  0x00007997955671f5 in start_thread (arg=) at ./nptl/pthread_create.c:442
#5  0x00007997955e6b00 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100

Also, we tested with 1 thread and the issue is still present.

Shaun-3adesign avatar May 02 '25 13:05 Shaun-3adesign

The Flask server we are running is an OpenAPI server. Just connecting to the Swagger page causes the core dump, so it is not linked to doing a POST.

2025/05/02 13:27:35 [info] 198#198 router started
2025/05/02 13:27:35 [info] 198#198 OpenSSL 3.0.15 3 Sep 2024, 300000f0
2025/05/02 13:27:35 [info] 199#199 "flask" prototype started
2025/05/02 13:27:35 [info] 200#200 "flask" application started
2025/05/02 13:27:36 [info] 226#226 "flask" application started
2025/05/02 13:27:36 [info] 252#252 "flask" application started
2025/05/02 13:27:36 [info] 278#278 "flask" application started
172.18.0.1 - - [02/May/2025:13:29:37 +0000] "GET /resource_manager/ui/ HTTP/1.1" 200 1498 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 Edg/134.0.0.0" "0.032"
172.18.0.1 - - [02/May/2025:13:29:37 +0000] "GET /resource_manager/ui/swagger-ui.css HTTP/1.1" 200 143669 "https://localhost:20018/resource_manager/ui/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 Edg/134.0.0.0" "0.012"
172.18.0.1 - - [02/May/2025:13:29:37 +0000] "GET /resource_manager/ui/swagger-ui-bundle.js HTTP/1.1" 200 1091405 "https://localhost:20018/resource_manager/ui/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 Edg/134.0.0.0" "0.036"
172.18.0.1 - - [02/May/2025:13:29:38 +0000] "GET /resource_manager/ui/swagger-ui-standalone-preset.js HTTP/1.1" 200 337216 "https://localhost:20018/resource_manager/ui/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 Edg/134.0.0.0" "0.004"
172.18.0.1 - - [02/May/2025:13:29:38 +0000] "GET /resource_manager/ui/favicon-32x32.png HTTP/1.1" 200 628 "https://localhost:20018/resource_manager/ui/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 Edg/134.0.0.0" "0.004"
172.18.0.1 - - [02/May/2025:13:29:38 +0000] "GET /resource_manager/openapi.json HTTP/1.1" 200 55553 "https://localhost:20018/resource_manager/ui/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36 Edg/134.0.0.0" "0.063"
2025/05/02 13:32:34 [alert] 7#7 process 198 exited on signal 11 (core dumped)
2025/05/02 13:32:34 [info] 336#336 router started
2025/05/02 13:32:34 [info] 336#336 OpenSSL 3.0.15 3 Sep 2024, 300000f0
2025/05/02 13:32:34 [info] 345#345 "flask" prototype started
2025/05/02 13:32:34 [info] 346#346 "flask" application started
2025/05/02 13:32:34 [notice] 199#199 app process 252 exited with code 0
2025/05/02 13:32:34 [notice] 199#199 app process 278 exited with code 0
2025/05/02 13:32:34 [notice] 199#199 app process 200 exited with code 0
2025/05/02 13:32:34 [notice] 199#199 app process 226 exited with code 0
2025/05/02 13:32:34 [notice] 7#7 process 199 exited with code 0
2025/05/02 13:32:34 [info] 372#372 "flask" application started
2025/05/02 13:32:34 [info] 398#398 "flask" application started
2025/05/02 13:32:35 [info] 424#424 "flask" application started

Shaun-3adesign avatar May 02 '25 13:05 Shaun-3adesign

Thanks for the backtrace and the log output.

It actually looks like it's the router process that's crashing and not the application process.

Unfortunately, as feared, you are missing most of the debug symbols. Let's see if we can fix that.

Assuming you installed from packages, you should be able to get debuginfo by installing the unit-dbg & unit-python-3.11-dbg (just in case) packages.

Then (and you can probably use the same coredump) could you provide the output from both the bt & bt full gdb commands? Thanks.

ac000 avatar May 02 '25 15:05 ac000

I'm getting the following when installing unit-python-3.11-dbg:

root@ee67afd20b2b:/tmp# apt install unit-python-3.11-dbg
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
E: Unable to locate package unit-python-3.11-dbg
E: Couldn't find any package by glob 'unit-python-3.11-dbg'

Shaun-3adesign avatar May 02 '25 15:05 Shaun-3adesign

Here is the backtrace with just unit-dbg installed:

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `unit: router                                               '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  nxt_h1p_complete_buffers (task=task@entry=0x72a9e4001620, h1p=h1p@entry=0x6255219bc340, all=all@entry=1) at src/nxt_h1proto.c:1522
1522    src/nxt_h1proto.c: No such file or directory.
[Current thread is 1 (Thread 0x72a9ea5266c0 (LWP 305))]
(gdb) bt
#0  nxt_h1p_complete_buffers (task=task@entry=0x72a9e4001620, h1p=h1p@entry=0x6255219bc340, all=all@entry=1) at src/nxt_h1proto.c:1522
#1  0x000062550e4cb46f in nxt_h1p_shutdown (task=0x72a9e4001620, c=0x72a9e4001550) at src/nxt_h1proto.c:2129
#2  0x000062550e4b75c2 in nxt_event_engine_start (engine=0x6255219bc0b0) at src/nxt_event_engine.c:542
#3  0x000062550e4b5bf1 in nxt_thread_trampoline (data=0x6255219339e0) at src/nxt_thread.c:126
#4  0x000072a9ebb621f5 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x000072a9ebbe1b00 in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt full
#0  nxt_h1p_complete_buffers (task=task@entry=0x72a9e4001620, h1p=h1p@entry=0x6255219bc340, all=all@entry=1) at src/nxt_h1proto.c:1522
        size = <optimized out>
        b = 0x0
        in = <optimized out>
        next = <optimized out>
        c = 0x0
#1  0x000062550e4cb46f in nxt_h1p_shutdown (task=0x72a9e4001620, c=0x72a9e4001550) at src/nxt_h1proto.c:2129
        timer = <optimized out>
        h1p = 0x6255219bc340
#2  0x000062550e4b75c2 in nxt_event_engine_start (engine=0x6255219bc0b0) at src/nxt_event_engine.c:542
        obj = 0x72a9e40015b8
        data = 0x0
        task = 0x72a9e4001620
        timeout = <optimized out>
        now = <optimized out>
        thr = <optimized out>
        handler = 0x62550e4b7a80 <nxt_timer_handler>
#3  0x000062550e4b5bf1 in nxt_thread_trampoline (data=0x6255219339e0) at src/nxt_thread.c:126
        __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {126074106308288, 255495596939513932, -1368, 11, 140734449295984, 126074097917952, -1812585321242455988, -4097801277494816692}, __mask_was_saved = 0}}, __pad = {0x72a9ea525a30, 0x0, 0x0, 0x0}}
        __cancel_routine = 0x62550e4b5a60 <nxt_thread_time_cleanup>
        __cancel_arg = <optimized out>
        __not_first_call = <optimized out>
        thr = <optimized out>
        link = 0x6255219339e0
        start = 0x62550e4c0c90 <nxt_router_thread_start>
#4  0x000072a9ebb621f5 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#5  0x000072a9ebbe1b00 in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.

Shaun-3adesign avatar May 02 '25 15:05 Shaun-3adesign

I'm getting the following when installing unit-python-3.11-dbg:

root@ee67afd20b2b:/tmp# apt install unit-python-3.11-dbg
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
E: Unable to locate package unit-python-3.11-dbg
E: Couldn't find any package by glob 'unit-python-3.11-dbg'

Fixed this. The issue was the package name: it's python3.11-dbg rather than python-3.11-dbg.

Shaun-3adesign avatar May 02 '25 16:05 Shaun-3adesign

Backtrace with both unit-dbg & unit-python3.11-dbg installed:

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `unit: router                                               '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  nxt_h1p_complete_buffers (task=task@entry=0x7a1904001620, h1p=h1p@entry=0x636b874f99f0, all=all@entry=1) at src/nxt_h1proto.c:1522
1522    src/nxt_h1proto.c: No such file or directory.
[Current thread is 1 (Thread 0x7a1913fff6c0 (LWP 306))]
(gdb) bt
#0  nxt_h1p_complete_buffers (task=task@entry=0x7a1904001620, h1p=h1p@entry=0x636b874f99f0, all=all@entry=1) at src/nxt_h1proto.c:1522
#1  0x0000636b7063246f in nxt_h1p_shutdown (task=0x7a1904001620, c=0x7a1904001550) at src/nxt_h1proto.c:2129
#2  0x0000636b7061e5c2 in nxt_event_engine_start (engine=0x636b874f9760) at src/nxt_event_engine.c:542
#3  0x0000636b7061cbf1 in nxt_thread_trampoline (data=0x636b87467350) at src/nxt_thread.c:126
#4  0x00007a191c65c1f5 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007a191c6dbb00 in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt full
#0  nxt_h1p_complete_buffers (task=task@entry=0x7a1904001620, h1p=h1p@entry=0x636b874f99f0, all=all@entry=1) at src/nxt_h1proto.c:1522
        size = <optimized out>
        b = 0x0
        in = <optimized out>
        next = <optimized out>
        c = 0x0
#1  0x0000636b7063246f in nxt_h1p_shutdown (task=0x7a1904001620, c=0x7a1904001550) at src/nxt_h1proto.c:2129
        timer = <optimized out>
        h1p = 0x636b874f99f0
#2  0x0000636b7061e5c2 in nxt_event_engine_start (engine=0x636b874f9760) at src/nxt_event_engine.c:542
        obj = 0x7a19040015b8
        data = 0x0
        task = 0x7a1904001620
        timeout = <optimized out>
        now = <optimized out>
        thr = <optimized out>
        handler = 0x636b7061ea80 <nxt_timer_handler>
#3  0x0000636b7061cbf1 in nxt_thread_trampoline (data=0x636b87467350) at src/nxt_thread.c:126
        __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {134248128313024, 5039262364208667989, -1368, 11, 140726245626752, 134248119922688, -5630298242980307627, -8990892774217671339}, __mask_was_saved = 0}}, __pad = {0x7a1913ffea30, 0x0, 0x0, 0x0}}
        __cancel_routine = 0x636b7061ca60 <nxt_thread_time_cleanup>
        __cancel_arg = <optimized out>
        __not_first_call = <optimized out>
        thr = <optimized out>
        link = 0x636b87467350
        start = 0x636b70627c90 <nxt_router_thread_start>
#4  0x00007a191c65c1f5 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#5  0x00007a191c6dbb00 in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.

Shaun-3adesign avatar May 02 '25 16:05 Shaun-3adesign

Thanks.

#0  nxt_h1p_complete_buffers (task=task@entry=0x7a1904001620, h1p=h1p@entry=0x636b874f99f0, all=all@entry=1) at src/nxt_h1proto.c:1522
        size = <optimized out>
        b = 0x0
        in = <optimized out>
        next = <optimized out>
        c = 0x0

c is NULL here which we're not expecting.

If you are able to, you could try this patch and see if anything else falls out...

diff --git ./src/nxt_h1proto.c ./src/nxt_h1proto.c
index 9a9ad553..d0e03077 100644
--- ./src/nxt_h1proto.c
+++ ./src/nxt_h1proto.c
@@ -1519,7 +1519,7 @@ nxt_h1p_complete_buffers(nxt_task_t *task, nxt_h1proto_t *h1p, nxt_bool_t all)
 
     b = h1p->buffers;
     c = h1p->conn;
-    in = c->read;
+    in = c ? c->read : NULL;
 
     if (b != NULL) {
         if (in == NULL) {

ac000 avatar May 02 '25 16:05 ac000

Just to confirm again.

If you hit the problematic page, then after about 3 minutes, Unit crashes?

It's only that one page that causes the crash?

If you leave the page loaded in the browser, does it survive?

ac000 avatar May 02 '25 17:05 ac000

We make a call to the server. This could come from a Python process using python requests, but in our tests here we are using a browser. About 3 minutes after making the call, we get the crash. The web page looks fine and we can continue making requests. If we refresh the page, we get another crash after 3 minutes.

Shaun-3adesign avatar May 02 '25 17:05 Shaun-3adesign

OK, I tested the patch and still have the same issue.

latest backtrace

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `unit: router                                               '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  nxt_h1p_complete_buffers (task=task@entry=0x732064001620, h1p=h1p@entry=0x584bc2be89c0, all=all@entry=1) at src/nxt_h1proto.c:1522
1522    src/nxt_h1proto.c: No such file or directory.
[Current thread is 1 (Thread 0x73206b5b06c0 (LWP 297))]
(gdb) bt
#0  nxt_h1p_complete_buffers (task=task@entry=0x732064001620, h1p=h1p@entry=0x584bc2be89c0, all=all@entry=1) at src/nxt_h1proto.c:1522
#1  0x0000584ba231b46f in nxt_h1p_shutdown (task=0x732064001620, c=0x732064001550) at src/nxt_h1proto.c:2129
#2  0x0000584ba23075c2 in nxt_event_engine_start (engine=0x584bc2be8730) at src/nxt_event_engine.c:542
#3  0x0000584ba2305bf1 in nxt_thread_trampoline (data=0x584bc2b6a630) at src/nxt_thread.c:126
#4  0x000073206c3eb1f5 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x000073206c46ab00 in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt full
#0  nxt_h1p_complete_buffers (task=task@entry=0x732064001620, h1p=h1p@entry=0x584bc2be89c0, all=all@entry=1) at src/nxt_h1proto.c:1522
        size = <optimized out>
        b = 0x0
        in = <optimized out>
        next = <optimized out>
        c = 0x0
#1  0x0000584ba231b46f in nxt_h1p_shutdown (task=0x732064001620, c=0x732064001550) at src/nxt_h1proto.c:2129
        timer = <optimized out>
        h1p = 0x584bc2be89c0
#2  0x0000584ba23075c2 in nxt_event_engine_start (engine=0x584bc2be8730) at src/nxt_event_engine.c:542
        obj = 0x7320640015b8
        data = 0x0
        task = 0x732064001620
        timeout = <optimized out>
        now = <optimized out>
        thr = <optimized out>
        handler = 0x584ba2307a80 <nxt_timer_handler>
#3  0x0000584ba2305bf1 in nxt_thread_trampoline (data=0x584bc2b6a630) at src/nxt_thread.c:126
        __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {126583077275328, 1042248611634940760, -1368, 11, 140720990601120, 126583068884992, -1714154970403239080, -4692315440823465128}, __mask_was_saved = 0}}, __pad = {0x73206b5afa30, 0x0, 0x0, 0x0}}
        __cancel_routine = 0x584ba2305a60 <nxt_thread_time_cleanup>
        __cancel_arg = <optimized out>
        __not_first_call = <optimized out>
        thr = <optimized out>
        link = 0x584bc2b6a630
        start = 0x584ba2310c90 <nxt_router_thread_start>
#4  0x000073206c3eb1f5 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#5  0x000073206c46ab00 in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.

Shaun-3adesign avatar May 02 '25 20:05 Shaun-3adesign

Weird. Are you absolutely sure you tested the patched version?

With that patch, in this case, b & in should both be NULL, so we should just skip past both of the outer if()'s. The key bit is that if c is NULL we won't attempt to read c->read.

Are websockets involved?

ac000 avatar May 03 '25 01:05 ac000

Actually, would it be possible for you to test current master?

There is a remote possibility it may make a difference... (there are some h1proto-related changes...)

ac000 avatar May 03 '25 01:05 ac000

So it seems my patched version was being replaced when building the image. I have fixed this and I still get a core dump, but it's a different one:

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `unit: router                                               '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00005ca23e878d53 in nxt_rbtree_branch_min (node=0x4, tree=<optimized out>) at src/nxt_rbtree.h:70
70      src/nxt_rbtree.h: No such file or directory.
[Current thread is 1 (Thread 0x7ba386edf6c0 (LWP 297))]
(gdb) bt
#0  0x00005ca23e878d53 in nxt_rbtree_branch_min (node=0x4, tree=<optimized out>) at src/nxt_rbtree.h:70
#1  nxt_rbtree_delete (tree=tree@entry=0x5ca27462b160, part=part@entry=0x7ba380001618) at src/nxt_rbtree.c:305
#2  0x00005ca23e84dbc9 in nxt_timer_changes_commit (engine=0x5ca27462af30) at src/nxt_timer.c:201
#3  0x00005ca23e84df38 in nxt_timer_find (engine=engine@entry=0x5ca27462af30) at src/nxt_timer.c:241
#4  0x00005ca23e84d7cc in nxt_event_engine_start (engine=0x5ca27462af30) at src/nxt_event_engine.c:547
#5  0x00005ca23e84bdc1 in nxt_thread_trampoline (data=0x5ca2745ad570) at src/nxt_thread.c:126
#6  0x00007ba387d1c1f5 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#7  0x00007ba387d9bb00 in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt full
#0  0x00005ca23e878d53 in nxt_rbtree_branch_min (node=0x4, tree=<optimized out>) at src/nxt_rbtree.h:70
No locals.
#1  nxt_rbtree_delete (tree=tree@entry=0x5ca27462b160, part=part@entry=0x7ba380001618) at src/nxt_rbtree.c:305
        color = <optimized out>
        node = 0x7ba380001618
        sentinel = 0x5ca27462b160
        subst = 0x7ba380001618
        child = <optimized out>
#2  0x00005ca23e84dbc9 in nxt_timer_changes_commit (engine=0x5ca27462af30) at src/nxt_timer.c:201
        timer = 0x7ba380001618
        timers = 0x5ca27462b160
        ch = 0x5ca274633dd0
        end = 0x5ca274633de0
        add = 0x5ca274633dd0
        add_end = 0x5ca274633de0
#3  0x00005ca23e84df38 in nxt_timer_find (engine=engine@entry=0x5ca27462af30) at src/nxt_timer.c:241
        delta = <optimized out>
        time = <optimized out>
        timer = <optimized out>
        timers = 0x5ca27462b160
        tree = <optimized out>
        node = <optimized out>
        next = <optimized out>
#4  0x00005ca23e84d7cc in nxt_event_engine_start (engine=0x5ca27462af30) at src/nxt_event_engine.c:547
        obj = 0x7ba380001550
        data = 0x5ca27462af30
        task = 0x0
        timeout = <optimized out>
        now = <optimized out>
        thr = <optimized out>
        handler = <optimized out>
#5  0x00005ca23e84bdc1 in nxt_thread_trampoline (data=0x5ca2745ad570) at src/nxt_thread.c:126
        __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {135942273627840, -165853751974725296, -688, 11, 140736538951552, 135942265237504, 
                789758586193541456, 4969363667880999248}, __mask_was_saved = 0}}, __pad = {0x7ba386edecf0, 0x0, 0x0, 0x0}}
        __cancel_routine = 0x5ca23e84bc30 <nxt_thread_time_cleanup>
        __cancel_arg = <optimized out>
        __not_first_call = <optimized out>
        thr = <optimized out>
        link = 0x5ca2745ad570
        start = 0x5ca23e856e60 <nxt_router_thread_start>
#6  0x00007ba387d1c1f5 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#7  0x00007ba387d9bb00 in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.

Shaun-3adesign avatar May 03 '25 08:05 Shaun-3adesign

Actually, would it be possible for you to test current master?

There is a remote possibility it may make a difference... (there are some h1proto-related changes...)

I'll try this later. Did you want me to include the patch? I assume I will need to build the Python module as well?

Shaun-3adesign avatar May 03 '25 09:05 Shaun-3adesign

Just tested on the master branch: same issue. This was with the patch included.

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `unit: router                                               '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00005a96708f5403 in nxt_rbtree_branch_min (node=0x4, tree=<optimized out>) at src/nxt_rbtree.h:70
70      src/nxt_rbtree.h: No such file or directory.
[Current thread is 1 (Thread 0x7e0dbffff6c0 (LWP 301))]
(gdb) bt
#0  0x00005a96708f5403 in nxt_rbtree_branch_min (node=0x4, tree=<optimized out>) at src/nxt_rbtree.h:70
#1  nxt_rbtree_delete (tree=tree@entry=0x5a9687477d30, part=part@entry=0x7e0db4001618) at src/nxt_rbtree.c:305
#2  0x00005a96708c8ae9 in nxt_timer_changes_commit (engine=0x5a9687477b00) at src/nxt_timer.c:201
#3  0x00005a96708c8e58 in nxt_timer_find (engine=engine@entry=0x5a9687477b00) at src/nxt_timer.c:241
#4  0x00005a96708c86ec in nxt_event_engine_start (engine=0x5a9687477b00) at src/nxt_event_engine.c:547
#5  0x00005a96708c6ce1 in nxt_thread_trampoline (data=0x5a96873ddec0) at src/nxt_thread.c:126
#6  0x00007e0dc906f1f5 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#7  0x00007e0dc90eeb00 in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt full
#0  0x00005a96708f5403 in nxt_rbtree_branch_min (node=0x4, tree=<optimized out>) at src/nxt_rbtree.h:70
No locals.
#1  nxt_rbtree_delete (tree=tree@entry=0x5a9687477d30, part=part@entry=0x7e0db4001618) at src/nxt_rbtree.c:305
        color = <optimized out>
        node = 0x7e0db4001618
        sentinel = 0x5a9687477d30
        subst = 0x7e0db4001618
        child = <optimized out>
#2  0x00005a96708c8ae9 in nxt_timer_changes_commit (engine=0x5a9687477b00) at src/nxt_timer.c:201
        timer = 0x7e0db4001618
        timers = 0x5a9687477d30
        ch = 0x5a96874809a0
        end = 0x5a96874809b0
        add = 0x5a96874809a0
        add_end = 0x5a96874809b0
#3  0x00005a96708c8e58 in nxt_timer_find (engine=engine@entry=0x5a9687477b00) at src/nxt_timer.c:241
        delta = <optimized out>
        time = <optimized out>
        timer = <optimized out>
        timers = 0x5a9687477d30
        tree = <optimized out>
        node = <optimized out>
        next = <optimized out>
#4  0x00005a96708c86ec in nxt_event_engine_start (engine=0x5a9687477b00) at src/nxt_event_engine.c:547
        obj = 0x7e0db4001550
        data = 0x5a9687477b00
        task = 0x0
        timeout = <optimized out>
        now = <optimized out>
        thr = <optimized out>
        handler = <optimized out>
#5  0x00005a96708c6ce1 in nxt_thread_trampoline (data=0x5a96873ddec0) at src/nxt_thread.c:126
        __cancel_buf = {__cancel_jmp_buf = {{__cancel_jmp_buf = {138597520897728, -7122281061378609134, -1376, 11, 140722096742176, 138597512507392, 
                7004202308968460306, 2883555026879328274}, __mask_was_saved = 0}}, __pad = {0x7e0dbfffea30, 0x0, 0x0, 0x0}}
        __cancel_routine = 0x5a96708c6b50 <nxt_thread_time_cleanup>
        __cancel_arg = <optimized out>
        __not_first_call = <optimized out>
        thr = <optimized out>
        link = 0x5a96873ddec0
        start = 0x5a96708d1d80 <nxt_router_thread_start>
#6  0x00007e0dc906f1f5 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#7  0x00007e0dc90eeb00 in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.

Shaun-3adesign avatar May 03 '25 11:05 Shaun-3adesign

Thanks for all the testing.

The only 3 minutes thing I can immediately see in Unit is the idle_timeout setting...

Maximum number of seconds between requests in a keep-alive connection. If no new requests arrive within this interval, Unit returns a 408 “Request Timeout” response and closes the connection.

So it looks like we are hitting this and then trying to shut down the connection (which matches the location of the crash).
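The timestamps in the log posted earlier in the thread can be checked against this. The following is a small sketch that computes the gap between the last access-log entry (13:29:38) and the router's "exited on signal 11" line (13:32:34). It assumes both logs use the same clock and time zone, which the log itself does not state, and the function name `idle_gap_seconds` is made up for this illustration.

```python
from datetime import datetime

ACCESS_FMT = "%d/%b/%Y:%H:%M:%S %z"  # e.g. 02/May/2025:13:29:38 +0000
UNIT_FMT = "%Y/%m/%d %H:%M:%S"       # e.g. 2025/05/02 13:32:34


def idle_gap_seconds(last_access, crash_time):
    """Seconds between the last logged request and the crash log line."""
    # Assumption: the Unit log line is in the same zone as the access log,
    # so the offset is dropped and the two are compared as naive times.
    last = datetime.strptime(last_access, ACCESS_FMT).replace(tzinfo=None)
    crash = datetime.strptime(crash_time, UNIT_FMT)
    return (crash - last).total_seconds()


if __name__ == "__main__":
    print(idle_gap_seconds("02/May/2025:13:29:38 +0000", "2025/05/02 13:32:34"))
```

For those two lines the gap is 176 seconds, which is roughly the three minutes reported and close to a 180-second idle timeout.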

Looks like I'll need to try and figure out how to reproduce this.

One last question for now, could you paste the headers being sent by the browser?

ac000 avatar May 03 '25 15:05 ac000

I'll see about uploading some files so you can build an image that reproduces the issue.

Though, is there currently anything we can do to stop this happening so much?

Shaun-3adesign avatar May 06 '25 12:05 Shaun-3adesign

I'll see about uploading some files so you can build an image that reproduces the issue.

Thanks!

Though, is there currently anything we can do to stop this happening so much?

Unfortunately it seems not.

Though there is one thing you could test, which might help narrow the problem down: try adding the following to the settings.http section of your Unit config

    "idle_timeout": 60

See if that makes the crashes happen after only a minute.
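For reference, the setting sits under settings, then http, at the top level of the Unit configuration. The following is a minimal sketch of where it goes, using a made-up placeholder config rather than the reporter's actual config.json. The helper name `with_idle_timeout` is hypothetical.

```python
import copy
import json


def with_idle_timeout(config, seconds):
    """Return a copy of a Unit config dict with settings.http.idle_timeout set."""
    updated = copy.deepcopy(config)
    http = updated.setdefault("settings", {}).setdefault("http", {})
    http["idle_timeout"] = seconds
    return updated


if __name__ == "__main__":
    # Placeholder config; the real listeners/applications sections are not shown here.
    base = {"listeners": {}, "applications": {}}
    print(json.dumps(with_idle_timeout(base, 60), indent=4))
```

The helper leaves any existing settings.http keys in place and only adds or overwrites idle_timeout.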

ac000 avatar May 06 '25 13:05 ac000

Yes, changing that setting does make it core dump in just over 60 seconds.

Shaun-3adesign avatar May 06 '25 13:05 Shaun-3adesign

OK, thanks for testing, so it certainly seems something related to keep-alive requests, I'm just surprised we haven't seen this before...

So in an attempt to reproduce this in the meantime...

$ telnet localhost 8000
Trying ::1...
Connected to localhost.
Escape character is '^]'.
GET / HTTP/1.1
Host: localhost:8000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:138.0) Gecko/20100101 Firefox/138.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-GB,en;q=0.5
Accept-Encoding: gzip, deflate
DNT: 1
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Priority: u=0, i

HTTP/1.1 200 OK
content-type: text/plain
Server: Unit/1.34.2
Date: Tue, 06 May 2025 15:32:44 GMT
Transfer-Encoding: chunked

b
Testing...

0

/*
 * After 3 minutes
 */

HTTP/1.1 408 Request Timeout
Server: Unit/1.34.2
Connection: close
Content-Length: 0
Date: Tue, 06 May 2025 15:35:44 GMT

Connection closed by foreign host.

So things seem to have worked as expected: after 3 minutes Unit terminated the connection with a 408.
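The same manual check can be scripted. The following is a rough sketch, assuming a plain HTTP (non-TLS) listener such as the localhost:8000 one used in the telnet session; it would not work as written against an HTTPS listener. The function names `build_keepalive_request` and `probe_idle_timeout` are made up for this illustration.

```python
import socket
import time


def build_keepalive_request(host, port, path="/"):
    """Build a minimal HTTP/1.1 GET request that asks for a keep-alive connection."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}:{port}",
        "Connection: keep-alive",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def probe_idle_timeout(host, port, wait_seconds=200.0):
    """Send one request, then idle until the server closes the connection.

    Returns (first_bytes, later_bytes, seconds_waited). If the first recv()
    does not capture the whole response, the remainder lands in later_bytes.
    """
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_keepalive_request(host, port))
        first = sock.recv(65536)
        sock.settimeout(wait_seconds)
        started = time.monotonic()
        chunks = []
        try:
            while True:
                data = sock.recv(65536)
                if not data:  # server closed the connection
                    break
                chunks.append(data)
        except socket.timeout:
            pass  # nothing happened within wait_seconds
        return first, b"".join(chunks), time.monotonic() - started


if __name__ == "__main__":
    first, later, waited = probe_idle_timeout("localhost", 8000)
    print(later.decode("latin-1", "replace"))
    print(f"connection idle for {waited:.0f}s")
```

On a healthy server the later bytes should contain the 408 response seen above. If the router crashes instead, the connection would presumably close without one.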

I'll keep prodding...

ac000 avatar May 06 '25 15:05 ac000

The attached zip file contains a project that is able to recreate the issue every time.

Extract it into a folder, then run the following:

make build-server
make build-image
make deploy-server

This does require Docker and will result in a container running on port 443. In a browser, go to https://server-ip; this should end up displaying an OpenAPI web page. Then wait 60 seconds. This also has your patch included.

unit-coredump.zip

Shaun-3adesign avatar May 07 '25 11:05 Shaun-3adesign

Thanks, let me see if I can get it going without docker...

ac000 avatar May 07 '25 15:05 ac000

Quick question, where's the actual application that's being run?

ac000 avatar May 07 '25 16:05 ac000

So make build-server runs the OpenAPI generator, which creates all the server code. Once you run it, you will see a folder called server with all that code in it. You can run this outside of Docker, though you will need the OpenAPI generator jar file:

https://repo1.maven.org/maven2/org/openapitools/openapi-generator-cli/7.0.1/openapi-generator-cli-7.0.1.jar

then run

java -jar /openapi-generator-cli-7.0.1.jar generate \
               -t .openapi-generator-server/ \
               -i openapi.yaml\
               -g python-flask \
               -o server/

Shaun-3adesign avatar May 07 '25 16:05 Shaun-3adesign

OK, thanks. A considerable number of Python packages later, I have it running!

Anything specific I need to do?

ac000 avatar May 07 '25 18:05 ac000

basically just open it in a browser

https://localhost

this should redirect to https://localhost/openapi_examples/ui/

then just wait

Shaun-3adesign avatar May 07 '25 18:05 Shaun-3adesign