freeradius-server icon indicating copy to clipboard operation
freeradius-server copied to clipboard

[defect]: ASSERT FAILED src/lib/server/trunk.c[1051]: 0: Trunk request 0 invalid transition INIT -> UNASSIGNED

Open mkhon opened this issue 2 years ago • 12 comments

What type of defect/bug is this?

Crash or memory corruption (segv, abort, etc...)

How can the issue be reproduced?

Use rlm_ldap

Log output from the FreeRADIUS daemon

The problem can not be reproduced with "radiusd -X" (only when running with multiple threads):

Mon Jul  4 13:29:48 2022: Info  : Network 0 - Starting
Mon Jul  4 13:29:48 2022: Info  : Worker 0 - Starting
Mon Jul  4 13:29:48 2022: Info  : Worker 1 - Starting
Mon Jul  4 13:29:48 2022: Error : Failed setting connection option new TLS context: Unknown error
Mon Jul  4 13:29:48 2022: Error : ldap - [2] - Connection initialisation failed
Mon Jul  4 13:29:53 2022: Info  : rlm_ldap - [3] Trunk connection changed state INIT -> CONNECTING
Mon Jul  4 13:29:53 2022: Info  : rlm_ldap (ldap) - Directory type: OpenLDAP
Mon Jul  4 13:29:53 2022: Info  : rlm_ldap (ldap) - Directory supports RFC 4533
Mon Jul  4 13:29:53 2022: Info  : Scheduler created successfully with 1 networks and 2 workers
Mon Jul  4 13:29:53 2022: Info  : post-suid-down capabilities: = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,35,36+ep
Mon Jul  4 13:29:53 2022: Info  : Ready to process requests
Mon Jul  4 13:30:01 2022: Info  : rlm_ldap - [4] Trunk connection changed state INIT -> CONNECTING
Mon Jul  4 13:30:01 2022: Info  : rlm_ldap (ldap) - Directory type: OpenLDAP
Mon Jul  4 13:30:01 2022: Info  : rlm_ldap (ldap) - Directory supports RFC 4533
Mon Jul  4 13:30:03 2022: Info  : rlm_ldap - [3] Trunk connection changed state ACTIVE -> CLOSED
Mon Jul  4 13:30:06 2022: Info  : rlm_ldap - [4] Trunk connection changed state ACTIVE -> CLOSED
Mon Jul  4 13:30:22 2022: Info  : rlm_ldap - [5] Trunk connection changed state INIT -> CONNECTING
Mon Jul  4 13:30:32 2022: Info  : rlm_ldap - [6] Trunk connection changed state INIT -> CONNECTING
Mon Jul  4 13:30:35 2022: Info  : rlm_ldap - [5] Trunk connection changed state ACTIVE -> CLOSED
Mon Jul  4 13:30:35 2022: Info  : rlm_ldap - [6] Trunk connection changed state ACTIVE -> CLOSED
ASSERT FAILED src/lib/server/trunk.c[1051]: 0: Trunk request 0 invalid transition INIT -> UNASSIGNED
CAUGHT SIGNAL: Aborted
Backtrace of last 21 frames:
/usr/lib64/freeradius/libfreeradius-util.so(fr_fault+0xe8)[0x7f528c366036]
/usr/lib64/freeradius/libfreeradius-util.so(_fr_assert_fail+0xf1)[0x7f528c366c26]
/usr/lib64/freeradius/libfreeradius-server.so(+0x918af)[0x7f528bc6e8af]
/usr/lib64/freeradius/libfreeradius-server.so(fr_trunk_request_signal_cancel+0x208)[0x7f528bc72c60]
/usr/lib64/freeradius/libfreeradius-ldap.so(+0x7d08)[0x7f5284fb5d08]
/usr/lib64/freeradius/libfreeradius-unlang.so(+0x1f580)[0x7f528b947580]
/usr/lib64/freeradius/libfreeradius-unlang.so(unlang_interpret_signal+0x151)[0x7f528b94cbf1]
/usr/lib64/freeradius/libfreeradius-io.so(+0x1ec7c)[0x7f528b711c7c]
/usr/lib64/freeradius/libfreeradius-io.so(+0x1ed64)[0x7f528b711d64]
/usr/lib64/freeradius/libfreeradius-util.so(fr_event_timer_run+0x139)[0x7f528c389d70]
/usr/lib64/freeradius/libfreeradius-util.so(fr_event_service+0x702)[0x7f528c38a745]
/usr/lib64/freeradius/libfreeradius-unlang.so(unlang_interpret_synchronous+0x228)[0x7f528b94eeb0]
/usr/lib64/freeradius/libfreeradius-ldap.so(fr_ldap_trunk_search+0x1e7)[0x7f5284fb5f50]
/usr/lib64/freeradius/rlm_ldap.so(rlm_ldap_find_user+0x324)[0x7f5285658e08]
/usr/lib64/freeradius/rlm_ldap.so(+0x8ac4)[0x7f5285652ac4]
/usr/lib64/freeradius/libfreeradius-unlang.so(+0x2b973)[0x7f528b953973]
/usr/lib64/freeradius/libfreeradius-unlang.so(unlang_interpret+0x683)[0x7f528b94a22b]
/usr/lib64/freeradius/libfreeradius-io.so(fr_worker+0x2f5)[0x7f528b7145a0]
/usr/lib64/freeradius/libfreeradius-io.so(+0x1b76f)[0x7f528b70e76f]
/lib64/libpthread.so.0(+0x7ea5)[0x7f5289bb5ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f5289467b0d]

Relevant log output from client utilities

No response

Backtrace from LLDB or GDB

No response

mkhon avatar Jul 05 '22 14:07 mkhon

Another one (similar code path and the same reproduction scenario: a lot of requests with LDAP authz coming in). But this time query->treq is NULL

src/lib/server/trunk.c:2052: Type mismatch: name[NULL] expected[fr_trunk_request_t]
(gdb) bt
#0  0x00007ffff514e4fb in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:36
#1  0x00007ffff7b082de in fr_fault (sig=6) at src/lib/util/debug.c:1051
#2  0x00007ffff7b083b8 in _fr_talloc_fault (
    reason=0x7fffd809ed80 "src/lib/server/trunk.c:2052: Type mismatch: name[NULL] expected[fr_trunk_request_t]") at src/lib/util/debug.c:1093
#3  0x00007ffff652f3d5 in talloc_abort_type_mismatch (expected=<optimized out>, name=<optimized out>, 
    location=<optimized out>) at ../../talloc.c:1584
#4  _talloc_get_type_abort (ptr=0x0, name=0x7ffff74712ed "fr_trunk_request_t", location=<optimized out>)
    at ../../talloc.c:1601
#5  0x00007ffff74139d7 in fr_trunk_request_signal_cancel (treq=0x0) at src/lib/server/trunk.c:2052
#6  0x00007fffeb341d08 in ldap_trunk_query_cancel (request=0x7fffdba93010, action=FR_SIGNAL_CANCEL, 
    uctx=0x7fffd9faf920) at src/lib/ldap/base.c:616
#7  0x00007ffff70e6205 in unlang_function_signal (request=0x7fffdba93010, frame=0x7fffdba933e8, 
    action=FR_SIGNAL_CANCEL) at src/lib/unlang/function.c:80
#8  0x00007ffff70eb876 in frame_signal (limit=0, action=FR_SIGNAL_CANCEL, request=0x7fffdba93010)
    at src/lib/unlang/interpret.c:1028
#9  unlang_interpret_signal (request=0x7fffdba93010, action=FR_SIGNAL_CANCEL)
    at src/lib/unlang/interpret.c:1082
#10 0x00007ffff6eb0c7c in worker_stop_request (request_p=0x7fffe780b5c8) at src/lib/io/worker.c:383
#11 0x00007ffff6eb0d64 in worker_max_request_time (el=0x7fffd8004c80, when=..., uctx=0x7fffd800d500)
    at src/lib/io/worker.c:419
#12 0x00007ffff7b2bfc2 in fr_event_timer_run (el=0x7fffd8004c80, when=0x7fffe780b670)
    at src/lib/util/event.c:2146
#13 0x00007ffff7b2c997 in fr_event_service (el=0x7fffd8004c80) at src/lib/util/event.c:2473
#14 0x00007ffff70edb35 in unlang_interpret_synchronous (el=0x7fffd8004c80, request=0x7fffdba93010)
    at src/lib/unlang/interpret_synchronous.c:236
#15 0x00007fffeb341f50 in fr_ldap_trunk_search (p_result=0x7fffe780c9c4, ctx=0x7fffdba96210, 
    out=0x7fffe780b888, request=0x7fffdba93010, ttrunk=0x7fffd802e460, base_dn=0x1566c30 "dc=viasat,dc=com", 
    scope=2, filter=0x7fffe780bcb0 "(cn=1,6,00:a0:bc:6c:7d:3a)", attrs=0x7fffe780c578, 
    serverctrls=0x7fffe780b890, clientctrls=0x0, is_async=false) at src/lib/ldap/base.c:723
#16 0x00007fffeb9e4eb4 in rlm_ldap_find_user (inst=0x120daf0, request=0x7fffdba93010, ttrunk=0x7fffd802e460, 
    attrs=0x7fffe780c578, force=true, result=0x7fffe780c9b0, handle=0x7fffe780c9b8, rcode=0x7fffe780c9c4)
    at src/modules/rlm_ldap/user.c:118
#17 0x00007fffeb9deb70 in mod_authorize (p_result=0x7fffdba96228, mctx=0x7fffe780ca50, 
    request=0x7fffdba93010) at src/modules/rlm_ldap/rlm_ldap.c:1166
#18 0x00007ffff70f25f8 in unlang_module (p_result=0x7fffdba931dc, request=0x7fffdba93010, 
    frame=0x7fffdba933a8) at src/lib/unlang/module.c:924
#19 0x00007ffff70e8eb0 in frame_eval (priority=0x7fffdba931d8, result=0x7fffdba931dc, frame=0x7fffdba933a8, 
    request=0x7fffdba93010) at src/lib/unlang/interpret.c:500
#20 unlang_interpret (request=0x7fffdba93010) at src/lib/unlang/interpret.c:684
#21 0x00007ffff6eb35a0 in worker_run_request (start=..., worker=0x7fffd800d500) at src/lib/io/worker.c:1184
#22 fr_worker (worker=0x7fffd800d500) at src/lib/io/worker.c:1369
#23 0x00007ffff6ead76f in fr_schedule_worker_thread (arg=0x1672320) at src/lib/io/schedule.c:242
#24 0x00007ffff5146ea5 in start_thread (arg=0x7fffe7839700) at pthread_create.c:307
#25 0x00007ffff49f8b0d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) frame 6
#6  0x00007fffeb341d08 in ldap_trunk_query_cancel (request=0x7fffdba93010, action=FR_SIGNAL_CANCEL, 
    uctx=0x7fffd9faf920) at src/lib/ldap/base.c:616
616		fr_trunk_request_signal_cancel(query->treq);
(gdb) print *query
$1 = {node = {left = 0x7ffff7dd3040 <sentinel>, right = 0x7ffff7dd3040 <sentinel>, 
    parent = 0x7ffff7dd3040 <sentinel>, data = 0x7fffd9faf920, colour = BLACK, being_freed = false}, 
  ldap_url = 0x0, dn = 0x1566c30 "dc=viasat,dc=com", {search = {attrs = 0x7fffe780c578, scope = 2, 
      filter = 0x7fffe780bcb0 "(cn=1,6,00:a0:bc:6c:7d:3a)"}, mods = 0x7fffe780c578}, 
  type = LDAP_REQUEST_SEARCH, serverctrls = {{control = 0x0, freeit = false}, {control = 0x0, 
      freeit = false}, {control = 0x0, freeit = false}, {control = 0x0, freeit = false}, {control = 0x0, 
      freeit = false}, {control = 0x0, freeit = false}, {control = 0x0, freeit = false}, {control = 0x0, 
      freeit = false}, {control = 0x0, freeit = false}, {control = 0x0, freeit = false}}, clientctrls = {{
      control = 0x0, freeit = false}, {control = 0x0, freeit = false}, {control = 0x0, freeit = false}, {
      control = 0x0, freeit = false}, {control = 0x0, freeit = false}, {control = 0x0, freeit = false}, {
      control = 0x0, freeit = false}, {control = 0x0, freeit = false}, {control = 0x0, freeit = false}, {
      control = 0x0, freeit = false}}, msgid = 147, treq = 0x0, ldap_conn = 0x7fffd9e6e5b0, ev = 0x0, 
  referral_urls = 0x0, referrals = {offset = 0, type = 0x0, entry = {prev = 0x0, next = 0x0}, 
    num_elements = 0}, referral_depth = 0, referral = 0x0, parser = 0x0, result = 0x7fffd1210c10, 
  ret = LDAP_RESULT_SUCCESS}
(gdb) print *query->ldap_conn
$2 = {handle = 0x7fffdbf84f50, serverctrls = {{control = 0x0, freeit = false} <repeats 11 times>}, 
  clientctrls = {{control = 0x0, freeit = false} <repeats 11 times>}, serverctrls_cnt = 0, 
  clientctrls_cnt = 0, directory = 0x0, config = 0x7fffd802e498, conn = 0x7fffd19cc570, 
  state = FR_LDAP_STATE_RUN, fd = 8, queries = 0x7fffd3235cb0, uctx = 0x0}
(gdb) 

mkhon avatar Jul 19 '22 10:07 mkhon

Another one crash:

ASSERT FAILED /home/ubuntu/work/viasat/freeradius-server/src/lib/unlang/interpret_synchronous.c[65]: request->parent || (request->log.unlang_indent == 0): Request 15635 bad log indentation - expected 0 got 10
CAUGHT SIGNAL: Aborted
Backtrace of last 18 frames:
/usr/lib64/freeradius/libfreeradius-util.so(fr_fault+0x1ba)[0x7f854dafc90a]
/usr/lib64/freeradius/libfreeradius-util.so(_fr_assert_fail+0xf4)[0x7f854dafcbc4]
/usr/lib64/freeradius/libfreeradius-unlang.so(+0x26284)[0x7f854d0f6284]
/usr/lib64/freeradius/libfreeradius-unlang.so(unlang_interpret_request_done+0x98)[0x7f854d0f46db]
/usr/lib64/freeradius/libfreeradius-unlang.so(unlang_interpret_signal+0x263)[0x7f854d0f4988]
/usr/lib64/freeradius/libfreeradius-io.so(+0x19365)[0x7f854ceb6365]
/usr/lib64/freeradius/libfreeradius-util.so(fr_event_timer_run+0x7f)[0x7f854db1934f]
/usr/lib64/freeradius/libfreeradius-util.so(fr_event_service+0x504)[0x7f854db19bd4]
/usr/lib64/freeradius/libfreeradius-unlang.so(unlang_interpret_synchronous+0x228)[0x7f854d0f6b35]
/usr/lib64/freeradius/libfreeradius-ldap.so(fr_ldap_trunk_search+0x14b)[0x7f8545553e9b]
/usr/lib64/freeradius/rlm_ldap.so(rlm_ldap_find_user+0x188)[0x7f8545bf02c8]
/usr/lib64/freeradius/rlm_ldap.so(+0x77f3)[0x7f8545beb7f3]
/usr/lib64/freeradius/libfreeradius-unlang.so(+0x2b5f8)[0x7f854d0fb5f8]
/usr/lib64/freeradius/libfreeradius-unlang.so(unlang_interpret+0x683)[0x7f854d0f1eb0]
/usr/lib64/freeradius/libfreeradius-io.so(fr_worker+0x18d)[0x7f854ceb7bfd]
/usr/lib64/freeradius/libfreeradius-io.so(+0x1563f)[0x7f854ceb263f]
/lib64/libpthread.so.0(+0x7ea5)[0x7f854b151ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f854aa03b0d]
No panic action set

mkhon avatar Jul 19 '22 12:07 mkhon

For the request->log.unlang_indent assertion, the short-term fix is to just delete it. It's a sanity check on indentation for the current request. It doesn't affect how the server operates once the request is freed.

However, it should be fixed. We'd need to know what modules are being run in that section. All of the current code checks for indentation levels being correct, and all of the tests pass. So it's difficult to tell what's different about your system.

alandekok avatar Jul 19 '22 13:07 alandekok

@alandekok I think it's just looking at the free'd memory and the root cause of all ASSERT'ions here is the same. I added more logging and it seems that in case of rlm_ldap_find_user() we are using two request cancellation handlers: _ldap_search_sync_timeout() and ldap_trunk_query_cancel():

Info  : _ldap_search_sync_timeout - query 0x7fffdd5ac130, treq = 0x7fffdc328390
Info  : rlm_ldap - [10] Trunk connection changed state ACTIVE -> CLOSED
Info  : rlm_ldap - [9] Trunk connection changed state ACTIVE -> CLOSED
Info  : rlm_ldap - [11] Trunk connection changed state INIT -> CONNECTING
Info  : rlm_ldap - [11] Trunk connection changed state ACTIVE -> CLOSED
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Info  : ldap_trunk_query_cancel - query 0x7fffdd5ac130, treq = 0x7fffdc328390, action = 2
Bad talloc magic value - unknown value

talloc abort: Bad talloc magic value - unknown value
CAUGHT SIGNAL: Aborted
Backtrace of last 19 frames:
/usr/lib64/freeradius/libfreeradius-util.so(fr_fault+0x1ba)[0x7ffff7b1f90a]
/usr/lib64/freeradius/libfreeradius-util.so(+0x27acf)[0x7ffff7b1facf]
/lib64/libtalloc.so.2(_talloc_get_type_abort+0xd0)[0x7ffff655d420]
/usr/lib64/freeradius/libfreeradius-server.so(fr_trunk_request_signal_cancel+0x1e)[0x7ffff7432f9e]
/usr/lib64/freeradius/libfreeradius-unlang.so(+0x1f205)[0x7ffff7112205]
/usr/lib64/freeradius/libfreeradius-unlang.so(unlang_interpret_signal+0x151)[0x7ffff7117876]
/usr/lib64/freeradius/libfreeradius-io.so(+0x19365)[0x7ffff6ed9365]
/usr/lib64/freeradius/libfreeradius-util.so(fr_event_timer_run+0x7f)[0x7ffff7b3c34f]
/usr/lib64/freeradius/libfreeradius-util.so(fr_event_service+0x504)[0x7ffff7b3cbd4]
/usr/lib64/freeradius/libfreeradius-unlang.so(unlang_interpret_synchronous+0x228)[0x7ffff7119b35]
/usr/lib64/freeradius/libfreeradius-ldap.so(fr_ldap_trunk_search+0x14b)[0x7fffeb54ce9b]
/usr/lib64/freeradius/rlm_ldap.so(rlm_ldap_find_user+0x188)[0x7fffebbe92c8]
/usr/lib64/freeradius/rlm_ldap.so(+0x77f3)[0x7fffebbe47f3]
/usr/lib64/freeradius/libfreeradius-unlang.so(+0x2b5f8)[0x7ffff711e5f8]
/usr/lib64/freeradius/libfreeradius-unlang.so(unlang_interpret+0x683)[0x7ffff7114eb0]
/usr/lib64/freeradius/libfreeradius-io.so(fr_worker+0x18d)[0x7ffff6edabfd]
/usr/lib64/freeradius/libfreeradius-io.so(+0x1563f)[0x7ffff6ed563f]
/lib64/libpthread.so.0(+0x7ea5)[0x7ffff5174ea5]
/lib64/libc.so.6(clone+0x6d)[0x7ffff4a26b0d]
No panic action set

I think we should not use the latter (ldap_trunk_query_cancel) if request is not async:

	action = unlang_function_push(request, is_async ? NULL : ldap_trunk_query_start, ldap_trunk_query_results,
				      ldap_trunk_query_cancel, is_async ? UNLANG_SUB_FRAME : UNLANG_TOP_FRAME, query);

	if (action == UNLANG_ACTION_FAIL) goto error;

	*out = query;

	/*
	 *	Hack until everything is async
	 */
	if (!is_async) {
		fr_event_timer_t const *ev = NULL;

		fr_time_delta_t timeout = ttrunk->config.res_timeout;

		/*
		 *	Add an event that'll send a cancellation request
		 *	to the request.
		 */
		if (fr_time_delta_ispos(timeout)) {
			if (fr_event_timer_in(ctx, unlang_interpret_event_list(request), &ev, timeout,
					      _ldap_search_sync_timeout, query) < 0) goto error;
		}

mkhon avatar Jul 19 '22 13:07 mkhon

For the request->log.unlang_indent assertion, the short-term fix is to just delete it. It's a sanity check on indentation for the current request. It doesn't affect how the server operates once the request is freed.

However, it should be fixed. We'd need to know what modules are being run in that section. All of the current code checks for indentation levels being correct, and all of the tests pass. So it's difficult to tell what's different about your system.

@alandekok I realized this one is a different one (not related to ldap timeouts) - I filed a separate issue about that: https://github.com/FreeRADIUS/freeradius-server/issues/4622

What information do you need about modules and section? I can share recv handler unlang source

mkhon avatar Jul 19 '22 17:07 mkhon

I believe I found the reason for these use-after-free:

in fr_ldap_trunk_search() and fr_ldap_trunk_modify() fr_ldap_query_t is allocated with ctx = unlang_interpret_frame_talloc_ctx(request) (frame context).

Then ldap_trunk_query_cancel() can be called in this context:

#0  ldap_trunk_query_cancel (request=0x7fffcbeabb50, action=FR_SIGNAL_CANCEL, uctx=0x7fffd9f1a0a0)
    at /home/ubuntu/work/viasat/freeradius-server/src/lib/ldap/base.c:612
#1  0x00007ffff70f31e6 in frame_signal (limit=0, action=FR_SIGNAL_CANCEL, request=0x7fffcbeabb50)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/unlang/interpret.c:1028
#2  unlang_interpret_signal (request=request@entry=0x7fffcbeabb50, action=action@entry=FR_SIGNAL_CANCEL)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/unlang/interpret.c:1082
#3  0x00007ffff6eba365 in worker_stop_request (request_p=<synthetic pointer>)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/io/worker.c:419
#4  worker_max_request_time (el=<optimized out>, when=..., uctx=<optimized out>)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/io/worker.c:419
#5  0x00007ffff7b27865 in fr_event_timer_run (el=0x7fffd8004c80, when=0x7fffe760da50)
    at /home/ubuntu/work/viasat/freeradius-server/src/lib/util/event.c:2160
#6  0x00007ffff7b2823a in fr_event_service (el=0x7fffd8004c80)
    at /home/ubuntu/work/viasat/freeradius-server/src/lib/util/event.c:2487
#7  0x00007ffff70f40b6 in unlang_interpret_synchronous (el=0x7fffd8004c80, request=0x7fffcbeabb50)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/unlang/interpret_synchronous.c:236
#8  0x00007fffeb34306b in fr_ldap_trunk_search (p_result=0x7fffe760e5dc, ctx=0x7fffcbeaed50,
    out=0x7fffe760dd18, request=0x7fffcbeabb50, ttrunk=0x7fffd802e460, base_dn=0x15759d0 "dc=viasat,dc=com",
    scope=2, filter=0x7fffe760dd30 "(cn=1,6,00:a0:bc:6c:7d:3a)", attrs=0x7fffe760e608,
    serverctrls=0x7fffe760dd20, clientctrls=0x0, is_async=false)
    at /home/ubuntu/work/viasat/freeradius-server/src/lib/ldap/base.c:729
#9  0x00007fffeb9e42c8 in rlm_ldap_find_user (inst=inst@entry=0x1211790,
    request=request@entry=0x7fffcbeabb50, ttrunk=ttrunk@entry=0x7fffd802e460,
    attrs=attrs@entry=0x7fffe760e608, force=force@entry=true, result=result@entry=0x7fffe760e5e8,
    handle=0x7fffe760e5e0, rcode=0x7fffe760e5dc)
    at /usr/src/debug/freeradius-server-4.0.0/src/modules/rlm_ldap/user.c:118
#10 0x00007fffeb9df7f3 in mod_authorize (p_result=0x7fffcbeaed68, mctx=0x7fffe760eeb0,
    request=0x7fffcbeabb50) at /usr/src/debug/freeradius-server-4.0.0/src/modules/rlm_ldap/rlm_ldap.c:1166
#11 0x00007ffff70f6fd1 in unlang_module (p_result=0x7fffcbeabd1c, request=0x7fffcbeabb50,
    frame=0x7fffcbeabee8) at /usr/src/debug/freeradius-server-4.0.0/src/lib/unlang/module.c:924
#12 0x00007ffff70e1169 in frame_eval (priority=0x7fffcbeabd18, result=0x7fffcbeabd1c, frame=0x7fffcbeabee8,
    request=0x7fffcbeabb50) at /usr/src/debug/freeradius-server-4.0.0/src/lib/unlang/interpret.c:500

frame_signal() then will call frame_cleanup() for all the frames in stack and this will deallocate fr_ldap_query_t.

However ldap_trunk_request_mux() and ldap_request_cancel_mux() can be called after this point.

ldap_trunk_request_mux() crash:

Info  : rlm_ldap - [82] Trunk connection changed state ACTIVE -> CLOSED
Info  : rlm_ldap - [83] Trunk connection changed state INIT -> CONNECTING
Bad talloc magic value - unknown value

talloc abort: Bad talloc magic value - unknown value
CAUGHT SIGNAL: Aborted
Backtrace of last 17 frames:
/usr/lib64/freeradius/libfreeradius-util.so(fr_fault+0x1ba)[0x7ffff7b1ebfa]
/usr/lib64/freeradius/libfreeradius-util.so(+0x27dbf)[0x7ffff7b1edbf]
/lib64/libtalloc.so.2(_talloc_get_type_abort+0xd0)[0x7ffff6573420]
/usr/lib64/freeradius/libfreeradius-ldap.so(+0xbf12)[0x7fffeb54bf12]
/usr/lib64/freeradius/libfreeradius-server.so(fr_trunk_connection_signal_writable+0x121)[0x7ffff742b151]
/usr/lib64/freeradius/libfreeradius-ldap.so(+0xbc48)[0x7fffeb54bc48]
/usr/lib64/freeradius/libfreeradius-util.so(fr_event_service+0x40e)[0x7ffff7b3c1ce]
/usr/lib64/freeradius/libfreeradius-unlang.so(unlang_interpret_synchronous+0x276)[0x7ffff71290b6]
/usr/lib64/freeradius/libfreeradius-ldap.so(fr_ldap_trunk_search+0x1ed)[0x7fffeb547ff1]
/usr/lib64/freeradius/rlm_ldap.so(rlm_ldap_find_user+0x188)[0x7fffebbe92c8]
/usr/lib64/freeradius/rlm_ldap.so(+0x77f3)[0x7fffebbe47f3]
/usr/lib64/freeradius/libfreeradius-unlang.so(+0x22fd1)[0x7ffff712bfd1]
/usr/lib64/freeradius/libfreeradius-unlang.so(unlang_interpret+0x459)[0x7ffff7116169]
/usr/lib64/freeradius/libfreeradius-io.so(fr_worker+0x18d)[0x7ffff6ef0bfd]
/usr/lib64/freeradius/libfreeradius-io.so(+0x1563f)[0x7ffff6eeb63f]
/lib64/libpthread.so.0(+0x7ea5)[0x7ffff518aea5]
/lib64/libc.so.6(clone+0x6d)[0x7ffff4a3cb0d]
No panic action set

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffe7841700 (LWP 25709)]
0x00007ffff51924fb in raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:36
36        return INLINE_SYSCALL (tgkill, 3, pid, THREAD_GETMEM (THREAD_SELF, tid),
Missing separate debuginfos, use: debuginfo-install hiredis-0.12.1-2.el7.x86_64 nwkrad-openssl-libs-1.1.1k-1.el7.x86_64
(gdb) bt
#0  0x00007ffff51924fb in raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:36
#1  0x00007ffff7b1ebcb in fr_fault (sig=sig@entry=6)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/util/debug.c:1054
#2  0x00007ffff7b1edbf in _fr_talloc_fault (reason=<optimized out>)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/util/debug.c:1096
#3  0x00007ffff6573420 in talloc_abort_unknown_value () at ../../talloc.c:513
#4  talloc_chunk_from_ptr (ptr=0x7fffdae082c0) at ../../talloc.c:529
#5  __talloc_get_name (ptr=0x7fffdae082c0) at ../../talloc.c:1539
#6  _talloc_get_type_abort (ptr=0x7fffdae082c0, name=0x7fffeb55fc51 "fr_ldap_query_t",
    location=0x7fffeb5601a8 "/home/ubuntu/work/viasat/freeradius-server/src/lib/ldap/connection.c:646")
    at ../../talloc.c:1596
#7  0x00007fffeb54bf12 in ldap_trunk_request_mux (el=0x7fffd8004c80, tconn=0x7fffdb2819e0,
    conn=0x7fffc9cfb040, uctx=0x7fffd802e460)
    at /home/ubuntu/work/viasat/freeradius-server/src/lib/ldap/connection.c:646
#8  0x00007ffff742b151 in trunk_connection_writable (tconn=0x7fffdb2819e0)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/server/trunk.c:2851
#9  fr_trunk_connection_signal_writable (tconn=0x7fffdb2819e0)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/server/trunk.c:3770
#10 0x00007fffeb54bc48 in ldap_conn_writable (el=0x7fffd8004c80, fd=7, flags=1, uctx=0x7fffdb2819e0)
    at /home/ubuntu/work/viasat/freeradius-server/src/lib/ldap/connection.c:529
#11 0x00007ffff7b3c1ce in fr_event_service (el=0x7fffd8004c80)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/util/event.c:2432
#12 fr_event_service (el=el@entry=0x7fffd8004c80)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/util/event.c:2285
#13 0x00007ffff71290b6 in unlang_interpret_synchronous (el=0x7fffd8004c80, request=0x7fffdae12800)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/unlang/interpret_synchronous.c:236
#14 0x00007fffeb547ff1 in fr_ldap_trunk_search (p_result=0x7fffe78141dc, ctx=0x7fffdae15a00,
    out=0x7fffe7813918, request=0x7fffdae12800, ttrunk=0x7fffd802e460, base_dn=0x158a010 "dc=viasat,dc=com",
    scope=2, filter=0x7fffe7813930 "(cn=1,6,00:a0:bc:6c:7d:3a)", attrs=0x7fffe7814208,
    serverctrls=0x7fffe7813920, clientctrls=0x0, is_async=false)
    at /home/ubuntu/work/viasat/freeradius-server/src/lib/ldap/base.c:728
#15 0x00007fffebbe92c8 in rlm_ldap_find_user (inst=inst@entry=0x121f4d0,
    request=request@entry=0x7fffdae12800, ttrunk=ttrunk@entry=0x7fffd802e460,
    attrs=attrs@entry=0x7fffe7814208, force=force@entry=true, result=result@entry=0x7fffe78141e8,
    handle=0x7fffe78141e0, rcode=0x7fffe78141dc)
    at /usr/src/debug/freeradius-server-4.0.0/src/modules/rlm_ldap/user.c:118
#16 0x00007fffebbe47f3 in mod_authorize (p_result=0x7fffdae15a18, mctx=0x7fffe7814ab0,
    request=0x7fffdae12800) at /usr/src/debug/freeradius-server-4.0.0/src/modules/rlm_ldap/rlm_ldap.c:1166
#17 0x00007ffff712bfd1 in unlang_module (p_result=0x7fffdae129cc, request=0x7fffdae12800,
    frame=0x7fffdae12b98) at /usr/src/debug/freeradius-server-4.0.0/src/lib/unlang/module.c:924
#18 0x00007ffff7116169 in frame_eval (priority=0x7fffdae129c8, result=0x7fffdae129cc, frame=0x7fffdae12b98,
    request=0x7fffdae12800) at /usr/src/debug/freeradius-server-4.0.0/src/lib/unlang/interpret.c:500
#19 unlang_interpret (request=request@entry=0x7fffdae12800)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/unlang/interpret.c:684
#20 0x00007ffff6ef0bfd in worker_run_request (start=..., worker=0x7fffd800d500)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/io/worker.c:1184
#21 fr_worker (worker=0x7fffd800d500) at /usr/src/debug/freeradius-server-4.0.0/src/lib/io/worker.c:1369
#22 0x00007ffff6eeb63f in fr_schedule_worker_thread (arg=<optimized out>)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/io/schedule.c:242
#23 0x00007ffff518aea5 in start_thread (arg=0x7fffe7841700) at pthread_create.c:307
#24 0x00007ffff4a3cb0d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb)

ldap_request_cancel_mux() crash:

Bad talloc magic value - unknown value

talloc abort: Bad talloc magic value - unknown value
CAUGHT SIGNAL: Aborted
Backtrace of last 17 frames:
/usr/lib64/freeradius/libfreeradius-util.so(fr_fault+0x1ba)[0x7ffff7b1ebfa]
/usr/lib64/freeradius/libfreeradius-util.so(+0x27dbf)[0x7ffff7b1edbf]
/lib64/libtalloc.so.2(_talloc_get_type_abort+0xd0)[0x7ffff6573420]
/usr/lib64/freeradius/libfreeradius-ldap.so(+0xbba1)[0x7fffeb54cba1]
/usr/lib64/freeradius/libfreeradius-server.so(fr_trunk_connection_signal_writable+0xc5)[0x7ffff742b0f5]
/usr/lib64/freeradius/libfreeradius-ldap.so(+0xbc84)[0x7fffeb54cc84]
/usr/lib64/freeradius/libfreeradius-util.so(fr_event_service+0x40e)[0x7ffff7b3c1ce]
/usr/lib64/freeradius/libfreeradius-unlang.so(unlang_interpret_synchronous+0x276)[0x7ffff71290b6]
/usr/lib64/freeradius/libfreeradius-ldap.so(fr_ldap_trunk_search+0x1dd)[0x7fffeb549004]
/usr/lib64/freeradius/rlm_ldap.so(rlm_ldap_find_user+0x188)[0x7fffebbe92c8]
/usr/lib64/freeradius/rlm_ldap.so(+0x77f3)[0x7fffebbe47f3]
/usr/lib64/freeradius/libfreeradius-unlang.so(+0x22fd1)[0x7ffff712bfd1]
/usr/lib64/freeradius/libfreeradius-unlang.so(unlang_interpret+0x459)[0x7ffff7116169]
/usr/lib64/freeradius/libfreeradius-io.so(fr_worker+0x18d)[0x7ffff6ef0bfd]
/usr/lib64/freeradius/libfreeradius-io.so(+0x1563f)[0x7ffff6eeb63f]
/lib64/libpthread.so.0(+0x7ea5)[0x7ffff517aea5]
/lib64/libc.so.6(clone+0x6d)[0x7ffff4a2cb0d]
No panic action set

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffe7041700 (LWP 17182)]
0x00007ffff51824fb in raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:36
36        return INLINE_SYSCALL (tgkill, 3, pid, THREAD_GETMEM (THREAD_SELF, tid),
Missing separate debuginfos, use: debuginfo-install hiredis-0.12.1-2.el7.x86_64 nwkrad-openssl-libs-1.1.1k-1.el7.x86_64
(gdb) bt
#0  0x00007ffff51824fb in raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:36
#1  0x00007ffff7b1ebcb in fr_fault (sig=sig@entry=6)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/util/debug.c:1054
#2  0x00007ffff7b1edbf in _fr_talloc_fault (reason=<optimized out>)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/util/debug.c:1096
#3  0x00007ffff6573420 in talloc_abort_unknown_value () at ../../talloc.c:513
#4  talloc_chunk_from_ptr (ptr=0x7fffd2fa48b0) at ../../talloc.c:529
#5  __talloc_get_name (ptr=0x7fffd2fa48b0) at ../../talloc.c:1539
#6  _talloc_get_type_abort (ptr=0x7fffd2fa48b0, name=0x7fffeb560bd1 "fr_ldap_query_t",
    location=0x7fffeb560e58 "/home/ubuntu/work/viasat/freeradius-server/src/lib/ldap/connection.c:486")
    at ../../talloc.c:1596
#7  0x00007fffeb54cba1 in ldap_request_cancel_mux (el=0x7fffdc004c80, tconn=0x7fffdea7c5e0,
    conn=0x7fffd2645320, uctx=0x7fffdc02e460)
    at /home/ubuntu/work/viasat/freeradius-server/src/lib/ldap/connection.c:486
#8  0x00007ffff742b0f5 in trunk_connection_writable (tconn=0x7fffdea7c5e0)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/server/trunk.c:2846
#9  fr_trunk_connection_signal_writable (tconn=0x7fffdea7c5e0)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/server/trunk.c:3770
#10 0x00007fffeb54cc84 in ldap_conn_writable (el=0x7fffdc004c80, fd=48, flags=1, uctx=0x7fffdea7c5e0)
    at /home/ubuntu/work/viasat/freeradius-server/src/lib/ldap/connection.c:529
#11 0x00007ffff7b3c1ce in fr_event_service (el=0x7fffdc004c80)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/util/event.c:2432
#12 fr_event_service (el=el@entry=0x7fffdc004c80)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/util/event.c:2285
#13 0x00007ffff71290b6 in unlang_interpret_synchronous (el=0x7fffdc004c80, request=0x7fffde90a840)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/unlang/interpret_synchronous.c:236
#14 0x00007fffeb549004 in fr_ldap_trunk_search (p_result=0x7fffe70135dc, ctx=0x7fffde90da40,
    out=0x7fffe7012d18, request=0x7fffde90a840, ttrunk=0x7fffdc02e460, base_dn=0x1589fe0 "dc=viasat,dc=com",
    scope=2, filter=0x7fffe7012d30 "(cn=1,6,00:a0:bc:6c:7d:3a)", attrs=0x7fffe7013608,
    serverctrls=0x7fffe7012d20, clientctrls=0x0, is_async=false)
    at /home/ubuntu/work/viasat/freeradius-server/src/lib/ldap/base.c:727
#15 0x00007fffebbe92c8 in rlm_ldap_find_user (inst=inst@entry=0x121f4a0,
    request=request@entry=0x7fffde90a840, ttrunk=ttrunk@entry=0x7fffdc02e460,
    attrs=attrs@entry=0x7fffe7013608, force=force@entry=true, result=result@entry=0x7fffe70135e8,
    handle=0x7fffe70135e0, rcode=0x7fffe70135dc)
    at /usr/src/debug/freeradius-server-4.0.0/src/modules/rlm_ldap/user.c:118
#16 0x00007fffebbe47f3 in mod_authorize (p_result=0x7fffde90da58, mctx=0x7fffe7013eb0,
    request=0x7fffde90a840) at /usr/src/debug/freeradius-server-4.0.0/src/modules/rlm_ldap/rlm_ldap.c:1166
#17 0x00007ffff712bfd1 in unlang_module (p_result=0x7fffde90aa0c, request=0x7fffde90a840,
    frame=0x7fffde90abd8) at /usr/src/debug/freeradius-server-4.0.0/src/lib/unlang/module.c:924
#18 0x00007ffff7116169 in frame_eval (priority=0x7fffde90aa08, result=0x7fffde90aa0c, frame=0x7fffde90abd8,
    request=0x7fffde90a840) at /usr/src/debug/freeradius-server-4.0.0/src/lib/unlang/interpret.c:500
#19 unlang_interpret (request=request@entry=0x7fffde90a840)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/unlang/interpret.c:684
#20 0x00007ffff6ef0bfd in worker_run_request (start=..., worker=0x7fffdc00d500)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/io/worker.c:1184
#21 fr_worker (worker=0x7fffdc00d500) at /usr/src/debug/freeradius-server-4.0.0/src/lib/io/worker.c:1369
#22 0x00007ffff6eeb63f in fr_schedule_worker_thread (arg=<optimized out>)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/io/schedule.c:242
#23 0x00007ffff517aea5 in start_thread (arg=0x7fffe7041700) at pthread_create.c:307
#24 0x00007ffff4a2cb0d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb)

mkhon avatar Jul 22 '22 10:07 mkhon

Unfortunately, we can't execute the cancellation synchronously because we're waiting on the write I/O handler for the socket to be called.

Reparenting the fr_ldap_query_t is probably a bad idea, because although there probably won't be the performance penalties you'd usually have with reparenting (It's O(N) where N is the number of chunk siblings), we would leave a big chunk of the talloc pool allocated to the unlang stack reserved. As the request_ts go back in a free list, the next network request to use that request_t would start allocating memory in the pool past the point where the ldap_query_t was allocated from. We wouldn't leak memory, but it'd significantly bloat it, and if we're already experiencing cancellations we're probably already dealing with thousands of requests in a semi-stuck state.

I think the best thing to do here, is create a common structure with just the msgid, and stick it at the start of the ldap_query_t. When we need to cancel a request, we allocate one of those structures in the context of the treq, and assign it as a new preq. Cancellation code then just casts the preq to the common structure and everyone's happy. The talloc_get_type_abort call would need to be swapped out to just an assert checking the chunk name.

An alternative would just be to dup the entire fr_ldap_query_t but that seems wasteful...

We'd also need to immediately remove the fr_ldap_query_t from ldap_conn->queries, not sure how that's done at the moment.

@ndptech what do you think of the proposed fix?

arr2036 avatar Jul 22 '22 10:07 arr2036

@arr2036 @ndptech Re: initial bug report

There are several issues in lib/ldap:

  1. two cancellation handlers in fr_ldap_trunk_search(): why do we need to set _ldap_search_sync_timeout() handler at all? This was added in 2a47fc01ba326431171eb1e53d64d0c009facf44. fr_ldap_trunk_modify() does not have this code btw.

What I see (with added logging) is that first _ldap_search_sync_timeout() is called (e.g. for treq->state == FR_TRUNK_REQUEST_STATE_BACKLOG), which calls fr_trunk_request_signal_cancel() which calls fr_trunk_request_free() which overwrites *treq (sets treq->pub.state to FR_TRUNK_REQUEST_STATE_INIT and treq->preq to NULL).

Then worker_max_request_time() is fired which calls ldap_trunk_query_cancel() which calls fr_trunk_request_signal_cancel() again which dies with "invalid transition INIT -> UNASSIGNED".

If I patch this by not using second cancellation handler (e.g. not setting _ldap_search_sync_timeout()) I get query->treq is NULL issue as described above.

  1. Looks like query->treq == NULL issue is caused by a race condition: ldap_trunk_request_demux() may reset query->treq to NULL (I assume it has executed because query->ret == LDAP_RESULT_SUCCESS)

I think we should cancel timeout events (how do I do this?) if the query was cancelled or processed successfully (this should also help with issue 1).

For now I patched this just by exiting ldap_trunk_query_cancel() if query->treq == NULL. However with this patch I get use-after-free issue as described above

  1. Use-after-free for fr_ldap_query_t: this happens when query->treq->state == FR_TRUNK_REQUEST_STATE_SENT when ldap_trunk_query_cancel() is called. In this case fr_trunk_request_signal_cancel() does not deallocate fr_trunk_request_t but the query gets deallocated when frame_cleanup() cleans up the stack frames (fr_ldap_query_t is allocated in a context of a stack frame), so then when ldap_request_cancel_mux() is called, treq->preq points to freed memory.

If I patch this one by using fr_trunk talloc context (the same lifetime as fr_trunk_request_t) I still get use-after-free in ldap_request_cancel_mux()

  1. This use-after-free in ldap_request_cancel_mux() occurs because treq is entered FR_TRUNK_REQUEST_STATE_CANCEL in ldap_trunk_query_cancel() and placed to tconn->cancel. Then ldap_request_cancel_mux() gets this treq from tconn->cancel and calls fr_trunk_request_signal_cancel_complete(). But fr_trunk_request_signal_cancel_complete() calls trunk_request_remove_from_conn() (via trunk_request_enter_cancel_complete()) only for treq->state == FR_TRUNK_REQUEST_STATE_CANCEL_SENT. So this treq is not removed from tconn->cancel and fr_trunk_connection_pop_cancellation() returns the same fr_trunk_request_t on the next while iteration in ldap_request_cancel_mux() but its treq->preq is already free'd on previous iteration.

mkhon avatar Jul 22 '22 14:07 mkhon

I should note that with the patches from #4628 I no longer see the crashes and aborts. But after several million requests with LDAP authorization all the workers got stuck and do not process incoming requests.

I do not have analysis for this yet. Some observations are:

...
Info  : rlm_ldap - [377] Trunk connection changed state ACTIVE -> CLOSED
Info  : rlm_ldap - [379] Trunk connection changed state INIT -> CONNECTING
Info  : global - ldap_trunk_query_cancel: query 0x7fff1095d050, treq = 0x7fffdd8cfe30, action = 2
Info  : global - ldap_trunk_query_cancel: query 0x7fff1095d050, treq = 0x7fffdd8cfe30, treq->state = 16
Info  : rlm_ldap - [378] Trunk connection changed state ACTIVE -> CLOSED
Info  : rlm_ldap - [379] Trunk connection changed state ACTIVE -> CLOSED
Info  : rlm_ldap - [380] Trunk connection changed state INIT -> CONNECTING
Info  : rlm_ldap - [380] Trunk connection changed state ACTIVE -> CLOSED
Info  : rlm_ldap - [381] Trunk connection changed state INIT -> CONNECTING
Info  : rlm_ldap - [381] Trunk connection changed state ACTIVE -> CLOSED
<--- this is the point workers stopped processing incoming requests
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer
Error : Network 0 - max_outstanding reached - dropping packet: No memory available in ring buffer

If after this I set fr_debug_lvl=2 and send a single request I get this log:

Fri Jul 22 20:28:28 2022: Debug : Main loop waking up in 0.999995504 seconds
Fri Jul 22 20:28:29 2022: Debug : Main loop waking up in 0.999995312 seconds
Fri Jul 22 20:28:29 2022: Debug : Network 0 - 1 event(s) pending
Fri Jul 22 20:28:29 2022: Debug : Network 0 - Reading data from FD 42
Fri Jul 22 20:28:29 2022: Debug : proto_dhcpv4_udp - Received Discover XID 30bcceb1 length 326 dhcpv4_udp serv
er 127.0.0.1 port 6767 on interface lo
Fri Jul 22 20:28:29 2022: Debug : Network 0 - Read 326 byte(s) from FD 42
Fri Jul 22 20:28:29 2022: Debug : Network 0 - Gathering events - will wait
Fri Jul 22 20:28:29 2022: Debug : 1 event(s) pending
Fri Jul 22 20:28:29 2022: Debug : Worker 1 - Channel data-to-responder
Fri Jul 22 20:28:29 2022: Debug : Worker 1 - Received request 1564028
Fri Jul 22 20:28:29 2022: Debug : (2163624)  proto_dhcpv4 decode packet
Fri Jul 22 20:28:29 2022: Debug : (2163624)  0000: 01 01 06 00 30 bc ce b1 00 00 00 00 00 00 00 00
Fri Jul 22 20:28:29 2022: Debug : (2163624)  0010: 00 00 00 00 00 00 00 00 00 00 00 00 01 01 bc 11
Fri Jul 22 20:28:29 2022: Debug : (2163624)  0020: 22 33 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Fri Jul 22 20:28:29 2022: Debug : (2163624)  0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Fri Jul 22 20:28:29 2022: Debug : (2163624)  0040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Fri Jul 22 20:28:29 2022: Debug : (2163624)  0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Fri Jul 22 20:28:29 2022: Debug : (2163624)  0060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Fri Jul 22 20:28:29 2022: Debug : (2163624)  0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Fri Jul 22 20:28:29 2022: Debug : (2163624)  0080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Fri Jul 22 20:28:29 2022: Debug : (2163624)  0090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Fri Jul 22 20:28:29 2022: Debug : (2163624)  00a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Fri Jul 22 20:28:29 2022: Debug : (2163624)  00b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Fri Jul 22 20:28:29 2022: Debug : (2163624)  00c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Fri Jul 22 20:28:29 2022: Debug : (2163624)  00d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Fri Jul 22 20:28:29 2022: Debug : (2163624)  00e0: 00 00 00 00 00 00 00 00 00 00 00 00 63 82 53 63
Fri Jul 22 20:28:29 2022: Debug : (2163624)  00f0: 35 01 01 3d 06 01 01 bc 11 22 33 37 05 2b 29 01
Fri Jul 22 20:28:29 2022: Debug : (2163624)  0100: 03 06 39 02 05 dc 3c 09 76 69 61 73 61 74 31 2e
Fri Jul 22 20:28:29 2022: Debug : (2163624)  0110: 30 52 31 06 1f 30 31 30 31 42 43 31 31 32 32 33
Fri Jul 22 20:28:29 2022: Debug : (2163624)  0120: 33 40 61 75 74 2e 72 65 73 2e 76 69 61 73 61 74
Fri Jul 22 20:28:29 2022: Debug : (2163624)  0130: 2e 63 6f 6d 02 06 01 01 bc 11 22 33 01 06 00 a0
Fri Jul 22 20:28:29 2022: Debug : (2163624)  0140: bc 33 22 11 ff 00
Fri Jul 22 20:28:29 2022: Debug : Worker 1 - Resetting cleanup timer to +30
Fri Jul 22 20:28:29 2022: Debug : No pending requests (1 yielded)
Fri Jul 22 20:28:29 2022: Debug : Gathering events - will wait
Fri Jul 22 20:28:29 2022: Debug : 1 event(s) pending
Fri Jul 22 20:28:29 2022: ERROR : ERROR : (2163623)  Request has reached max_request_time - signalling it to s
top
Fri Jul 22 20:28:29 2022: Debug : (2163623)  Done request
Fri Jul 22 20:28:29 2022: Debug : (2163623)  Finished request
Fri Jul 22 20:28:29 2022: Debug : No pending requests (1 yielded)
Fri Jul 22 20:28:29 2022: Debug : Gathering events - will wait
Fri Jul 22 20:28:29 2022: Debug : Network 0 - 1 event(s) pending
Fri Jul 22 20:28:29 2022: Debug : Network 0 - Channel data-to-requestor
Fri Jul 22 20:28:29 2022: Debug : proto_dhcpv4_udp - cleaning up request in 5.000000s
Fri Jul 22 20:28:29 2022: Debug : Network 0 - Gathering events - will wait
Fri Jul 22 20:28:30 2022: Debug : Main loop waking up in 0.999994897 seconds
Fri Jul 22 20:28:31 2022: Debug : Main loop waking up in 0.999995215 seconds
Fri Jul 22 20:28:32 2022: Debug : Main loop waking up in 0.999995753 seconds
Fri Jul 22 20:28:33 2022: Debug : Main loop waking up in 0.999995094 seconds
Fri Jul 22 20:28:34 2022: Debug : Main loop waking up in 0.999995212 seconds
Fri Jul 22 20:28:34 2022: Debug : Network 0 - 1 event(s) pending
Fri Jul 22 20:28:34 2022: Debug : TIMER - proto_dhcpv4_udp - cleanup delay
Fri Jul 22 20:28:34 2022: Debug : Network 0 - Gathering events - will wait
Fri Jul 22 20:28:35 2022: Debug : Main loop waking up in 0.999995259 seconds
Fri Jul 22 20:28:36 2022: Debug : Main loop waking up in 0.999995316 seconds

Some gdb output:

(gdb) info thr                                                                                     [115/98469]
  Id   Target Id         Frame
  8    Thread 0x7fffe6e3c700 (LWP 20906) "radiusd" 0x00007ffff49ece9f in __GI_ppoll (fds=0x7fffe6e0d8f4,
    nfds=1, timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:56
  7    Thread 0x7fffe763d700 (LWP 20905) "radiusd" 0x00007ffff49f80e3 in epoll_wait ()
    at ../sysdeps/unix/syscall-template.S:81
  6    Thread 0x7fffe7e3e700 (LWP 20904) "radiusd" 0x00007ffff49f80e3 in epoll_wait ()
    at ../sysdeps/unix/syscall-template.S:81
  5    Thread 0x7ffff1e3a700 (LWP 20903) "libkqueue_mon" 0x00007ffff493058a in do_sigwaitinfo (
    info=0x7ffff1e0d010, set=0x7ffff1e0cf90) at ../sysdeps/unix/sysv/linux/sigwaitinfo.c:54
  4    Thread 0x7ffff1639700 (LWP 20902) "libkqueue_wait" 0x00007ffff493058a in do_sigwaitinfo (
    info=0x7ffff160bfe0, set=0x7ffff160bf60) at ../sysdeps/unix/sysv/linux/sigwaitinfo.c:54
* 1    Thread 0x7ffff7fdf580 (LWP 20895) "radiusd" 0x00007ffff49ece9f in __GI_ppoll (fds=0x7fffffffe1e4,
    nfds=1, timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:56
(gdb) thr 8
[Switching to thread 8 (Thread 0x7fffe6e3c700 (LWP 20906))]
#0  0x00007ffff49ece9f in __GI_ppoll (fds=0x7fffe6e0d8f4, nfds=1, timeout=<optimized out>, sigmask=0x0)
    at ../sysdeps/unix/sysv/linux/ppoll.c:56
56            result = INLINE_SYSCALL (ppoll, 5, fds, nfds, timeout, sigmask,
(gdb) bt
#0  0x00007ffff49ece9f in __GI_ppoll (fds=0x7fffe6e0d8f4, nfds=1, timeout=<optimized out>, sigmask=0x0)
    at ../sysdeps/unix/sysv/linux/ppoll.c:56
#1  0x00007ffff6323021 in linux_kevent_wait_hires (kq=0x7fffdc00b220, timeout=0x7fffe6e0dab0)
    at /usr/src/debug/libkqueue-2.6.1/src_0/linux/platform.c:635
#2  0x00007ffff632317f in linux_kevent_wait (kq=0x7fffdc00b220, nevents=256, ts=0x7fffe6e0dab0)
    at /usr/src/debug/libkqueue-2.6.1/src_0/linux/platform.c:668
#3  0x00007ffff631d145 in kevent (kqfd=36, changelist=0x7ffff65353a0 <null_kev>, nchanges=0,
    eventlist=0x7fffdc004d58, nevents=256, timeout=0x7fffe6e0dab0)
    at /usr/src/debug/libkqueue-2.6.1/src_0/common/kevent.c:445
#4  0x00007ffff7b2788f in fr_event_corral (el=0x7fffdc004c80, now=..., wait=true)
    at /home/ubuntu/work/viasat/freeradius-server/src/lib/util/event.c:2245
#5  0x00007ffff70f4088 in unlang_interpret_synchronous (el=0x7fffdc004c80, request=0x7fff118c8470)
    at src/freeradius-devel/util/time.h:143
#6  0x00007fffeb343f35 in fr_ldap_trunk_search (p_result=0x7fffe6e0e5dc, ctx=0x7fff118cb670,
    out=0x7fffe6e0dd18, request=0x7fff118c8470, ttrunk=0x7fffdc02e460, base_dn=0x1575990 "dc=viasat,dc=com",
    scope=2, filter=0x7fffe6e0dd30 "(cn=1,6,00:a0:bc:6c:7d:3a)", attrs=0x7fffe6e0e608,
    serverctrls=0x7fffe6e0dd20, clientctrls=0x0, is_async=false)
    at /home/ubuntu/work/viasat/freeradius-server/src/lib/ldap/base.c:705
#7  0x00007fffeb9e42c8 in rlm_ldap_find_user (inst=inst@entry=0x1211750,
    request=request@entry=0x7fff118c8470, ttrunk=ttrunk@entry=0x7fffdc02e460,
    attrs=attrs@entry=0x7fffe6e0e608, force=force@entry=true, result=result@entry=0x7fffe6e0e5e8,
    handle=0x7fffe6e0e5e0, rcode=0x7fffe6e0e5dc)
    at /usr/src/debug/freeradius-server-4.0.0/src/modules/rlm_ldap/user.c:118
#8  0x00007fffeb9df7f3 in mod_authorize (p_result=0x7fff118cb688, mctx=0x7fffe6e0eeb0,
    request=0x7fff118c8470) at /usr/src/debug/freeradius-server-4.0.0/src/modules/rlm_ldap/rlm_ldap.c:1166
#9  0x00007ffff70f6fd1 in unlang_module (p_result=0x7fff118c863c, request=0x7fff118c8470,
    frame=0x7fff118c8808) at /usr/src/debug/freeradius-server-4.0.0/src/lib/unlang/module.c:924
#10 0x00007ffff70e1169 in frame_eval (priority=0x7fff118c8638, result=0x7fff118c863c, frame=0x7fff118c8808,
    request=0x7fff118c8470) at /usr/src/debug/freeradius-server-4.0.0/src/lib/unlang/interpret.c:500
#11 unlang_interpret (request=request@entry=0x7fff118c8470)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/unlang/interpret.c:684
#12 0x00007ffff6ebbbfd in worker_run_request (start=..., worker=0x7fffdc00d500)
    at /usr/src/debug/freeradius-server-4.0.0/src/lib/io/worker.c:1184
#13 fr_worker (worker=0x7fffdc00d500) at /usr/src/debug/freeradius-server-4.0.0/src/lib/io/worker.c:1369
#14 0x00007ffff6eb663f in fr_schedule_worker_thread (arg=<optimized out>)
---Type <return> to continue, or q <return> to quit---q
 at /usr/src/debug/freeradius-server-Quit
(gdb) fr 6
#6  0x00007fffeb343f35 in fr_ldap_trunk_search (p_result=0x7fffe6e0e5dc, ctx=0x7fff118cb670,
    out=0x7fffe6e0dd18, request=0x7fff118c8470, ttrunk=0x7fffdc02e460, base_dn=0x1575990 "dc=viasat,dc=com",
    scope=2, filter=0x7fffe6e0dd30 "(cn=1,6,00:a0:bc:6c:7d:3a)", attrs=0x7fffe6e0e608,
    serverctrls=0x7fffe6e0dd20, clientctrls=0x0, is_async=false)
    at /home/ubuntu/work/viasat/freeradius-server/src/lib/ldap/base.c:705
(gdb) print *ttrunk
$15 = {node = {left = 0x4242424242424242, right = 0x4242424242424242, parent = 0x4242424242424242,
    data = 0x4242424242424242, colour = (unknown: 1111638594), being_freed = 66},
  uri = 0x4242424242424242 <Address 0x4242424242424242 out of bounds>,
  bind_dn = 0x4242424242424242 <Address 0x4242424242424242 out of bounds>, config = {
    name = 0x4242424242424242 <Address 0x4242424242424242 out of bounds>,
    server = 0x4242424242424242 <Address 0x4242424242424242 out of bounds>, server_str = 0x4242424242424242,
    port = 16962, admin_identity = 0x4242424242424242 <Address 0x4242424242424242 out of bounds>,
    admin_password = 0x4242424242424242 <Address 0x4242424242424242 out of bounds>, admin_sasl = {
      mech = 0x4242424242424242 <Address 0x4242424242424242 out of bounds>,
      proxy = 0x4242424242424242 <Address 0x4242424242424242 out of bounds>,
      realm = 0x4242424242424242 <Address 0x4242424242424242 out of bounds>},
    sasl_secprops = 0x4242424242424242 <Address 0x4242424242424242 out of bounds>, dereference = 1111638594,
    dereference_str = 0x4242424242424242 <Address 0x4242424242424242 out of bounds>, chase_referrals = 66,
    chase_referrals_unset = 66, use_referral_credentials = 66, referral_depth = 16962, rebind = 66,
    tls_mode = 1111638594, start_tls = 66,
    tls_ca_file = 0x4242424242424242 <Address 0x4242424242424242 out of bounds>,
    tls_ca_path = 0x4242424242424242 <Address 0x4242424242424242 out of bounds>,
    tls_certificate_file = 0x4242424242424242 <Address 0x4242424242424242 out of bounds>,
    tls_private_key_file = 0x4242424242424242 <Address 0x4242424242424242 out of bounds>,
    tls_require_cert_str = 0x4242424242424242 <Address 0x4242424242424242 out of bounds>,
    tls_require_cert = 1111638594,
    tls_min_version_str = 0x4242424242424242 <Address 0x4242424242424242 out of bounds>,
    tls_min_version = 1111638594, keepalive_idle = {value = 4774451407313060418},
    keepalive_probes = 1111638594, keepalive_interval = {value = 4774451407313060418}, srv_timelimit = {
      value = 4774451407313060418}, res_timeout = {value = 4774451407313060418}, net_timeout = {
      value = 4774451407313060418}, tls_handshake_timeout = {value = 4774451407313060418},
    reconnection_delay = {value = 4774451407313060418}, idle_timeout = {value = 4774451407313060418}},
  directory = 0x4242424242424242, trunk = 0x4242424242424242, t = 0x4242424242424242,
  ev = 0x4242424242424242}
(gdb)

So worker thread is still doing fr_ldap_trunk_search() but its ttrunk has already been free'd.

mkhon avatar Jul 22 '22 21:07 mkhon

This is the issue the timeout event was added to prevent. I was seeing exactly the same deadlocks on a production server.

arr2036 avatar Jul 23 '22 11:07 arr2036

This is the issue the timeout event was added to prevent. I was seeing exactly the same deadlocks on a production server.

@arr2036 which issue? Seems that the link is omitted

mkhon avatar Jul 27 '22 09:07 mkhon

@arr2036 if I am not mistaken most of the issues here are fixed now except use-after-free for fr_ldap_query_t. If I am not mistaken you mentioned you will provide a better fix for that (than changing allocation context - the patch we are currently using in our staging to fix that issue, but it is not 100% correct).

Do you have any updates on this?

mkhon avatar Sep 06 '22 10:09 mkhon