dhcpcd crash due to double free attempt
This is happening on dhcpcd-10.2.2 version. What seems to happen here is expire and SIGTERM close in time.
And below is pmd full back trace.
Program terminated with signal SIGABRT, Aborted.
#0 __pthread_kill_implementation (threadid=
[**ALERT: The abort() might not be exactly invoked from the following function line. If the trail function contains multiple abort() calls, then you should cross check by other means to get correct abort() call location. This is due to the optimized compilation which hides the debug info for multiple abort() calls in a given function. Refer TR HU16995 for more information]
#4 0x0000007fa543c794 in __libc_message_impl (fmt=fmt@entry=0x7fa551bf00 "%s\n") at /usr/src/debug/glibc/2.39+git/sysdeps/posix/libc_fatal.c:134
#5 0x0000007fa5452ffc in malloc_printerr (str=str@entry=0x7fa5517220 "free(): double free detected in tcache 2") at malloc.c:5772
#6 0x0000007fa54553d8 in _int_free (av=0x7fa5554a40 <main_arena>, p=p@entry=0x55a718a710, have_lock=have_lock@entry=0) at malloc.c:4541
#7 0x0000007fa5457d64 in __GI___libc_free (mem=
(gdb) bt full
#0 __pthread_kill_implementation (threadid=
[**ALERT: The abort() might not be exactly invoked from the following function line. If the trail function contains multiple abort() calls, then you should cross check by other means to get correct abort() call location. This is due to the optimized compilation which hides the debug info for multiple abort() calls in a given function. Refer TR HU16995 for more information]
#4 0x0000007fa543c794 in __libc_message_impl (fmt=fmt@entry=0x7fa551bf00 "%s\n") at /usr/src/debug/glibc/2.39+git/sysdeps/posix/libc_fatal.c:134
ap = {__stack = 0x7fd4559e20, __gr_top = 0x7fd4559e20, __vr_top = 0x7fd4559de0, __gr_offs = -48, __vr_offs = 0}
fd = 2
iov = {{iov_base = 0x7fa5517220, iov_len = 40}, {iov_base = 0x7fa551bf02, iov_len = 1}, {iov_base = 0x30, iov_len = 88}, {iov_base = 0x7fa555a000 <__pthread_keys+14928>, iov_len = 548234702848}, {iov_base = 0x7fa556c5c0, iov_len = 548234676800}, {iov_base = 0x72b1, iov_len = 367875837264}, {iov_base = 0x7fd4559e40, iov_len = 367475065384}}
iovcnt =
(gdb) thread apply all bt
Thread 1 (LWP 8339):
#0 __pthread_kill_implementation (threadid=
Fault seems to be on dhcp6.c:line 4295 on attempt of freeing free(state->send);.
free(state->old);
state->old = state->new;
state->old_len = state->new_len;
state->new = NULL;
state->new_len = 0;
if (drop && state->old &&
(options & DHCPCD_NODROP) != DHCPCD_NODROP)
{
if (reason == NULL)
reason = "STOP6";
script_runreason(ifp, reason);
}
free(state->old);
**free(state->send); ---this seems to be line 4295 where it breaks**
free(state->recv);
free(state);
ifp->if_data[IF_DATA_DHCP6] = NULL;
I suspect this is happening: Expire6 event fires -> Callback 1 runs and eventually calls dhcp6_freedrop() -> triggers netlink through call dhcp6_freedrop_addrs() -> Callback 2 runs and eventually calls dhcp6_freedrop() -> nested dhcp6_freedrop()!
From gdb we can see old and new pointers are set to NULL
(gdb) p *state $2 = {state = 89730220, started = {tv_sec = 8817249710939041503, tv_nsec = 854369100}, IMD = 0, RTC = 1, IRT = 1, MRC = 10, MRT = 30, MRCcallback = 0x557b37b540 <dhcp6_failrequest>, sol_max_rt = 3600, inf_max_rt = 3600, RT = 2020, send = 0x5592caff60, send_len = 176, recv = 0x0, recv_len = 0, new = 0x0, new_len = 0, old = 0x0, old_len = 0, acquired = {tv_sec = 358, tv_nsec = 857515240}, renew = 60, rebind = 120, expire = 4000, unicast = { __in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, addrs = {tqh_first = 0x0, tqh_last = 0x5592cacfd8}, lowpl = 3000, leasefile = "/var/lib/dhcpcd/port0.lease6", '\000' <repeats 142 times>, reason = 0x557b38ce88 "BOUND6", lerror = 0, has_no_binding = false, failed = false, new_start = false, auth = {replay = 0, token = 0x0, reconf = 0x0}}
suspicion is that dhcp6_freedrop_addrs() triggers netlink call due to address free and we get Callback 2 triggered before we reached ifp->if_data[IF_DATA_DHCP6] = NULL; :
Callback 2 - dhcpcd_handlelink(): /* Start handling kernel messages for interfaces, addresses and * routes. */ if (eloop_event_add(ctx.eloop, ctx.link_fd, ELE_READ, dhcpcd_handlelink, &ctx) == -1) logerr("%s: eloop_event_add", func);
there is also interesting comment in dhcp6_freedrop() pointing on possible function re-entering
state = D6_STATE(ifp);
if (state) {
_**/* Failure to send the release may cause this function to
* re-enter */**_
if (state->state == DH6S_RELEASE) {
dhcp6_finishrelease(ifp);
return;
}
We see this issue often in our testing loop, did you have time to check backtrace? Thank you in advance
I'm struggling to replicate and scratching my head on the backtrace. It's complex and I don't have too much free time right now. Maybe I can find some time next week.
I had a theory that we had nested call in dhcp6_freedrop(), calling again dhcp6_freedrop() while first call didn't complete memory clean up: Expire6 event fires -> Callback 1 runs and eventually calls dhcp6_freedrop() -> triggers netlink through call dhcp6_freedrop_addrs() -> Callback 2 runs and eventually calls dhcp6_freedrop() -> nested dhcp6_freedrop()!
In dhcp6_freedrop() function we call:
state = D6_STATE(ifp);
which is defined:
#define D6_STATE(ifp)
((struct dhcp6_state *)(ifp)->if_data[IF_DATA_DHCP6])
This will not be null in second dhcp6_freedrop() call, if it happens before line 4298: ifp->if_data[IF_DATA_DHCP6] = NULL;
From bt we can see old i new pointers in state are already set to null
(gdb) p *state $2 = {state = 89730220, started = {tv_sec = 8817249710939041503, tv_nsec = 854369100}, IMD = 0, RTC = 1, IRT = 1, MRC = 10, MRT = 30, MRCcallback = 0x557b37b540 <dhcp6_failrequest>, sol_max_rt = 3600, inf_max_rt = 3600, RT = 2020, send = 0x5592caff60, send_len = 176, recv = 0x0, recv_len = 0, new = 0x0, new_len = 0, old = 0x0, old_len = 0, acquired = {tv_sec = 358, tv_nsec = 857515240}, renew = 60, rebind = 120, expire = 4000, unicast = { __in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, addrs = {tqh_first = 0x0, tqh_last = 0x5592cacfd8}, lowpl = 3000, leasefile = "/var/lib/dhcpcd/ribridge0.lease6", '\000' <repeats 142 times>, reason = 0x557b38ce88 "BOUND6", lerror = 0, has_no_binding = false, failed = false, new_start = false, auth = {replay = 0, token = 0x0, reconf = 0x0}}
I suspect that dhcp6_freedrop_addrs() as freeing addresses can trigger netlink callback dhcpcd_handlelink() which will have nested dhcp6_freedrop() second call
Callback 2 - dhcpcd_handlelink(): /* Start handling kernel messages for interfaces, addresses and * routes. */ if (eloop_event_add(ctx.eloop, ctx.link_fd, ELE_READ, dhcpcd_handlelink, &ctx) == -1) logerr("%s: eloop_event_add", func);
Hi, I did more code checking, and as we can see from syslog that EXPIRE happened very close to SIGTERM. I now suspect that it could happen that we have dhcp6_startexpire() calling dhcp6_fail()
static void dhcp6_startexpire(void *arg) { struct interface *ifp;
ifp = arg;
eloop_timeout_delete(ifp->ctx->eloop, dhcp6_sendrebind, ifp);
logerrx("%s: DHCPv6 lease expired", ifp->name);
dhcp6_fail(ifp, true);
}
in dhcp6_fail() we could have this free
free(state->old);
state->old = state->new;
state->old_len = state->new_len;
state->new = NULL;
state->new_len = 0;
and we should have only 1 interface
But ifp->if_data[IF_DATA_DHCP6] is not set to NULL. If get dhcp6_freedrop() called later, could we enter problem?
I don't understand how free(state->send) could be the line causing the issue. If you look at the code it's only set and freed/unset in a few places.
Hi, please read my previous comments and output of backtrace and state variable in the moment of crash. What I suspect is that we have partial state free and not initialized to null, so we end up in case of double memory free attempt. Based on events when the fault happens we need to research dhcpcd code sequence for SIGTERM and EXIPRE. There seems to be racing there in call toward dhcp6_freedrop(). I placed above my suspected sequences, but maybe you will notice something else.
I still can't replicate this. Have built with ASAN and can manually trigger a double free in code to check it works and it does. I have tried taking the interface down, removing addreses, but it just works fine for me so I must be missing something. Can you share your dhcpcd.conf please and maybe a debug log as well to give some context around what dhcpcd is doing before the crash please?
Also, what OS are you using and how is dhcpcd compiled please? Could you post the output of dhcpcd --version?
Hi, problem happens on dhcpcd --version: dhcpcd-10.2.2
Regarding OS we are using Linux 6.1.130
Regarding dhcpcd.conf need to check internally how to share this info. Not sure is this really needed.
From syslog in the time of fault we can see that EXPIRE happened very close to SIGTERM for dhcpcd.
Reason: Program Crash Program: dhcpcd Pid: 8339 Rank: Cold Signal: 6 PMD: /var/log/pmd/pmd-dhcpcd-8339-20250529-084051.tgz
From syslog in time of fault: May 29 08:40:51 dhcpcd[8340]: port0: carrier lost May 29 08:40:51 dhcpcd[8340]: port0: deleting address f9aa:d74e:9a05:c799:1044:d09c:38d4:a2b9 May 29 08:40:51 dhcpcd[8340]: port0: deleting default route via fe80::abcd:1234:5678:9abc May 29 08:40:51 dhcpcd[30042]: sending signal TERM to pid 8339 May 29 08:40:51 dhcpcd[30042]: waiting for pid 8339 to exit May 29 08:40:51 dhcpcd[8340]: received SIGTERM, stopping May 29 08:40:51 dhcpcd[8340]: port0: removing interface
We have delivered internally this patch from attachment, now for several weeks
fault is not visible. Could you consider to have this patch in some coming official dhcpcd version?
We have delivered internally this patch from attachment, now for several weeks
fault is not visible. Could you consider to have this patch in some coming official dhcpcd version?
No that patch won't fly. We need to find out what is being freed and not set to NULL, this patch does not address that.
@Sime-Zupanovic Does the 10.3.0 that was released a few days ago fix the issue?
Same status as in https://github.com/NetworkConfiguration/dhcpcd/issues/519