Crashes when `tracing_udp_listener_addr` is set to an invalid DNS entry
- FoundationDB version: 7.3.57
- Environment: Kubernetes
I set my cluster's `tracing_udp_listener_addr` knob to a domain that does not resolve (`dig` responds with `status: NXDOMAIN`).
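For completeness, the same failure can be reproduced straight from the resolver, independent of `dig`. Below is a minimal C++ sketch (not FoundationDB code; the domain is the placeholder used later in this report) that just asks `getaddrinfo` for the name the way a UDP client would:

```cpp
// Standalone resolution check (not FoundationDB code).
// For an NXDOMAIN name, getaddrinfo() fails (typically with EAI_NONAME,
// "Name or service not known") instead of returning an address.
#include <cstdio>
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>

int main(int argc, char** argv) {
    // Placeholder domain from this report; pass a different name as argv[1] to compare.
    const char* host = argc > 1 ? argv[1] : "badsubdomain.mycompany.com";

    addrinfo hints{};
    hints.ai_family = AF_UNSPEC;    // IPv4 or IPv6
    hints.ai_socktype = SOCK_DGRAM; // UDP, like the tracing listener

    addrinfo* res = nullptr;
    int rc = getaddrinfo(host, nullptr, &hints, &res);
    if (rc != 0) {
        std::fprintf(stderr, "getaddrinfo(%s) failed: %s\n", host, gai_strerror(rc));
        return 1;
    }

    char addr[128]; // plenty for a numeric IPv4/IPv6 address
    for (addrinfo* ai = res; ai != nullptr; ai = ai->ai_next) {
        if (getnameinfo(ai->ai_addr, ai->ai_addrlen, addr, sizeof(addr),
                        nullptr, 0, NI_NUMERICHOST) == 0)
            std::printf("%s resolves to %s\n", host, addr);
    }
    freeaddrinfo(res);
    return 0;
}
```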
Setting the knob to that value caused pods to crash with logs like:

```
{"Severity":"40","ErrorKind":"BugDetected","Time":"1750770510.406294","DateTime":"2025-06-24T13:08:30Z","Type":"Crash","ID":"0000000000000000","Signal":"6","Name":"Aborted","Trace":"addr2line -e fdbserver.debug -p -C -f -i 0x7f1a1df9c730","ThreadID":"4266503881614457221","Backtrace":"addr2line -e fdbserver.debug -p -C -f -i 0x5594e7d 0x5595143 0x558f344 0x555cafb 0x7f1a1df9c730","Machine":"<IP>:4501","LogGroup":"<LOGGROUP>","Roles":"DD"}
```

(IP and log group replaced with placeholders.)
In `monitor.log` I could see:

```
{"level":"info","ts":1750763764.4495609,"msg":"Starting subprocess","processNumber":1,"area":"runProcess","arguments":["/usr/bin/fdbserver","--cluster_file=/var/fdb/data/fdb.cluster","--seed_cluster_file=/var/dynamic-conf/fdb.cluster","--public_address=[<IP>]:4501","--class=stateless","--logdir=/var/log/fdb-trace-logs","--loggroup=<LOGGROUP>","--datadir=/var/fdb/data/1","--locality_process_id=stateless-33705-1","--locality_instance_id=stateless-33705","--locality_machineid=<HOSTNAME>","--locality_zoneid=<ZONE>","--listen_address=[<IP>]:4501","--trace_format=xml","--locality_data_hall=<DATA HALL>","--tracer=network_lossy","--locality_dns_name=<DNS NAME>"]}
{"level":"info","ts":1750763764.4501088,"msg":"Subprocess started","processNumber":1,"area":"runProcess","PID":1085}
{"level":"info","ts":1750763764.4767077,"msg":"Subprocess output","processNumber":1,"area":"runProcess","msg":"FDBD joined cluster.","PID":1085}
{"level":"error","ts":1750763768.102775,"msg":"Subprocess error log","processNumber":1,"area":"runProcess","msg":"libc++abi: terminating due to uncaught exception of type Error","PID":1085}
{"level":"error","ts":1750763768.1028233,"msg":"Subprocess error log","processNumber":1,"area":"runProcess","msg":"SIGNAL: Aborted (6)","PID":1085}
{"level":"error","ts":1750763768.1028285,"msg":"Subprocess error log","processNumber":1,"area":"runProcess","msg":"Trace: addr2line -e fdbserver.debug -p -C -f -i 0x76de6ab4d730","PID":1085}
{"level":"error","ts":1750763768.1710055,"msg":"Error from subprocess","processNumber":1,"area":"runProcess","PID":1085,"error":"signal: aborted"}
```
I downloaded `fdbserver.debug` from https://github.com/apple/foundationdb/releases/tag/7.3.57, and got a backtrace:

```
$ addr2line -e fdbserver.debug -p -C -f -i 0x5594e7d 0x5595143 0x558f344 0x555cafb 0x7459ec65d730
BaseTraceEvent::backtrace(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.cpp:?
std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::__is_long[abi:v15006]() const at /usr/local/bin/../include/c++/v1/string:1499
 (inlined by) ~basic_string at /usr/local/bin/../include/c++/v1/string:2333
 (inlined by) BaseTraceEvent::log() at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.cpp:1326
~BaseTraceEvent at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Trace.cpp:1369
crashHandler(int) at /home/foundationdb_ci/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/flow/Platform.actor.cpp:3709
?? ??:0
```
When I changed the value of `tracing_udp_listener_addr` to `127.0.0.1`, the processes stopped crashing:

```
fdb> getknob tracing_udp_listener_addr
`tracing_udp_listener_addr' is `'badsubdomain.mycompany.com''
fdb> begin; setknob tracing_udp_listener_addr 127.0.0.1; commit "fix addr"
>>> begin
Transaction started
>>> setknob tracing_udp_listener_addr 127.0.0.1
>>> commit fix\x20addr
Committed (14)
fdb> getknob tracing_udp_listener_addr
`tracing_udp_listener_addr' is `'127.0.0.1''
```
(Domain changed from the actual one)
Interestingly enough, the crashes were observed on stateless processes in roles such as DD, RK, and CC, while the storage processes seemed fine. Note that this was a test cluster, so it was not receiving any actual traffic besides some requests for status JSON.
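I have not traced the exact code path that throws here, but I would have expected a resolution failure for this knob to be handled rather than escaping as an uncaught `Error` and taking the process down. Purely as an illustration of that expectation (a hypothetical sketch with invented names and port, not FoundationDB's actual tracer code), something along these lines would degrade to a warning instead of an abort:

```cpp
// Hypothetical sketch only -- names, port, and structure are invented for
// illustration; this is not FoundationDB's API. The idea: resolve the
// configured listener address up front, and if the name does not resolve,
// log a warning and skip UDP tracing instead of letting an error escape.
#include <cstdio>
#include <cstring>
#include <netdb.h>
#include <optional>
#include <string>
#include <sys/socket.h>
#include <sys/types.h>

static std::optional<sockaddr_storage> resolveListener(const std::string& host, const std::string& port) {
    addrinfo hints{};
    hints.ai_socktype = SOCK_DGRAM;
    addrinfo* res = nullptr;
    if (getaddrinfo(host.c_str(), port.c_str(), &hints, &res) != 0 || res == nullptr)
        return std::nullopt; // NXDOMAIN and similar failures end up here
    sockaddr_storage out{};
    std::memcpy(&out, res->ai_addr, res->ai_addrlen);
    freeaddrinfo(res);
    return out;
}

void setUpUdpTracing(const std::string& configuredAddr) {
    // "8889" is an arbitrary placeholder port for this sketch.
    std::optional<sockaddr_storage> addr = resolveListener(configuredAddr, "8889");
    if (!addr) {
        std::fprintf(stderr,
                     "Warning: tracing_udp_listener_addr '%s' does not resolve; UDP tracing disabled\n",
                     configuredAddr.c_str());
        return; // degrade gracefully instead of crashing the process
    }
    // ... open a UDP socket for *addr and start emitting traces ...
}
```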