dnsfwd
dnsfwd copied to clipboard
dnsfwd shutdown after running for a while
So far, dnsfwd has shut down with two different errors, IF it has answered dns queries:
dnsfwd --bind-udp 127.0.0.1:9999 --connect-tcp 127.0.0.1:53 <2>shutdown: Socket is not connected
dnsfwd --bind-udp 127.0.0.1:9999 --connect-tcp 127.0.0.1:53 --logformat human --loglevel 5 CRIT: shutdown: Socket is not connected
It had managed to stay up for hours, when no query was thrown at it.
How long (approximately) is this? I'll try and see if I can reproduce this. This is using Mac OS X with a local unbound ? I don't have a MacOS X available but I can try on a FreeBSD VM.
Which version of boost are you using?
You might want to try using another upstream TCP DNS server.
A stacktrace of the exception throw point might be useful. With GDB:
catch throw
commands
backtrace
end
with a debug build (CMAKE_BUILD_TYPE=Debug).
According to my unbound logs, this latest run lasted exactly 4 minutes 29 seconds, before unbound stopped receiving queries from dnsfwd.
Apparently had two Cmake and two boost installed:
boost/1.60.0_2/ boost/1.61.0/
cmake/3.6.1 cmake/3.5.2
Removed all of those, and installing cmake 3.6.2 and boost 1.62.0 now.
cmake -D CMAKE_BUILD_TYPE:STRING=Debug .
The resulting debug binary is bigger than the previous one:
398K Nov 2 21:32 ./dnsfwd 304K Oct 31 18:56 /usr/local/bin/dnsfwd
catch throw commands backtrace end
When are those gdb commands supposed to be executed?
I'm getting this, with or without dnsfwd running:
[100%] Built target dnsfwd Install the project... -- Install configuration: "Debug" -- Installing: /usr/local/bin/dnsfwd
gdb $(which dnsfwd) OR sudo gdb --args dnsfwd --bind-udp 127.0.0.1:53 --connect-tcp 127.0.0.1:5353 --logformat human --loglevel 5
GNU gdb (GDB) 7.12 Copyright (C) 2016 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-apple-darwin15.6.0". Type "show configuration" for configuration details. For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/. Find the GDB manual and other documentation resources online at: http://www.gnu.org/software/gdb/documentation/. For help, type "help". Type "apropos word" to search for commands related to "word"... Reading symbols from /usr/local/bin/dnsfwd... warning: `/tmp/lto.o': can't open to read symbols: No such file or directory. (no debugging symbols found)...done. (gdb) catch throw Catchpoint 1 (throw) (gdb) commands Type commands for breakpoint(s) 1, one per line. End with a line saying just "end". (gdb) backtrace No stack. (gdb)
Tried adding this to CMakeLists.txt (according to https://github.com/Samsung/jerryscript/issues/760) and recompiled, to no avail:
option(ENABLE_LTO "Enable LTO build" OFF) option(ENABLE_ALL_IN_ONE "Enable ALL_IN_ONE build" ON) set(PLATFORM_EXT "DARWIN") set(EXTERNAL_BUILD FALSE)
Another run shut itself down in 6 mins 1 sec.
Going to try the OS X equivalent to gdb, called lldb, per https://www.mail-archive.com/[email protected]/msg10784.html
When I run lldb in either of those two ways below, the dnsfwd options such as "--bind-udp 127.0.0.1:53" don't seem to be recognized:
sudo lldb dnsfwd (lldb) target create "dnsfwd" Current executable set to 'dnsfwd' (x86_64). (lldb) r --bind-udp 127.0.0.1:53 --connect-tcp 127.0.0.1:5353 --logformat human --loglevel 5 Process 65577 launched: '/usr/local/bin/dnsfwd' (x86_64)
sudo lldb -- /usr/local/bin/dnsfwd --bind-udp 127.0.0.1:53 --connect-tcp 127.0.0.1:5353 --logformat human --loglevel 5
dig +dnssec debian.org @127.0.0.1
; <<>> DiG 9.8.3-P1 <<>> +dnssec debian.org @127.0.0.1 ;; global options: +cmd ;; connection timed out; no servers could be reached
(gdb) backtrace No stack.
This is weird. You shouldn't get a prompt there and the command should not execute immediately.
(gdb) backtrace No stack.
Could that be because of the part earlier?
Reading symbols from /usr/local/bin/dnsfwd... warning: `/tmp/lto.o': can't open to read symbols: No such file or directory. (no debugging symbols found)...done.
sudo gdb --args /usr/local/bin/dnsfwd --bind-udp 127.0.0.1:53 --connect-tcp 127.0.0.1:5353 --logformat human --loglevel 5 ... Reading symbols from /usr/local/bin/dnsfwd... warning: `/tmp/lto.o': can't open to read symbols: No such file or directory. (no debugging symbols found)...done. (gdb) r Starting program: /usr/local/bin/dnsfwd --bind-udp 127.0.0.1:53 --connect-tcp 127.0.0.1:5353 --logformat human --loglevel 5 Unable to find Mach task port for process-id 66660: (os/kern) failure (0x5). (please check gdb is codesigned - see taskgated(8))
sudo gdb --args /usr/local/bin/dnsfwd --bind-udp 127.0.0.1:5353 --connect-tcp 127.0.0.1:53 --logformat human --loglevel 5 Password: GNU gdb (GDB) 7.12 Copyright (C) 2016 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-apple-darwin15.6.0". Type "show configuration" for configuration details. For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/. Find the GDB manual and other documentation resources online at: http://www.gnu.org/software/gdb/documentation/. For help, type "help". Type "apropos word" to search for commands related to "word"... Reading symbols from /usr/local/bin/dnsfwd... warning:
/tmp/lto.o': can't open to read symbols: No such file or directory. (no debugging symbols found)...done. (gdb) r Starting program: /usr/local/bin/dnsfwd --bind-udp 127.0.0.1:5353 --connect-tcp 127.0.0.1:53 --logformat human --loglevel 5 warning:/private/tmp/boost-20161007-83343-ugak3n/boost_1_62_0/bin.v2/libs/system/build/darwin-4.2.1/release/error_code.o': can't open to read symbols: No such file or directory. warning:/private/tmp/boost-20161007-83343-ugak3n/boost_1_62_0/bin.v2/libs/program_options/build/darwin-4.2.1/release/cmdline.o': can't open to read symbols: No such file or directory. warning:/private/tmp/boost-20161007-83343-ugak3n/boost_1_62_0/bin.v2/libs/program_options/build/darwin-4.2.1/release/config_file.o': can't open to read symbols: No such file or directory. warning:/private/tmp/boost-20161007-83343-ugak3n/boost_1_62_0/bin.v2/libs/program_options/build/darwin-4.2.1/release/convert.o': can't open to read symbols: No such file or directory. warning:/private/tmp/boost-20161007-83343-ugak3n/boost_1_62_0/bin.v2/libs/program_options/build/darwin-4.2.1/release/options_description.o': can't open to read symbols: No such file or directory. warning:/private/tmp/boost-20161007-83343-ugak3n/boost_1_62_0/bin.v2/libs/program_options/build/darwin-4.2.1/release/parsers.o': can't open to read symbols: No such file or directory. warning:/private/tmp/boost-20161007-83343-ugak3n/boost_1_62_0/bin.v2/libs/program_options/build/darwin-4.2.1/release/positional_options.o': can't open to read symbols: No such file or directory. warning:/private/tmp/boost-20161007-83343-ugak3n/boost_1_62_0/bin.v2/libs/program_options/build/darwin-4.2.1/release/split.o': can't open to read symbols: No such file or directory. warning:/private/tmp/boost-20161007-83343-ugak3n/boost_1_62_0/bin.v2/libs/program_options/build/darwin-4.2.1/release/utf8_codecvt_facet.o': can't open to read symbols: No such file or directory. warning:/private/tmp/boost-20161007-83343-ugak3n/boost_1_62_0/bin.v2/libs/program_options/build/darwin-4.2.1/release/value_semantic.o': can't open to read symbols: No such file or directory. warning:/private/tmp/boost-20161007-83343-ugak3n/boost_1_62_0/bin.v2/libs/program_options/build/darwin-4.2.1/release/variables_map.o': can't open to read symbols: No such file or directory. CRIT: bind: Address already in use Inferior 1 (process 66881) exited with code 01
Well, here's how far I've gotten with lldb, which apparently is not quite like gdb:
http://lldb.llvm.org/lldb-gdb.html http://lldb.llvm.org/tutorial.html https://stackoverflow.com/questions/8122375/breakpoint-on-exceptions-equivalent-of-gdbs-catch-throw https://debugging-with-lldb.blogspot.com/
in one terminal session:
lldb -d (lldb) env DEBUG=1 (lldb) process attach --name dnsfwd --waitfor
in another terminal, launch dnsfwd
dnsfwd --bind-udp 127.0.0.1:53530 --connect-tcp 127.0.0.1:53 --logformat human --loglevel 5
back in the lldb terminal; at this point, dnsfwd seems to have stopped responding to dns queries, even though it continues to run and doesn't exit (shut down):
Process 78107 stopped
- thread #1: tid = 0x2ad7ffa, 0x00007fff9fd56eca libsystem_kernel.dylib
kevent + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP frame #0: 0x00007fff9fd56eca libsystem_kernel.dylibkevent + 10 libsystem_kernel.dylib`kevent: -> 0x7fff9fd56eca <+10>: jae 0x7fff9fd56ed4 ; <+20> 0x7fff9fd56ecc <+12>: movq %rax, %rdi 0x7fff9fd56ecf <+15>: jmp 0x7fff9fd507cd ; cerror_nocancel 0x7fff9fd56ed4 <+20>: retq
Executable module set to "/usr/local/bin/dnsfwd". Architecture set to: x86_64-apple-macosx.
"break set -E C++" seems to be equivalent to gdb's "catch throw"
(lldb) break set -E C++ Breakpoint 1: 2 locations.
I haven't found the lldb equivalent to gdb's "commands"; lldb has a "command" command
"bt" is lldb equivalent to gdb's "backtrace"
(lldb) bt
- thread #1: tid = 0x2ad7ffa, 0x00007fff9fd56eca libsystem_kernel.dylib`kevent + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
- frame #0: 0x00007fff9fd56eca libsystem_kernel.dylib
kevent + 10 frame #1: 0x000000010a2e872f dnsfwdboost::asio::detail::kqueue_reactor::run(bool, boost::asio::detail::op_queueboost::asio::detail::task_io_service_operation&) + 239 frame #2: 0x000000010a2e83f7 dnsfwdboost::asio::detail::task_io_service::do_run_one(boost::asio::detail::scoped_lock<boost::asio::detail::posix_mutex>&, boost::asio::detail::task_io_service_thread_info&, boost::system::error_code const&) + 327 frame #3: 0x000000010a2e8065 dnsfwdboost::asio::detail::task_io_service::run(boost::system::error_code&) + 181 frame #4: 0x000000010a2e5174 dnsfwdmain + 9988 frame #5: 0x00007fff90f395ad libdyld.dylibstart + 1 frame #6: 0x00007fff90f395ad libdyld.dylib`start + 1
- frame #0: 0x00007fff9fd56eca libsystem_kernel.dylib
Since dnsfwd has already stopped responding, as soon as lldb attaches, I don't see how to trigger its eventual shutdown, like under normal non-debug conditions.
Got it.
lldb -d (lldb) env DEBUG=1 (lldb) process attach --name dnsfwd --waitfor Process 79879 stopped
- thread #1: tid = 0x2ae26fd, 0x00007fff9fd56eca libsystem_kernel.dylib
kevent + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP frame #0: 0x00007fff9fd56eca libsystem_kernel.dylibkevent + 10 libsystem_kernel.dylib`kevent: -> 0x7fff9fd56eca <+10>: jae 0x7fff9fd56ed4 ; <+20> 0x7fff9fd56ecc <+12>: movq %rax, %rdi 0x7fff9fd56ecf <+15>: jmp 0x7fff9fd507cd ; cerror_nocancel 0x7fff9fd56ed4 <+20>: retq
Executable module set to "/usr/local/bin/dnsfwd". Architecture set to: x86_64-apple-macosx.
(lldb) break set -E C++ Breakpoint 1: 2 locations. (lldb) continue Process 79879 resuming (lldb) bt error: invalid thread
Process 79879 stopped
- thread #1: tid = 0x2ae26fd, 0x00007fff9e63aa01 libc++abi.dylib
__cxa_throw, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1 frame #0: 0x00007fff9e63aa01 libc++abi.dylib__cxa_throw libc++abi.dylib`__cxa_throw: -> 0x7fff9e63aa01 <+0>: pushq %rbp 0x7fff9e63aa02 <+1>: movq %rsp, %rbp 0x7fff9e63aa05 <+4>: pushq %r15 0x7fff9e63aa07 <+6>: pushq %r14
(lldb) bt
- thread #1: tid = 0x2ae26fd, 0x00007fff9e63aa01 libc++abi.dylib`__cxa_throw, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
-
frame #0: 0x00007fff9e63aa01 libc++abi.dylib`__cxa_throw
frame #1: 0x0000000106b025ef dnsfwd
void boost::throw_exception<boost::system::system_error>(boost::system::system_error const&) + 79 frame #2: 0x0000000106b07551 dnsfwddnsfwd::client::reset() + 257frame #3: 0x0000000106b080be dnsfwd`dnsfwd::client::on_message_size(boost::system::error_code const&, unsigned long) + 590
frame #4: 0x0000000106b09857 dnsfwd`boost::asio::detail::read_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::stream_socket_serviceboost::asio::ip::tcp >, boost::asio::mutable_buffers_1, boost::asio::detail::transfer_all_t, boost::_bi::bind_t<void, boost::_mfi::mf2<void, dnsfwd::client, boost::system::error_code const&, unsigned long>, boost::bi::list3boost::bi::value<std::__1::shared_ptr<dnsfwd::client >, boost::arg<1> ()(), boost::arg<2> ()()> > >::operator()(boost::system::error_code const&, unsigned long, int) + 119
frame #5: 0x0000000106b09b9a dnsfwd`boost::asio::detail::reactive_socket_recv_op<boost::asio::mutable_buffers_1, boost::asio::detail::read_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::stream_socket_serviceboost::asio::ip::tcp >, boost::asio::mutable_buffers_1, boost::asio::detail::transfer_all_t, boost::bi::bind_t<void, boost::mfi::mf2<void, dnsfwd::client, boost::system::error_code const&, unsigned long>, boost::bi::list3boost::bi::value<std::__1::shared_ptr<dnsfwd::client >, boost::arg<1> ()(), boost::arg<2> ()()> > > >::do_complete(boost::asio::detail::task_io_service, boost::asio::detail::task_io_service_operation, boost::system::error_code const&, unsigned long) + 218
frame #6: 0x0000000106b04569 dnsfwd`boost::asio::detail::task_io_service::do_run_one(boost::asio::detail::scoped_lockboost::asio::detail::posix_mutex&, boost::asio::detail::task_io_service_thread_info&, boost::system::error_code const&) + 697
frame #7: 0x0000000106b04065 dnsfwd`boost::asio::detail::task_io_service::run(boost::system::error_code&) + 181
frame #8: 0x0000000106b01174 dnsfwd`main + 9988
frame #9: 0x00007fff90f395ad libdyld.dylib`start + 1
-
(lldb) breakpoint list
Current breakpoints:
1: Exception breakpoint (catch: off throw: on) using: names = {'__cxa_throw', '__cxa_rethrow'}, modules(2) = libc++abi.dylib, libSystem.B.dylib, locations = 2, resolved = 2, hit count = 2
1.1: where = libc++abi.dylib__cxa_throw, address = 0x00007fff9e63aa01, resolved, hit count = 2 1.2: where = libc++abi.dylib__cxa_rethrow, address = 0x00007fff9e63ac2c, resolved, hit count = 0
OK, great!
Could you try replacing:
socket_.shutdown(boost::asio::ip::tcp::socket::shutdown_both);
by:
boost::system::error_code ec; socket_.shutdown(boost::asio::ip::tcp::socket::shutdown_both, ec);
in client.cpp?
Great. So far it's been up and running for over 20 mins.
After about an hour, lookups have started failing, and dnsfwd is emitting a lot of "ERR: Reply received not expected":
sudo dnsfwd --bind-udp 127.0.0.1:53 --connect-tcp 127.0.0.1:5353 --logformat human --loglevel 5 ERR: Reply received not expected ERR: Reply received not expected ERR: Reply received not expected ERR: Reply received not expected ERR: Reply received not expected
lldb is not catching anything:
sudo lldb -d (lldb) env DEBUG=1 (lldb) process attach --name dnsfwd --waitfor Process 65437 stopped
- thread #1: tid = 0x2c7015b, 0x00007fff6c27c000 dyld
_dyld_start, stop reason = signal SIGSTOP frame #0: 0x00007fff6c27c000 dyld_dyld_start dyld`_dyld_start: -> 0x7fff6c27c000 <+0>: popq %rdi 0x7fff6c27c001 <+1>: pushq $0x0 0x7fff6c27c003 <+3>: movq %rsp, %rbp 0x7fff6c27c006 <+6>: andq $-0x10, %rsp
Executable module set to "/usr/lib/dyld". Architecture set to: x86_64-apple-macosx. (lldb) break set -E C++ Breakpoint 1: no locations (pending). (lldb) break set -E C++ Breakpoint 2: no locations (pending). (lldb) continue Process 65437 resuming 2 locations added to breakpoint 1 2 locations added to breakpoint 2 (lldb) bt error: invalid thread (lldb) bt error: invalid thread
after dnsfwd stopped responding
(lldb) break set -E C++ Breakpoint 3: 2 locations. (lldb) bt error: invalid thread (lldb)
Wait, this seems temporary, and may be caused by unexpected responses from unbound:
Nov 04 18:11:24 unbound[61562:3] info: reply from <.> 8.8.8.8#53 Nov 04 18:11:24 unbound[61562:3] info: query response was THROWAWAY Nov 04 18:14:23 unbound[61562:3] info: Capsforid: timeouts, starting fallback Nov 04 18:14:23 unbound[61562:3] info: response for 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.5.0.0.0.5.f.0.1.0.0.2.6.2.ip6.arpa. PTR IN Nov 04 18:14:23 unbound[61562:3] info: reply from <.> 8.8.4.4#53 Nov 04 18:14:23 unbound[61562:3] info: Capsforid: reply is equal. go to next fallback Nov 04 18:14:23 unbound[61562:3] info: response for 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.5.0.0.0.5.f.0.1.0.0.2.6.2.ip6.arpa. PTR IN Nov 04 18:14:23 unbound[61562:3] info: reply from <.> 4.2.2.1#53 Nov 04 18:14:23 unbound[61562:3] info: Capsforid: reply is equal. go to next fallback Nov 04 18:14:23 unbound[61562:3] info: response for 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.5.0.0.0.5.f.0.1.0.0.2.6.2.ip6.arpa. PTR IN Nov 04 18:14:23 unbound[61562:3] info: reply from <.> 4.2.2.2#53 Nov 04 18:14:23 unbound[61562:3] info: Capsforid: reply is equal. go to next fallback Nov 04 18:14:23 unbound[61562:3] info: response for 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.5.0.0.0.5.f.0.1.0.0.2.6.2.ip6.arpa. PTR IN Nov 04 18:14:23 unbound[61562:3] info: reply from <.> 4.2.2.3#53 Nov 04 18:14:23 unbound[61562:3] info: Capsforid: reply is equal. go to next fallback Nov 04 18:14:28 unbound[61562:3] info: response for 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.5.0.0.0.5.f.0.1.0.0.2.6.2.ip6.arpa. PTR IN Nov 04 18:14:28 unbound[61562:3] info: 127.0.0.1 www.google.com. A IN Nov 04 18:14:28 unbound[61562:3] info: resolving www.google.com. A IN Nov 04 18:14:28 unbound[61562:3] info: response for www.google.com. A IN
Now, lookups are working again.
Did it stop responding for all requests or a specific one?
--loglevel 8 might help by providing some more information about what's happening.
ERR: Reply received not expected
Well this is still weird. Having a tcpdump -w foo.pcap "port 53 or port 5353" might help understand what happens and see if it's a problem from our side (beware of privacy issues before handing a dump of your DNS traffic, though).
I think that episode was temporary, since dnsfwd resumed responding after those 3 minutes, and hadn't repeated it since. I think the original problem I opened this issue for has been resolved.
However, while playing with stunnel to external DNS over TLS servers, I managed to stop dnsfwd again. I will describe it below.
The data flow was like your conceptual design for dnsfwd: dnsfwd (listening on UDP 53) -> unbound (listening on UDP 5353) -> stunnel (listening on TCP 1234) -> external DNS over TLS server(s) (some provide their certs, some don't, more on this below)
This setup worked fine, when/if I had a single instance of stunnel connecting to only one external DNS over TLS server, with its cert.
Then, I tried to configure the single instance of stunnel to connect to several external DNS over TLS servers simultaneously (in a round-robin or random fashion), using a single stunnel.conf. It seems there can be only CAfile in the [dns] section. So I combined the SSL certs of those external DNS over TLS servers in one dns.pem file. (There are some other external servers that I do not have SSL certs for, since their owner-operators do not provide them.) This seems to lead to the relevant external servers to first accept the TLS connection, then quickly close the connections. This scenario eventually made dnsfwd unresponsive:
sudo dnsfwd --bind-udp 127.0.0.1:53 --connect-tcp 127.0.0.1:5353 --logformat human --loglevel 8 ... DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: 2 requests dropped, 38 remaining DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: 2 requests dropped, 38 remaining DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: 2 requests dropped, 38 remaining DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: 1 requests dropped, 43 remaining DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: 1 requests dropped, 43 remaining DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: 1 requests dropped, 43 remaining DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: 2 requests dropped, 42 remaining DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: 1 requests dropped, 43 remaining DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: 2 requests dropped, 42 remaining DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: 2 requests dropped, 42 remaining DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Reply reception error: asio.misc:2 DEBUG: Client deleted DEBUG: Request received DEBUG: New client DEBUG: Connecting
dnsfwd never did exit or shut down, it still appeared to be running in the terminal, but no DNS queries were getting answered
I had an lldb instance attached for 5+ hours, until this occured:
sudo lldb -d
(lldb) env DEBUG=1 (lldb) process attach --name dnsfwd --waitfor Process 69929 stopped
- thread #1: tid = 0x2d25743, 0x00007fff66b35000 dyld
_dyld_start, stop reason = signal SIGSTOP frame #0: 0x00007fff66b35000 dyld_dyld_start dyld`_dyld_start: -> 0x7fff66b35000 <+0>: popq %rdi 0x7fff66b35001 <+1>: pushq $0x0 0x7fff66b35003 <+3>: movq %rsp, %rbp 0x7fff66b35006 <+6>: andq $-0x10, %rsp
Executable module set to "/usr/lib/dyld". Architecture set to: x86_64-apple-macosx. (lldb) break set -E C++ Breakpoint 1: no locations (pending). (lldb) continue Process 69929 resuming 2 locations added to breakpoint 1
Below is where dnsfwd stopped
Process 69929 stopped
- thread #1: tid = 0x2d25743, 0x00007fff9e63aa01 libc++abi.dylib
__cxa_throw, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1 frame #0: 0x00007fff9e63aa01 libc++abi.dylib__cxa_throw libc++abi.dylib`__cxa_throw: -> 0x7fff9e63aa01 <+0>: pushq %rbp 0x7fff9e63aa02 <+1>: movq %rsp, %rbp 0x7fff9e63aa05 <+4>: pushq %r15 0x7fff9e63aa07 <+6>: pushq %r14 (lldb) bt - thread #1: tid = 0x2d25743, 0x00007fff9e63aa01 libc++abi.dylib`__cxa_throw, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
- frame #0: 0x00007fff9e63aa01 libc++abi.dylib
__cxa_throw frame #1: 0x000000010ca1bf14 dnsfwdvoid boost::throw_exceptionboost::system::system_error(boost::system::system_error const&) + 196 frame #2: 0x000000010ca283f1 dnsfwddnsfwd::server::on_message(boost::system::error_code const&, unsigned long) + 4945 frame #3: 0x000000010ca28bc0 dnsfwdboost::asio::detail::reactive_socket_recvfrom_op<boost::asio::mutable_buffers_1, boost::asio::generic::basic_endpointboost::asio::generic::datagram_protocol, boost::bi::bind_t<void, boost::mfi::mf2<void, dnsfwd::server, boost::system::error_code const&, unsigned long>, boost::bi::list3boost::bi::value<dnsfwd::server*, boost::arg<1> ()(), boost::arg<2> ()()> > >::do_complete(boost::asio::detail::task_io_service, boost::asio::detail::task_io_service_operation, boost::system::error_code const&, unsigned long) + 128 frame #4: 0x000000010ca20b98 dnsfwdboost::asio::detail::task_io_service::do_run_one(boost::asio::detail::scoped_lock<boost::asio::detail::posix_mutex>&, boost::asio::detail::task_io_service_thread_info&, boost::system::error_code const&) + 504 frame #5: 0x000000010ca20755 dnsfwdboost::asio::detail::task_io_service::run(boost::system::error_code&) + 181 frame #6: 0x000000010ca1f80f dnsfwdboost::asio::io_service::run() + 47 frame #7: 0x000000010ca1eda3 dnsfwdmain + 8787 frame #8: 0x00007fff90f395ad libdyld.dylibstart + 1 frame #9: 0x00007fff90f395ad libdyld.dylibstart + 1
- frame #0: 0x00007fff9e63aa01 libc++abi.dylib
I eventually started using separate stunnel instances for each set of external DNS servers that I have certs for (seems obvious, that stunnel requires the pertinent SSL certs for successful connections, unlike unbound). This had its own set of problems, specifically with connecting to the IPv6 addrs of those external servers. It seems (I have DEBUG = 7 on in stunnel.conf), that because I have IPv6 Privacy Extension on, my IPv6 addr keeps changing (although the changes were much more frequent than usual). This seemed to cause the external servers to first accept the SSL connection, only to suddenly close it. My end immediately connected from a new IPv6 addr (for a new query probably), got the SSL cert accepted, only to get "SSL peer suddenly disconnected" again. In the meantime, connections to the IPv4 addr of the same external servers were ok. The end result was that ~50% of the lookups were failing. This caused traffic jam in dnsfwd like "DEBUG: 2 requests dropped, 42 remaining" too, but I didn't wait to see if dnsfwd would eventually stop [responding] or not, since the stunnel-to-external IPv6 servers issue alone was making the performance pretty miserable AND I didn't have lldb attached.
dnsfwd (listening on UDP 53) -> unbound (listening on UDP 5353) -> stunnel (listening on TCP 1234) -> external DNS over TLS server(s)
I guess there's something wrong here:
- dnsfwd can't use unbound as a upstream over UDP, it can use unbound over TCP ;
- you don't want to have unbound use stunnel as a upstream.
This is because AFAIU ubound doesn't recycle TCP connections between requests for when talking to an upstream server over TCP. This opens one connection per request and at some point the remote server stops talking to you. That's in fact, the whole point of dnsfwd.
The correct setup is: (optional) caching server (i.e. unbound) → dnsfwd → stunnel → remote recursive over TLS. You should probably start without a caching server first.
It seems (I have DEBUG = 7 on in stunnel.conf), that because I have IPv6 Privacy Extension on, my IPv6 addr keeps changing (although the changes were much more frequent than usual). This seemed to cause the external servers to first accept the SSL connection, only to suddenly close it.
This is weird, this IPv6 address is not suppoed to happens to frequently and it shouldn't break stuff.
Below is where dnsfwd stopped
Apparently it stops in __cxa_throw? How did you get this backtrace? It looks a lot like the program was running under a debugger with breakpoint on exceptions.