dnsfwd icon indicating copy to clipboard operation
dnsfwd copied to clipboard

dnsfwd shutdown after running for a while

Open vollkommenheit opened this issue 9 years ago • 18 comments

So far, dnsfwd has shut down with two different errors, IF it has answered dns queries:

dnsfwd --bind-udp 127.0.0.1:9999 --connect-tcp 127.0.0.1:53 <2>shutdown: Socket is not connected

dnsfwd --bind-udp 127.0.0.1:9999 --connect-tcp 127.0.0.1:53 --logformat human --loglevel 5 CRIT: shutdown: Socket is not connected

It had managed to stay up for hours, when no query was thrown at it.

vollkommenheit avatar Nov 02 '16 01:11 vollkommenheit

How long (approximately) is this? I'll try and see if I can reproduce this. This is using Mac OS X with a local unbound ? I don't have a MacOS X available but I can try on a FreeBSD VM.

Which version of boost are you using?

You might want to try using another upstream TCP DNS server.

A stacktrace of the exception throw point might be useful. With GDB:

catch throw
commands
backtrace
end

with a debug build (CMAKE_BUILD_TYPE=Debug).

randomstuff avatar Nov 02 '16 22:11 randomstuff

According to my unbound logs, this latest run lasted exactly 4 minutes 29 seconds, before unbound stopped receiving queries from dnsfwd.

vollkommenheit avatar Nov 03 '16 00:11 vollkommenheit

Apparently had two Cmake and two boost installed:

boost/1.60.0_2/ boost/1.61.0/

cmake/3.6.1 cmake/3.5.2

Removed all of those, and installing cmake 3.6.2 and boost 1.62.0 now.

cmake -D CMAKE_BUILD_TYPE:STRING=Debug .

The resulting debug binary is bigger than the previous one:

398K Nov 2 21:32 ./dnsfwd 304K Oct 31 18:56 /usr/local/bin/dnsfwd

vollkommenheit avatar Nov 03 '16 01:11 vollkommenheit

catch throw commands backtrace end

When are those gdb commands supposed to be executed?

I'm getting this, with or without dnsfwd running:

[100%] Built target dnsfwd Install the project... -- Install configuration: "Debug" -- Installing: /usr/local/bin/dnsfwd

gdb $(which dnsfwd) OR sudo gdb --args dnsfwd --bind-udp 127.0.0.1:53 --connect-tcp 127.0.0.1:5353 --logformat human --loglevel 5

GNU gdb (GDB) 7.12 Copyright (C) 2016 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-apple-darwin15.6.0". Type "show configuration" for configuration details. For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/. Find the GDB manual and other documentation resources online at: http://www.gnu.org/software/gdb/documentation/. For help, type "help". Type "apropos word" to search for commands related to "word"... Reading symbols from /usr/local/bin/dnsfwd... warning: `/tmp/lto.o': can't open to read symbols: No such file or directory. (no debugging symbols found)...done. (gdb) catch throw Catchpoint 1 (throw) (gdb) commands Type commands for breakpoint(s) 1, one per line. End with a line saying just "end". (gdb) backtrace No stack. (gdb)

Tried adding this to CMakeLists.txt (according to https://github.com/Samsung/jerryscript/issues/760) and recompiled, to no avail:

option(ENABLE_LTO "Enable LTO build" OFF) option(ENABLE_ALL_IN_ONE "Enable ALL_IN_ONE build" ON) set(PLATFORM_EXT "DARWIN") set(EXTERNAL_BUILD FALSE)

Another run shut itself down in 6 mins 1 sec.

vollkommenheit avatar Nov 03 '16 02:11 vollkommenheit

Going to try the OS X equivalent to gdb, called lldb, per https://www.mail-archive.com/[email protected]/msg10784.html

When I run lldb in either of those two ways below, the dnsfwd options such as "--bind-udp 127.0.0.1:53" don't seem to be recognized:

sudo lldb dnsfwd (lldb) target create "dnsfwd" Current executable set to 'dnsfwd' (x86_64). (lldb) r --bind-udp 127.0.0.1:53 --connect-tcp 127.0.0.1:5353 --logformat human --loglevel 5 Process 65577 launched: '/usr/local/bin/dnsfwd' (x86_64)

sudo lldb -- /usr/local/bin/dnsfwd --bind-udp 127.0.0.1:53 --connect-tcp 127.0.0.1:5353 --logformat human --loglevel 5

dig +dnssec debian.org @127.0.0.1

; <<>> DiG 9.8.3-P1 <<>> +dnssec debian.org @127.0.0.1 ;; global options: +cmd ;; connection timed out; no servers could be reached

vollkommenheit avatar Nov 03 '16 21:11 vollkommenheit

(gdb) backtrace No stack.

This is weird. You shouldn't get a prompt there and the command should not execute immediately.

randomstuff avatar Nov 03 '16 22:11 randomstuff

(gdb) backtrace No stack.

Could that be because of the part earlier?

Reading symbols from /usr/local/bin/dnsfwd... warning: `/tmp/lto.o': can't open to read symbols: No such file or directory. (no debugging symbols found)...done.

vollkommenheit avatar Nov 03 '16 22:11 vollkommenheit

sudo gdb --args /usr/local/bin/dnsfwd --bind-udp 127.0.0.1:53 --connect-tcp 127.0.0.1:5353 --logformat human --loglevel 5 ... Reading symbols from /usr/local/bin/dnsfwd... warning: `/tmp/lto.o': can't open to read symbols: No such file or directory. (no debugging symbols found)...done. (gdb) r Starting program: /usr/local/bin/dnsfwd --bind-udp 127.0.0.1:53 --connect-tcp 127.0.0.1:5353 --logformat human --loglevel 5 Unable to find Mach task port for process-id 66660: (os/kern) failure (0x5). (please check gdb is codesigned - see taskgated(8))

sudo gdb --args /usr/local/bin/dnsfwd --bind-udp 127.0.0.1:5353 --connect-tcp 127.0.0.1:53 --logformat human --loglevel 5 Password: GNU gdb (GDB) 7.12 Copyright (C) 2016 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-apple-darwin15.6.0". Type "show configuration" for configuration details. For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/. Find the GDB manual and other documentation resources online at: http://www.gnu.org/software/gdb/documentation/. For help, type "help". Type "apropos word" to search for commands related to "word"... Reading symbols from /usr/local/bin/dnsfwd... warning: /tmp/lto.o': can't open to read symbols: No such file or directory. (no debugging symbols found)...done. (gdb) r Starting program: /usr/local/bin/dnsfwd --bind-udp 127.0.0.1:5353 --connect-tcp 127.0.0.1:53 --logformat human --loglevel 5 warning:/private/tmp/boost-20161007-83343-ugak3n/boost_1_62_0/bin.v2/libs/system/build/darwin-4.2.1/release/error_code.o': can't open to read symbols: No such file or directory. warning: /private/tmp/boost-20161007-83343-ugak3n/boost_1_62_0/bin.v2/libs/program_options/build/darwin-4.2.1/release/cmdline.o': can't open to read symbols: No such file or directory. warning:/private/tmp/boost-20161007-83343-ugak3n/boost_1_62_0/bin.v2/libs/program_options/build/darwin-4.2.1/release/config_file.o': can't open to read symbols: No such file or directory. warning: /private/tmp/boost-20161007-83343-ugak3n/boost_1_62_0/bin.v2/libs/program_options/build/darwin-4.2.1/release/convert.o': can't open to read symbols: No such file or directory. warning:/private/tmp/boost-20161007-83343-ugak3n/boost_1_62_0/bin.v2/libs/program_options/build/darwin-4.2.1/release/options_description.o': can't open to read symbols: No such file or directory. warning: /private/tmp/boost-20161007-83343-ugak3n/boost_1_62_0/bin.v2/libs/program_options/build/darwin-4.2.1/release/parsers.o': can't open to read symbols: No such file or directory. warning:/private/tmp/boost-20161007-83343-ugak3n/boost_1_62_0/bin.v2/libs/program_options/build/darwin-4.2.1/release/positional_options.o': can't open to read symbols: No such file or directory. warning: /private/tmp/boost-20161007-83343-ugak3n/boost_1_62_0/bin.v2/libs/program_options/build/darwin-4.2.1/release/split.o': can't open to read symbols: No such file or directory. warning:/private/tmp/boost-20161007-83343-ugak3n/boost_1_62_0/bin.v2/libs/program_options/build/darwin-4.2.1/release/utf8_codecvt_facet.o': can't open to read symbols: No such file or directory. warning: /private/tmp/boost-20161007-83343-ugak3n/boost_1_62_0/bin.v2/libs/program_options/build/darwin-4.2.1/release/value_semantic.o': can't open to read symbols: No such file or directory. warning:/private/tmp/boost-20161007-83343-ugak3n/boost_1_62_0/bin.v2/libs/program_options/build/darwin-4.2.1/release/variables_map.o': can't open to read symbols: No such file or directory. CRIT: bind: Address already in use Inferior 1 (process 66881) exited with code 01

vollkommenheit avatar Nov 03 '16 23:11 vollkommenheit

Well, here's how far I've gotten with lldb, which apparently is not quite like gdb:

http://lldb.llvm.org/lldb-gdb.html http://lldb.llvm.org/tutorial.html https://stackoverflow.com/questions/8122375/breakpoint-on-exceptions-equivalent-of-gdbs-catch-throw https://debugging-with-lldb.blogspot.com/

in one terminal session:

lldb -d (lldb) env DEBUG=1 (lldb) process attach --name dnsfwd --waitfor

in another terminal, launch dnsfwd

dnsfwd --bind-udp 127.0.0.1:53530 --connect-tcp 127.0.0.1:53 --logformat human --loglevel 5

back in the lldb terminal; at this point, dnsfwd seems to have stopped responding to dns queries, even though it continues to run and doesn't exit (shut down):

Process 78107 stopped

  • thread #1: tid = 0x2ad7ffa, 0x00007fff9fd56eca libsystem_kernel.dylibkevent + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP frame #0: 0x00007fff9fd56eca libsystem_kernel.dylibkevent + 10 libsystem_kernel.dylib`kevent: -> 0x7fff9fd56eca <+10>: jae 0x7fff9fd56ed4 ; <+20> 0x7fff9fd56ecc <+12>: movq %rax, %rdi 0x7fff9fd56ecf <+15>: jmp 0x7fff9fd507cd ; cerror_nocancel 0x7fff9fd56ed4 <+20>: retq

Executable module set to "/usr/local/bin/dnsfwd". Architecture set to: x86_64-apple-macosx.

"break set -E C++" seems to be equivalent to gdb's "catch throw"

(lldb) break set -E C++ Breakpoint 1: 2 locations.

I haven't found the lldb equivalent to gdb's "commands"; lldb has a "command" command

"bt" is lldb equivalent to gdb's "backtrace"

(lldb) bt

  • thread #1: tid = 0x2ad7ffa, 0x00007fff9fd56eca libsystem_kernel.dylib`kevent + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    • frame #0: 0x00007fff9fd56eca libsystem_kernel.dylibkevent + 10 frame #1: 0x000000010a2e872f dnsfwdboost::asio::detail::kqueue_reactor::run(bool, boost::asio::detail::op_queueboost::asio::detail::task_io_service_operation&) + 239 frame #2: 0x000000010a2e83f7 dnsfwdboost::asio::detail::task_io_service::do_run_one(boost::asio::detail::scoped_lock<boost::asio::detail::posix_mutex>&, boost::asio::detail::task_io_service_thread_info&, boost::system::error_code const&) + 327 frame #3: 0x000000010a2e8065 dnsfwdboost::asio::detail::task_io_service::run(boost::system::error_code&) + 181 frame #4: 0x000000010a2e5174 dnsfwdmain + 9988 frame #5: 0x00007fff90f395ad libdyld.dylibstart + 1 frame #6: 0x00007fff90f395ad libdyld.dylib`start + 1

Since dnsfwd has already stopped responding, as soon as lldb attaches, I don't see how to trigger its eventual shutdown, like under normal non-debug conditions.

vollkommenheit avatar Nov 04 '16 04:11 vollkommenheit

Got it.

lldb -d (lldb) env DEBUG=1 (lldb) process attach --name dnsfwd --waitfor Process 79879 stopped

  • thread #1: tid = 0x2ae26fd, 0x00007fff9fd56eca libsystem_kernel.dylibkevent + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP frame #0: 0x00007fff9fd56eca libsystem_kernel.dylibkevent + 10 libsystem_kernel.dylib`kevent: -> 0x7fff9fd56eca <+10>: jae 0x7fff9fd56ed4 ; <+20> 0x7fff9fd56ecc <+12>: movq %rax, %rdi 0x7fff9fd56ecf <+15>: jmp 0x7fff9fd507cd ; cerror_nocancel 0x7fff9fd56ed4 <+20>: retq

Executable module set to "/usr/local/bin/dnsfwd". Architecture set to: x86_64-apple-macosx.

(lldb) break set -E C++ Breakpoint 1: 2 locations. (lldb) continue Process 79879 resuming (lldb) bt error: invalid thread

Process 79879 stopped

  • thread #1: tid = 0x2ae26fd, 0x00007fff9e63aa01 libc++abi.dylib__cxa_throw, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1 frame #0: 0x00007fff9e63aa01 libc++abi.dylib__cxa_throw libc++abi.dylib`__cxa_throw: -> 0x7fff9e63aa01 <+0>: pushq %rbp 0x7fff9e63aa02 <+1>: movq %rsp, %rbp 0x7fff9e63aa05 <+4>: pushq %r15 0x7fff9e63aa07 <+6>: pushq %r14

(lldb) bt

  • thread #1: tid = 0x2ae26fd, 0x00007fff9e63aa01 libc++abi.dylib`__cxa_throw, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    • frame #0: 0x00007fff9e63aa01 libc++abi.dylib`__cxa_throw

      frame #1: 0x0000000106b025ef dnsfwdvoid boost::throw_exception<boost::system::system_error>(boost::system::system_error const&) + 79 frame #2: 0x0000000106b07551 dnsfwddnsfwd::client::reset() + 257

      frame #3: 0x0000000106b080be dnsfwd`dnsfwd::client::on_message_size(boost::system::error_code const&, unsigned long) + 590

      frame #4: 0x0000000106b09857 dnsfwd`boost::asio::detail::read_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::stream_socket_serviceboost::asio::ip::tcp >, boost::asio::mutable_buffers_1, boost::asio::detail::transfer_all_t, boost::_bi::bind_t<void, boost::_mfi::mf2<void, dnsfwd::client, boost::system::error_code const&, unsigned long>, boost::bi::list3boost::bi::value<std::__1::shared_ptr<dnsfwd::client >, boost::arg<1> ()(), boost::arg<2> ()()> > >::operator()(boost::system::error_code const&, unsigned long, int) + 119

      frame #5: 0x0000000106b09b9a dnsfwd`boost::asio::detail::reactive_socket_recv_op<boost::asio::mutable_buffers_1, boost::asio::detail::read_op<boost::asio::basic_stream_socket<boost::asio::ip::tcp, boost::asio::stream_socket_serviceboost::asio::ip::tcp >, boost::asio::mutable_buffers_1, boost::asio::detail::transfer_all_t, boost::bi::bind_t<void, boost::mfi::mf2<void, dnsfwd::client, boost::system::error_code const&, unsigned long>, boost::bi::list3boost::bi::value<std::__1::shared_ptr<dnsfwd::client >, boost::arg<1> ()(), boost::arg<2> ()()> > > >::do_complete(boost::asio::detail::task_io_service, boost::asio::detail::task_io_service_operation, boost::system::error_code const&, unsigned long) + 218

      frame #6: 0x0000000106b04569 dnsfwd`boost::asio::detail::task_io_service::do_run_one(boost::asio::detail::scoped_lockboost::asio::detail::posix_mutex&, boost::asio::detail::task_io_service_thread_info&, boost::system::error_code const&) + 697

      frame #7: 0x0000000106b04065 dnsfwd`boost::asio::detail::task_io_service::run(boost::system::error_code&) + 181

      frame #8: 0x0000000106b01174 dnsfwd`main + 9988

      frame #9: 0x00007fff90f395ad libdyld.dylib`start + 1

(lldb) breakpoint list Current breakpoints: 1: Exception breakpoint (catch: off throw: on) using: names = {'__cxa_throw', '__cxa_rethrow'}, modules(2) = libc++abi.dylib, libSystem.B.dylib, locations = 2, resolved = 2, hit count = 2 1.1: where = libc++abi.dylib__cxa_throw, address = 0x00007fff9e63aa01, resolved, hit count = 2 1.2: where = libc++abi.dylib__cxa_rethrow, address = 0x00007fff9e63ac2c, resolved, hit count = 0

vollkommenheit avatar Nov 04 '16 05:11 vollkommenheit

OK, great!

Could you try replacing:

socket_.shutdown(boost::asio::ip::tcp::socket::shutdown_both);

by:

boost::system::error_code ec; socket_.shutdown(boost::asio::ip::tcp::socket::shutdown_both, ec);

in client.cpp?

randomstuff avatar Nov 04 '16 20:11 randomstuff

Great. So far it's been up and running for over 20 mins.

vollkommenheit avatar Nov 04 '16 21:11 vollkommenheit

After about an hour, lookups have started failing, and dnsfwd is emitting a lot of "ERR: Reply received not expected":

sudo dnsfwd --bind-udp 127.0.0.1:53 --connect-tcp 127.0.0.1:5353 --logformat human --loglevel 5 ERR: Reply received not expected ERR: Reply received not expected ERR: Reply received not expected ERR: Reply received not expected ERR: Reply received not expected

lldb is not catching anything:

sudo lldb -d (lldb) env DEBUG=1 (lldb) process attach --name dnsfwd --waitfor Process 65437 stopped

  • thread #1: tid = 0x2c7015b, 0x00007fff6c27c000 dyld_dyld_start, stop reason = signal SIGSTOP frame #0: 0x00007fff6c27c000 dyld_dyld_start dyld`_dyld_start: -> 0x7fff6c27c000 <+0>: popq %rdi 0x7fff6c27c001 <+1>: pushq $0x0 0x7fff6c27c003 <+3>: movq %rsp, %rbp 0x7fff6c27c006 <+6>: andq $-0x10, %rsp

Executable module set to "/usr/lib/dyld". Architecture set to: x86_64-apple-macosx. (lldb) break set -E C++ Breakpoint 1: no locations (pending). (lldb) break set -E C++ Breakpoint 2: no locations (pending). (lldb) continue Process 65437 resuming 2 locations added to breakpoint 1 2 locations added to breakpoint 2 (lldb) bt error: invalid thread (lldb) bt error: invalid thread

after dnsfwd stopped responding

(lldb) break set -E C++ Breakpoint 3: 2 locations. (lldb) bt error: invalid thread (lldb)

Wait, this seems temporary, and may be caused by unexpected responses from unbound:

Nov 04 18:11:24 unbound[61562:3] info: reply from <.> 8.8.8.8#53 Nov 04 18:11:24 unbound[61562:3] info: query response was THROWAWAY Nov 04 18:14:23 unbound[61562:3] info: Capsforid: timeouts, starting fallback Nov 04 18:14:23 unbound[61562:3] info: response for 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.5.0.0.0.5.f.0.1.0.0.2.6.2.ip6.arpa. PTR IN Nov 04 18:14:23 unbound[61562:3] info: reply from <.> 8.8.4.4#53 Nov 04 18:14:23 unbound[61562:3] info: Capsforid: reply is equal. go to next fallback Nov 04 18:14:23 unbound[61562:3] info: response for 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.5.0.0.0.5.f.0.1.0.0.2.6.2.ip6.arpa. PTR IN Nov 04 18:14:23 unbound[61562:3] info: reply from <.> 4.2.2.1#53 Nov 04 18:14:23 unbound[61562:3] info: Capsforid: reply is equal. go to next fallback Nov 04 18:14:23 unbound[61562:3] info: response for 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.5.0.0.0.5.f.0.1.0.0.2.6.2.ip6.arpa. PTR IN Nov 04 18:14:23 unbound[61562:3] info: reply from <.> 4.2.2.2#53 Nov 04 18:14:23 unbound[61562:3] info: Capsforid: reply is equal. go to next fallback Nov 04 18:14:23 unbound[61562:3] info: response for 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.5.0.0.0.5.f.0.1.0.0.2.6.2.ip6.arpa. PTR IN Nov 04 18:14:23 unbound[61562:3] info: reply from <.> 4.2.2.3#53 Nov 04 18:14:23 unbound[61562:3] info: Capsforid: reply is equal. go to next fallback Nov 04 18:14:28 unbound[61562:3] info: response for 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.5.0.0.0.5.f.0.1.0.0.2.6.2.ip6.arpa. PTR IN Nov 04 18:14:28 unbound[61562:3] info: 127.0.0.1 www.google.com. A IN Nov 04 18:14:28 unbound[61562:3] info: resolving www.google.com. A IN Nov 04 18:14:28 unbound[61562:3] info: response for www.google.com. A IN

Now, lookups are working again.

vollkommenheit avatar Nov 04 '16 22:11 vollkommenheit

Did it stop responding for all requests or a specific one?

--loglevel 8 might help by providing some more information about what's happening.

ERR: Reply received not expected

Well this is still weird. Having a tcpdump -w foo.pcap "port 53 or port 5353" might help understand what happens and see if it's a problem from our side (beware of privacy issues before handing a dump of your DNS traffic, though).

randomstuff avatar Nov 04 '16 22:11 randomstuff

I think that episode was temporary, since dnsfwd resumed responding after those 3 minutes, and hadn't repeated it since. I think the original problem I opened this issue for has been resolved.

However, while playing with stunnel to external DNS over TLS servers, I managed to stop dnsfwd again. I will describe it below.

The data flow was like your conceptual design for dnsfwd: dnsfwd (listening on UDP 53) -> unbound (listening on UDP 5353) -> stunnel (listening on TCP 1234) -> external DNS over TLS server(s) (some provide their certs, some don't, more on this below)

This setup worked fine, when/if I had a single instance of stunnel connecting to only one external DNS over TLS server, with its cert.

Then, I tried to configure the single instance of stunnel to connect to several external DNS over TLS servers simultaneously (in a round-robin or random fashion), using a single stunnel.conf. It seems there can be only CAfile in the [dns] section. So I combined the SSL certs of those external DNS over TLS servers in one dns.pem file. (There are some other external servers that I do not have SSL certs for, since their owner-operators do not provide them.) This seems to lead to the relevant external servers to first accept the TLS connection, then quickly close the connections. This scenario eventually made dnsfwd unresponsive:

sudo dnsfwd --bind-udp 127.0.0.1:53 --connect-tcp 127.0.0.1:5353 --logformat human --loglevel 8 ... DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: 2 requests dropped, 38 remaining DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: 2 requests dropped, 38 remaining DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: 2 requests dropped, 38 remaining DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: 1 requests dropped, 43 remaining DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: 1 requests dropped, 43 remaining DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: 1 requests dropped, 43 remaining DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: 2 requests dropped, 42 remaining DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: 1 requests dropped, 43 remaining DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: 2 requests dropped, 42 remaining DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: 2 requests dropped, 42 remaining DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Request received DEBUG: Forwarding request DEBUG: Request forwarded DEBUG: Reply reception error: asio.misc:2 DEBUG: Client deleted DEBUG: Request received DEBUG: New client DEBUG: Connecting

dnsfwd never did exit or shut down, it still appeared to be running in the terminal, but no DNS queries were getting answered

I had an lldb instance attached for 5+ hours, until this occured:

sudo lldb -d

(lldb) env DEBUG=1 (lldb) process attach --name dnsfwd --waitfor Process 69929 stopped

  • thread #1: tid = 0x2d25743, 0x00007fff66b35000 dyld_dyld_start, stop reason = signal SIGSTOP frame #0: 0x00007fff66b35000 dyld_dyld_start dyld`_dyld_start: -> 0x7fff66b35000 <+0>: popq %rdi 0x7fff66b35001 <+1>: pushq $0x0 0x7fff66b35003 <+3>: movq %rsp, %rbp 0x7fff66b35006 <+6>: andq $-0x10, %rsp

Executable module set to "/usr/lib/dyld". Architecture set to: x86_64-apple-macosx. (lldb) break set -E C++ Breakpoint 1: no locations (pending). (lldb) continue Process 69929 resuming 2 locations added to breakpoint 1

Below is where dnsfwd stopped

Process 69929 stopped

  • thread #1: tid = 0x2d25743, 0x00007fff9e63aa01 libc++abi.dylib__cxa_throw, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1 frame #0: 0x00007fff9e63aa01 libc++abi.dylib__cxa_throw libc++abi.dylib`__cxa_throw: -> 0x7fff9e63aa01 <+0>: pushq %rbp 0x7fff9e63aa02 <+1>: movq %rsp, %rbp 0x7fff9e63aa05 <+4>: pushq %r15 0x7fff9e63aa07 <+6>: pushq %r14 (lldb) bt
  • thread #1: tid = 0x2d25743, 0x00007fff9e63aa01 libc++abi.dylib`__cxa_throw, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    • frame #0: 0x00007fff9e63aa01 libc++abi.dylib__cxa_throw frame #1: 0x000000010ca1bf14 dnsfwdvoid boost::throw_exceptionboost::system::system_error(boost::system::system_error const&) + 196 frame #2: 0x000000010ca283f1 dnsfwddnsfwd::server::on_message(boost::system::error_code const&, unsigned long) + 4945 frame #3: 0x000000010ca28bc0 dnsfwdboost::asio::detail::reactive_socket_recvfrom_op<boost::asio::mutable_buffers_1, boost::asio::generic::basic_endpointboost::asio::generic::datagram_protocol, boost::bi::bind_t<void, boost::mfi::mf2<void, dnsfwd::server, boost::system::error_code const&, unsigned long>, boost::bi::list3boost::bi::value<dnsfwd::server*, boost::arg<1> ()(), boost::arg<2> ()()> > >::do_complete(boost::asio::detail::task_io_service, boost::asio::detail::task_io_service_operation, boost::system::error_code const&, unsigned long) + 128 frame #4: 0x000000010ca20b98 dnsfwdboost::asio::detail::task_io_service::do_run_one(boost::asio::detail::scoped_lock<boost::asio::detail::posix_mutex>&, boost::asio::detail::task_io_service_thread_info&, boost::system::error_code const&) + 504 frame #5: 0x000000010ca20755 dnsfwdboost::asio::detail::task_io_service::run(boost::system::error_code&) + 181 frame #6: 0x000000010ca1f80f dnsfwdboost::asio::io_service::run() + 47 frame #7: 0x000000010ca1eda3 dnsfwdmain + 8787 frame #8: 0x00007fff90f395ad libdyld.dylibstart + 1 frame #9: 0x00007fff90f395ad libdyld.dylibstart + 1

I eventually started using separate stunnel instances for each set of external DNS servers that I have certs for (seems obvious, that stunnel requires the pertinent SSL certs for successful connections, unlike unbound). This had its own set of problems, specifically with connecting to the IPv6 addrs of those external servers. It seems (I have DEBUG = 7 on in stunnel.conf), that because I have IPv6 Privacy Extension on, my IPv6 addr keeps changing (although the changes were much more frequent than usual). This seemed to cause the external servers to first accept the SSL connection, only to suddenly close it. My end immediately connected from a new IPv6 addr (for a new query probably), got the SSL cert accepted, only to get "SSL peer suddenly disconnected" again. In the meantime, connections to the IPv4 addr of the same external servers were ok. The end result was that ~50% of the lookups were failing. This caused traffic jam in dnsfwd like "DEBUG: 2 requests dropped, 42 remaining" too, but I didn't wait to see if dnsfwd would eventually stop [responding] or not, since the stunnel-to-external IPv6 servers issue alone was making the performance pretty miserable AND I didn't have lldb attached.

vollkommenheit avatar Nov 05 '16 14:11 vollkommenheit

dnsfwd (listening on UDP 53) -> unbound (listening on UDP 5353) -> stunnel (listening on TCP 1234) -> external DNS over TLS server(s)

I guess there's something wrong here:

  1. dnsfwd can't use unbound as a upstream over UDP, it can use unbound over TCP ;
  2. you don't want to have unbound use stunnel as a upstream.

This is because AFAIU ubound doesn't recycle TCP connections between requests for when talking to an upstream server over TCP. This opens one connection per request and at some point the remote server stops talking to you. That's in fact, the whole point of dnsfwd.

The correct setup is: (optional) caching server (i.e. unbound) → dnsfwd → stunnel → remote recursive over TLS. You should probably start without a caching server first.

randomstuff avatar Nov 06 '16 22:11 randomstuff

It seems (I have DEBUG = 7 on in stunnel.conf), that because I have IPv6 Privacy Extension on, my IPv6 addr keeps changing (although the changes were much more frequent than usual). This seemed to cause the external servers to first accept the SSL connection, only to suddenly close it.

This is weird, this IPv6 address is not suppoed to happens to frequently and it shouldn't break stuff.

randomstuff avatar Nov 06 '16 22:11 randomstuff

Below is where dnsfwd stopped

Apparently it stops in __cxa_throw? How did you get this backtrace? It looks a lot like the program was running under a debugger with breakpoint on exceptions.

randomstuff avatar Nov 06 '16 22:11 randomstuff