https_dns_proxy
A persistent timeout occurs when the network changes (static IP changed or PPPoE reconnected).
What happened
When using https-dns-proxy on OpenWrt, I encountered a troubling issue: whenever the WAN interface reconnects using the PPPoE protocol, DNS often becomes unresponsive. Specifically, both dig @router and dig @127.0.0.1 -p 5053 (with 5053 being the listening port of https-dns-proxy) time out.
I’ve investigated this issue in depth and found that it can be easily reproduced on Ubuntu 24.04 as well.
How to reproduce
- Install Ubuntu 24.04 (I used the Desktop version).
- Configure the network to use a static IP address, e.g. 192.168.9.100.
- Open two terminal windows and run one of the following commands in each:
./https_dns_proxy -r https://dns.alidns.com/dns-query -p 5053 -4 -m 10 -vvv
while true; do dig 1.2.3.4.sslip.io @127.0.0.1 -p 5053 ; sleep 1; done
- Change the static IP address, for example, to 192.168.9.101.
- You should observe output like this (captured with the logging fix from this PR applied):
[D] 1749047614.457423 https_client.c:219 E50A: * Operation timed out after 5004 milliseconds with 0 bytes received
[D] 1749047614.457494 https_client.c:219 E50A: * Connection #0 to host dns.alidns.com left intact
[D] 1749047614.457508 https_client.c:620 Released used io event: 0x7ffe22471f90
[W] 1749047614.457520 https_client.c:360 E50A: curl request failed with 28: Timeout was reached
[W] 1749047614.457537 https_client.c:362 E50A: curl error message: Operation timed out after 5004 milliseconds with 0 bytes received
[W] 1749047614.457548 https_client.c:376 E50A: Connecting and sending request to resolver was successful, but no response was sent back
[D] 1749047614.457555 https_client.c:446 E50A: CURLINFO_NUM_CONNECTS: 0
[D] 1749047614.457559 https_client.c:458 E50A: CURLINFO_EFFECTIVE_URL: https://dns.alidns.com/dns-query
[D] 1749047614.457568 https_client.c:495 E50A: Times: 0.000033, 0.000000, 0.000000, 0.000143, 0.000000, 5.004893
[I] 1749047614.457583 https_client.c:517 E50A: Response was faulty, skipping DNS reply
[D] 1749047614.457588 main.c:83 Received response for id: E50A, len: 0
[D] 1749047614.457627 main.c:112 Received request for id: E50A, len: 57
[D] 1749047614.457649 https_client.c:261 E50A: Requesting HTTP/2
[D] 1749047614.457678 https_client.c:219 E50A: * RESOLVE dns.alidns.com:443 - old addresses discarded
[D] 1749047614.457698 https_client.c:219 E50A: * Added dns.alidns.com:443:223.5.5.5,223.6.6.6 to DNS cache
[D] 1749047614.457747 https_client.c:219 E50A: * Found bundle for host: 0x559ad0c37f30 [can multiplex]
[D] 1749047614.457762 https_client.c:219 E50A: * Re-using existing connection with host dns.alidns.com
[D] 1749047614.457806 https_client.c:219 E50A: * [HTTP/2] [27] OPENED stream for https://dns.alidns.com/dns-query
[D] 1749047614.457816 https_client.c:219 E50A: * [HTTP/2] [27] [:method: POST]
[D] 1749047614.457823 https_client.c:219 E50A: * [HTTP/2] [27] [:scheme: https]
[D] 1749047614.457828 https_client.c:219 E50A: * [HTTP/2] [27] [:authority: dns.alidns.com]
[D] 1749047614.457833 https_client.c:219 E50A: * [HTTP/2] [27] [:path: /dns-query]
[D] 1749047614.457854 https_client.c:219 E50A: * [HTTP/2] [27] [user-agent: https_dns_proxy/0.3]
[D] 1749047614.457860 https_client.c:219 E50A: * [HTTP/2] [27] [accept: application/dns-message]
[D] 1749047614.457866 https_client.c:219 E50A: * [HTTP/2] [27] [content-type: application/dns-message]
[D] 1749047614.457872 https_client.c:219 E50A: * [HTTP/2] [27] [content-length: 57]
[D] 1749047614.457961 https_client.c:219 E50A: > POST /dns-query HTTP/2
[D] 1749047614.457975 https_client.c:219 E50A: > Host: dns.alidns.com
[D] 1749047614.457980 https_client.c:219 E50A: > User-Agent: https_dns_proxy/0.3
[D] 1749047614.457984 https_client.c:219 E50A: > Accept: application/dns-message
[D] 1749047614.458045 https_client.c:219 E50A: > Content-Type: application/dns-message
[D] 1749047614.458057 https_client.c:219 E50A: > Content-Length: 57
[D] 1749047614.458067 https_client.c:170 E50A: > 0000: e5 0a 01 20 00 01 00 00 00 00 00 01 01 31 01 32 ... .........1.2
[D] 1749047614.458077 https_client.c:170 E50A: > 0010: 01 33 01 34 05 73 73 6c 69 70 02 69 6f 00 00 01 .3.4.sslip.io...
[D] 1749047614.458087 https_client.c:170 E50A: > 0020: 00 01 00 00 29 04 d0 00 00 00 00 00 0c 00 0a 00 ....)...........
[D] 1749047614.458102 https_client.c:170 E50A: > 0030: 08 57 d6 2c 4d ff 74 cb 32 .W.,M.t.2
[D] 1749047614.458122 https_client.c:633 Reserved new io event: 0x7ffe22471f90
...about 5s later...
[D] 1749047619.461386 https_client.c:219 E50A: * Operation timed out after 5003 milliseconds with 0 bytes received
[D] 1749047619.461450 https_client.c:219 E50A: * Connection #0 to host dns.alidns.com left intact
[D] 1749047619.461461 https_client.c:620 Released used io event: 0x7ffe22471f90
[W] 1749047619.461471 https_client.c:360 E50A: curl request failed with 28: Timeout was reached
[W] 1749047619.461476 https_client.c:362 E50A: curl error message: Operation timed out after 5003 milliseconds with 0 bytes received
[W] 1749047619.461483 https_client.c:376 E50A: Connecting and sending request to resolver was successful, but no response was sent back
[D] 1749047619.461488 https_client.c:446 E50A: CURLINFO_NUM_CONNECTS: 0
[D] 1749047619.461491 https_client.c:458 E50A: CURLINFO_EFFECTIVE_URL: https://dns.alidns.com/dns-query
[D] 1749047619.461498 https_client.c:495 E50A: Times: 0.000058, 0.000000, 0.000000, 0.000400, 0.000000, 5.003270
[I] 1749047619.461509 https_client.c:517 E50A: Response was faulty, skipping DNS reply
[D] 1749047619.461512 main.c:83 Received response for id: E50A, len: 0
[D] 1749047619.462845 main.c:112 Received request for id: E50A, len: 57
[D] 1749047619.462857 https_client.c:261 E50A: Requesting HTTP/2
[D] 1749047619.462870 https_client.c:219 E50A: * RESOLVE dns.alidns.com:443 - old addresses discarded
[D] 1749047619.462873 https_client.c:219 E50A: * Added dns.alidns.com:443:223.5.5.5,223.6.6.6 to DNS cache
[D] 1749047619.462890 https_client.c:219 E50A: * Found bundle for host: 0x559ad0c37f30 [can multiplex]
[D] 1749047619.462894 https_client.c:219 E50A: * Re-using existing connection with host dns.alidns.com
[D] 1749047619.462907 https_client.c:219 E50A: * [HTTP/2] [29] OPENED stream for https://dns.alidns.com/dns-query
[D] 1749047619.462911 https_client.c:219 E50A: * [HTTP/2] [29] [:method: POST]
[D] 1749047619.462912 https_client.c:219 E50A: * [HTTP/2] [29] [:scheme: https]
[D] 1749047619.462913 https_client.c:219 E50A: * [HTTP/2] [29] [:authority: dns.alidns.com]
[D] 1749047619.462915 https_client.c:219 E50A: * [HTTP/2] [29] [:path: /dns-query]
[D] 1749047619.462916 https_client.c:219 E50A: * [HTTP/2] [29] [user-agent: https_dns_proxy/0.3]
[D] 1749047619.462921 https_client.c:219 E50A: * [HTTP/2] [29] [accept: application/dns-message]
[D] 1749047619.462922 https_client.c:219 E50A: * [HTTP/2] [29] [content-type: application/dns-message]
[D] 1749047619.462924 https_client.c:219 E50A: * [HTTP/2] [29] [content-length: 57]
[D] 1749047619.462954 https_client.c:219 E50A: > POST /dns-query HTTP/2
[D] 1749047619.462957 https_client.c:219 E50A: > Host: dns.alidns.com
[D] 1749047619.462958 https_client.c:219 E50A: > User-Agent: https_dns_proxy/0.3
[D] 1749047619.462959 https_client.c:219 E50A: > Accept: application/dns-message
[D] 1749047619.462960 https_client.c:219 E50A: > Content-Type: application/dns-message
[D] 1749047619.462961 https_client.c:219 E50A: > Content-Length: 57
[D] 1749047619.462963 https_client.c:170 E50A: > 0000: e5 0a 01 20 00 01 00 00 00 00 00 01 01 31 01 32 ... .........1.2
[D] 1749047619.462965 https_client.c:170 E50A: > 0010: 01 33 01 34 05 73 73 6c 69 70 02 69 6f 00 00 01 .3.4.sslip.io...
[D] 1749047619.462967 https_client.c:170 E50A: > 0020: 00 01 00 00 29 04 d0 00 00 00 00 00 0c 00 0a 00 ....)...........
[D] 1749047619.462969 https_client.c:170 E50A: > 0030: 08 57 d6 2c 4d ff 74 cb 32 .W.,M.t.2
[D] 1749047619.462974 https_client.c:633 Reserved new io event: 0x7ffe22471f90
Analysis
It is clear that using the -m parameter (CURLOPT_MAXAGE_CONN) does not alleviate the problem. As long as dig keeps running, the timeouts can persist for a long time (after 10 minutes of testing it was still timing out).
This issue is likely related to HTTP/2 TCP connection reuse. When forcing HTTP/1.1 with the -x option, the timeout only lasts for a short period.
I suspect this issue is related to TCP half-open connections. Note that in the reproduction steps it is necessary to change to a different IP address; simply toggling the Ethernet interface is not enough.
Possible Solutions
Replacing CURLOPT_MAXAGE_CONN with CURLOPT_MAXLIFETIME_CONN could resolve this issue.
According to this function, CURLOPT_MAXLIFETIME_CONN compares conn->created with the current time, which avoids the issue where conn->lastused (the timestamp checked by CURLOPT_MAXAGE_CONN) keeps getting refreshed by "send-only" activity.
Related issues
https://github.com/aarond10/https_dns_proxy/issues/106
This issue is similar to the problem at hand and provides a temporary workaround by restarting the service on OpenWrt. However, I believe this solution is not ideal.
https://github.com/curl/curl/issues/3132
This issue includes an in-depth discussion of the problem and suggests that using HTTP/2 PING frames might be a potential solution, though I haven't seen a concrete implementation provided.
Hi, first of all, thank you very much for the pull request and the very detailed ticket!
I personally don't like the idea of using CURLOPT_MAXLIFETIME_CONN, since re-using the connection forever would be the goal, not re-opening it regularly. In my network, when I'm working from home, my proxy uses the very same connection to Cloudflare for around 5 hours. I also use the -m 400 parameter, since that is Cloudflare's max keepalive idle time.
A similar issue has been reported before: #152. The resolution was to restart the proxy when the WAN IP changes. I suppose you can automate that somehow.
So at the moment I recommend the proxy-restart workaround and would not make code changes in this repo.
Br, Balázs
Thank you for your reply, @baranyaib90.
I also agree that CURLOPT_MAXLIFETIME_CONN is not an ideal option, since when DoH requests are successful, we want the TCP connection to remain open as long as possible. I believe the root cause of the current issue is that CURLOPT_MAXAGE_CONN cannot handle the TCP half-open problem after a WAN IP change. This issue seems like it should be addressed upstream (i.e., by curl itself); however, I'm not sure whether curl has the ability or motivation to fix it.
Given the current situation, we need to find a practical workaround. Compared to OpenWrt, I believe https_dns_proxy is a more appropriate place to address this, because as shown in the example above, the issue can be easily reproduced even on Ubuntu. If the fix is made only on OpenWrt, other operating systems will not benefit from it.
As for the specific solution, I believe https_dns_proxy could actively trigger https_client_reset() when it detects continuous DNS query timeouts or failures. Clearly, implementing this in https_dns_proxy wouldn't be difficult. The specific implementation details can be discussed once there is a decision to proceed with the solution.
In real-world TCP/IP networking practice, it's quite common for upper-layer protocols to enhance reliability when the underlying protocols can't provide sufficient availability. For example, although the Linux operating system typically ensures TCP keepalive, it's still recommended to implement heartbeat mechanisms at the application layer when using WebSocket (see: https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API/Writing_WebSocket_servers#pings_and_pongs_the_heartbeat_of_websockets).
For myself, since my PPPoE changes IP every time it reconnects, and devices like NAS on my LAN network continuously make DNS requests, this causes a significant period of DNS unavailability each time. As a result, I have had to temporarily disable the DoH feature. Therefore, I kindly ask you to reconsider this issue, as it causes confusion for me and many other OpenWrt users.
Thank you very much.
Hi, OK, I won't give up on this issue easily then. Checking the timeout frequency is not a bad idea, but maybe I have a slightly more sophisticated one. I would like to know your opinion on this: If a curl timeout happens, I would query CURLINFO_LOCAL_IP of the request handle. Then I would check with getifaddrs() whether any of the host's interfaces has that IP. If the IP is not found, I would call https_client_reset(). Br, Balázs
Using getifaddrs() only retrieves the IP addresses of the local machine. This approach is relatively straightforward and should address some cases, but its general applicability may be limited.
Consider the following scenario (though I haven't had a chance to test it yet): OpenWrt acts as a gateway using PPPoE, and there's an Ubuntu machine on the LAN running https_dns_proxy. When the WAN PPPoE reconnects, the Ubuntu machine's LAN IP remains unchanged, so getifaddrs() wouldn't be able to handle this case.
In this PR, I fixed the logging of curl_result_code, and confirmed through testing that the error code is CURLE_OPERATION_TIMEDOUT when a timeout occurs.
Maybe you could add a new CLI argument like max_timeout_count, which means resetting after the number of timeouts exceeds N.
OK, I got your point, although it is still not simple to me how the logic should be implemented to cut the connection. If someone sets the max_timeout_count CLI argument that you suggested to 30 and there is a large DNS burst (which happens often while browsing the web), the proxy could cut the connection prematurely if there is just some disturbance with the WAN connection (and not a permanent failure). On the contrary, during the night, reaching 30 timeouts could take minutes. So at the moment it is not straightforward to implement and needs time to think through properly. Because of lack of free time I will put this on hold on my virtual task list. So sorry, but don't expect a solution soon. Obviously, if you figure out some reliable solution, you could contribute it again in a pull request and we can consider it.
I agree with your concern. We can first focus on defining a reasonable solution; as for the code implementation, either I or any other volunteer can give it a try. (To be honest, I'm not a C expert, so I can't guarantee I'll be able to complete it, but I'm happy to explore it.)
Just thought of this:
Instead of counting timeouts, we could add a new option like max_failure_duration (or failure_reset_threshold / failure_reset_delay), which means: if all DNS requests time out for that duration, reset the connection.
This avoids false resets during short bursts but still recovers from real outages.
https://github.com/curl/curl/pull/17613
Hi, I think I have an acceptable proposal for you. But first some insights: At first, when there is no connection to be reused, the timeout is 10 sec for a new connection to open and to serve the first request. After that, the timeout is 5 sec for a request to be processed when reusing a connection. When a timeout happens by any means, I would start a 10 sec timer. If that runs out, I would call https_client_reset(). But if any request succeeds while the timer is running, I would cancel it. This would work fine if the WAN IP changes, but not when there are some working connections and at least one non-working one; then the troubled connections would not be closed. To solve this as well, I could simply not cancel the reset timer even if some requests succeed. Is this acceptable? Br, Balázs
If you are interested in the proposed code: https://github.com/baranyaib90/https_dns_proxy/commit/0649594d97a5bb38d8ae596f75239fb6aa13c7b8 I have not tested it at all! May not work. I will continue after 1 week.
Hi @baranyaib90, Thanks for your continued efforts!
I’ve tested your commit in the WAN IP change scenario, and I can confirm that it does work in that case. I really appreciate your time digging into this issue.
Regarding your idea of starting a 10s timer after any timeout and always resetting, I think it may lead to false resets on unstable networks like cellular, where temporary high latency can cause timeouts even though the connection is still functional.
In my opinion, it's sufficient if we focus on reliably detecting situations where all DNS queries are timing out, as that alone would effectively cover link-layer disruptions such as IP changes or PPPoE reconnections.
Regarding the code implementation, I have a few suggestions:
- Could we provide a CLI argument to configure the 10s timeout?
- I'm not sure if repeat is the right field here. It seems to make the timer run every 10 seconds instead of just once. Please double-check its usage.
I’m really looking forward to the official merge of this feature. Thanks!
On cellular, I would rather see otherwise-unneeded reconnects than a completely non-working package. My operator has an insanely low timer for idle TCP connections (30 seconds).
Hi @patrakov, my suggested solution to fix this problem would not change the -m max_idle_time option, a new option will be introduced:
-L conn_loss_time Time in seconds to tolerate connection timeouts of reused connections.
This option mitigates half-open TCP connection issue (e.g. WAN IP change).
(Default: 15, Min: 5, Max: 60)
I suppose with this option you can tweak the proxy to work in your network correctly. Br, Balázs