esp-idf dns_clear_cache() causes lockup (IDFGH-13375)

Answers checklist.

[X] I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
[X] I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
[X] I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

v5.2.2

Espressif SoC revision.

ESP32-S3 (QFN56) (revision v0.1)

Operating System used.

Linux

How did you build your project?

Command line with idf.py

If you are using Windows, please specify command line type.

None

Development Kit.

Custom board

Power Supply used.

External 5V

What is the expected behavior?

Calling dns_clear_cache() should just clear the DNS cache, so that the next lookup performs a fresh DNS query against the remote DNS server.

What is the actual behavior?

If dns_clear_cache() is called from another thread (rather than the lwIP task), it may cause lockups.

Someone on the forum also spotted the same issue, see: https://www.esp32.com/viewtopic.php?t=25239

Steps to reproduce.

Run a WebSocket RPC client on a network interface for a while.
When the network interface connection goes down (e.g. someone unplugs the Ethernet cable), the program stops the WebSocket client, clears the DNS cache with dns_clear_cache() and switches to another network interface.
Start the WebSocket client again. It now locks up when it queries the DNS.
With lwIP debug logging enabled, the log may (but does not always) stop at the "sending DNS request ID" line and stay there forever.

Debug Logs.

D (19448) lwip: dns_enqueue: "some.internal.domain": use DNS entry 0

I (19468) netmgr: Refresh: netif @ idx=0: en1; 0x3fca2b38, conn? yea, priority=2
D (19468) lwip: dns_enqueue: "some.internal.domain": use DNS pcb 0

I (19478) netmgr: Refresh: netif @ idx=1: pp2; 0x3fca28e8, conn? nah, priority=1
D (19478) lwip: dns_send: dns_servers[0] "some.internal.domain": request

W (19498) netmgr: Default NETIF set to 0x3fca2b38 "en1"
D (19498) lwip: sending DNS request ID 57115 for name "some.internal.domain" to server 0

More Information.

Also see: https://www.esp32.com/viewtopic.php?t=25239

A possible workaround for now is to issue DNS_TABLE_SIZE DNS requests by repeatedly calling gethostbyname() with different valid host names, which should push the old entries out of the cache. DNS_TABLE_SIZE is currently 4. But this wastes more data, and the IT team definitely won't be happy to see those extra lookups when they do security auditing!
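
For illustration, here is a minimal host-side sketch of that workaround. It assumes each lookup of a distinct name takes over one slot of the DNS table, which is not verified in this report. The names `resolve_fn`, `evict_dns_cache_by_lookups`, `stub_resolve` and `stub_calls` are hypothetical and not part of lwIP or ESP-IDF; only DNS_TABLE_SIZE (4) comes from the text above.

```c
#include <assert.h>
#include <stddef.h>

#define DNS_TABLE_SIZE 4  /* value quoted in this report */

/* Hypothetical resolver signature: returns 0 on success, non-zero on failure. */
typedef int (*resolve_fn)(const char *host);

/* Issue one lookup per DNS table slot, each with a different host name.
 * Returns the number of lookups that succeeded. */
size_t evict_dns_cache_by_lookups(const char *const hosts[DNS_TABLE_SIZE],
                                  resolve_fn resolve)
{
    size_t ok = 0;
    for (size_t i = 0; i < DNS_TABLE_SIZE; i++) {
        if (resolve(hosts[i]) == 0) {
            ok++;
        }
    }
    return ok;
}

/* Off-target stub so the loop can be tried without a network:
 * it only counts how many lookups were requested. */
int stub_calls = 0;

int stub_resolve(const char *host)
{
    (void)host;
    stub_calls++;
    return 0;
}
```

On the target, `resolve` would presumably be a thin wrapper that calls gethostbyname() and returns 0 when the result is non-NULL. The host names passed in are up to the application, and every one of them costs a real DNS query, which is the data and auditing concern mentioned above.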

Aug 01 '24 02:08 huming2207

Sorry, I was in a rush earlier and I misread that forum post. Even if dns_clear_cache() runs in the LwIP thread, it may still cause a lockup. I'm also checking whether clearing the DNS cache by setting the TTL to 0 would work, which was initially mentioned by Linetkux Wang.

Aug 01 '24 04:08 huming2207

I dug in a bit more. It looks like if a DNS query is ongoing and dns_clear_cache() is called, or something else changes the DNS cache table entry's state to DNS_STATE_UNUSED, then LwIP locks up after the DNS response comes back.

Aug 02 '24 03:08 huming2207

any updates on this? @espressif-abhikroy

Aug 15 '24 04:08 huming2207

> any updates on this? @espressif-abhikroy

@huming2207 Thank you for bringing this issue to our attention. It seems that the dns_clear_cache(void) function currently performs a simple memset() without executing dns_call_found(i, NULL);. This leads to the dns_table database being erased, along with the stored callback, which in turn causes tcpip_send_msg_wait_sem() to block indefinitely. As a result, functions like gethostbyname() and other netdb APIs will experience blocking.
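
The mechanism described above can be sketched with a small stand-alone model. This is not the lwIP source: the table layout, `struct waiter`, and every function name below are simplified, hypothetical stand-ins. Only the idea is taken from the comment above, namely that the callback of a pending entry has to be invoked with a NULL result before the table is wiped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DNS_TABLE_SIZE 4

enum { DNS_STATE_UNUSED = 0, DNS_STATE_ASKING = 1 };

/* Callback invoked when a lookup finishes; addr == NULL means "not found". */
typedef void (*found_cb)(const char *name, const void *addr, void *arg);

struct entry {
    int state;
    const char *name;
    found_cb found;  /* wakes the caller that is blocked on this lookup */
    void *arg;
};

static struct entry table[DNS_TABLE_SIZE];

/* Stand-in for the semaphore a blocked gethostbyname() caller waits on. */
struct waiter { bool signalled; };

void wake_waiter(const char *name, const void *addr, void *arg)
{
    (void)name;
    (void)addr;
    ((struct waiter *)arg)->signalled = true;
}

/* Register an in-flight query in slot i. */
void start_query(int i, const char *name, struct waiter *w)
{
    table[i].state = DNS_STATE_ASKING;
    table[i].name = name;
    table[i].found = wake_waiter;
    table[i].arg = w;
}

/* Behaviour described above: wipe everything, stored callbacks included. */
void clear_cache_memset_only(void)
{
    memset(table, 0, sizeof(table));
}

/* Idea of the fix: fail every pending query first, then wipe the table. */
void clear_cache_notify_first(void)
{
    for (int i = 0; i < DNS_TABLE_SIZE; i++) {
        if (table[i].state != DNS_STATE_UNUSED && table[i].found != NULL) {
            table[i].found(table[i].name, NULL, table[i].arg);
        }
    }
    memset(table, 0, sizeof(table));
}
```

In this model, a waiter registered with `start_query()` is never signalled after `clear_cache_memset_only()`, which corresponds to the caller blocking forever. After `clear_cache_notify_first()` the waiter is signalled with a "not found" result and can return an error to its caller.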

I am actively working on a fix, and it will be available shortly. Thank you for your patience.

Aug 16 '24 13:08 abhik-roy85

Hi @espressif-abhikroy

Thanks for the reply!

> This leads to the dns_table database being erased, along with the stored callback, which in turn causes tcpip_send_msg_wait_sem() to block indefinitely. As a result, functions like gethostbyname() and other netdb APIs will experience blocking.

That sounds a bit problematic...😅 I'll wait for the fix. Thanks.

Regards, Jackson

Aug 16 '24 23:08 huming2207

May I also ask whether this fix is planned to be backported to IDF v5.3?

Sep 06 '24 02:09 huming2207

Hi @espressif-abhikroy

Any updates on this issue? Sorry for pushing, but this is indeed affecting one of our products...

Oct 08 '24 03:10 huming2207

We are in the process of merging the fix, and it will be available soon. I apologize for the delay. In the meantime, you can apply this patch locally to resolve the issue: dns_clear_cache_fix.patch

Oct 09 '24 09:10 abhik-roy85

> We are in the process of merging the fix, and it will be available soon. I apologize for the delay. In the meantime, you can apply this patch locally to resolve the issue:

Hi @espressif-abhikroy Thanks for the follow-up. I think I did something similar to your patch, but it didn't quite work for us. Maybe I got something wrong. I will see if I can arrange some experiments later and let you know the outcome.

Oct 09 '24 10:10 huming2207

Also @espressif-abhikroy, another idea I came up with is that in our case we may not need to (and shouldn't need to) cache DNS results on the ESP32 at all. I think it would be nice if we could add a Kconfig option to disable this cache feature completely. I will try working on a pull request later if I have some time.

Oct 09 '24 10:10 huming2207

@espressif-abhikroy I found https://github.com/espressif/esp-lwip/commit/b15cd2de75d408f9f813367571143b9bcff20738 in the esp-lwip 2.2.0-esp branch, but the fix is not yet in esp-idf. Also note that the release branches of esp-idf use the esp-lwip 2.1.2-esp branch, which does not include the above fix.

BTW, on Aug 16 you mentioned the fix would be available shortly: https://github.com/espressif/esp-idf/issues/14287#issuecomment-2293483630. The fix is so slow.

Oct 29 '24 06:10 AxelLin

> BTW, on Aug 16 you mentioned the fix would be available shortly: #14287 (comment). The fix is so slow.

Yeah, agreed. I have to point out that this issue should have been escalated and prioritised, as it really hurts anyone who uses two or more different network PHYs with two or more different ISPs. For example, using WiFi to connect to a corporate network plus cellular as a backup. It locks up the whole firmware while switching between the networks, not just the network stack.

CC @igrr

Oct 29 '24 23:10 huming2207

Also, I think this needs to be backported to esp-lwip 2.1.x for IDF v5.2 and v5.3, not just v2.2.0. We can't take the risk of using the ESP-IDF master branch.

Oct 29 '24 23:10 huming2207

Fixed in master: e4c92855eea7a1c8db9dda83297c7e9c9e195eb5. I'm still waiting for the fix for release branches.

Nov 05 '24 05:11 AxelLin

> Also, I think this needs to be backported to esp-lwip 2.1.x for IDF v5.2 and v5.3, not just v2.2.0. We can't take the risk of using the ESP-IDF master branch.

@espressif-abhikroy What is the status of backporting to the release branches? I still can't find this fix in esp-lwip 2.1.3-esp, so it's not clear to me whether you will upgrade esp-lwip to 2.2.0-esp for the release branches or fix esp-lwip 2.1.3-esp. The release branches are not stable if fixes arrive this slowly.

Nov 19 '24 02:11 AxelLin

@david-cermak Any chance you could take a look at why this fix is not yet available in the release branches?

Jan 10 '25 14:01 AxelLin

Sorry for the trouble. It was merged to v5.1 yesterday and will be included in v5.1.6; it will be merged to v5.2 next week. PS: v5.4 (f75e399) and v5.3 (6d18437) are already fixed.

Jan 10 '25 15:01 david-cermak

Thanks for reporting, and sorry for the slow turnaround. The fix on release/5.2 is available at https://github.com/espressif/esp-idf/commit/3dd245fbc4b15b4fcb9824efd6bb6c5bf72f4ef3 and the fix on release/5.1 is available at https://github.com/espressif/esp-idf/commit/2f9661bcdde9217a674a0325a205d94e298e67c9. Feel free to reopen.

Jan 26 '25 03:01 Alvin1Zhang

Hi @Alvin1Zhang and @david-cermak

I think this fix is broken again on ESP-IDF v6.0, see: https://github.com/espressif/esp-protocols/issues/932

Nov 06 '25 00:11 huming2207