Optimize DNS performance with lock-free concurrency and connection pooling
Background
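This PR optimizes DNS handling performance in dae by introducing lock-free concurrency to remove DNS bottlenecks under high concurrency. The changes replace traditional mutex-protected data structures with sync.Map and introduce dedicated DNS components for better resource management.
Problem scenarios addressed:
- High CPU usage caused by mutex contention in the DNS resolution path
- Performance degradation under concurrent UDP traffic load
- Memory bloat caused by duplicate DNS handling without state management
- Poor scalability when handling thousands of concurrent connections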
Checklist
- [x] The Pull Request has been fully tested
- [x] There's an entry in the CHANGELOGS
- [x] There is a user-facing docs PR against https://github.com/daeuniverse/dae
Full Changelogs
1. DNS Foundation Components (56fb759)
- feat(dns,udp): add lockless concurrency foundation components
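- Add DnsHandlingStateManager, which uses sync.Map to deduplicate DNS requests and fixes the duplicate-request problem
- Add UdpHealthMonitor for lock-free connection health tracking, improving the stability of UDP DNS queries
- Add DnsForwarderManager with reference counting and lifecycle management to optimize DNS forwarder resource usage
- All components use sync.Map to eliminate lock contention in DNS handling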
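The request deduplication described above can be sketched with sync.Map's LoadOrStore. This is a minimal illustration, not the PR's actual DnsHandlingStateManager: the type handlingState, the methods TryBegin and Finish, and the key format are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// handlingState is a hypothetical stand-in for a DNS request
// deduplicator; it is not the PR's DnsHandlingStateManager.
type handlingState struct {
	inFlight sync.Map // key (string) -> struct{}
}

// TryBegin reports whether the caller is the first to handle key.
// LoadOrStore is atomic, so exactly one concurrent caller gets true.
func (s *handlingState) TryBegin(key string) bool {
	_, loaded := s.inFlight.LoadOrStore(key, struct{}{})
	return !loaded
}

// Finish marks key as no longer in flight so it can be handled again.
func (s *handlingState) Finish(key string) {
	s.inFlight.Delete(key)
}

func main() {
	var s handlingState
	key := "example.com.|A" // assumed key format: qname plus qtype

	fmt.Println(s.TryBegin(key)) // true: first request proceeds
	fmt.Println(s.TryBegin(key)) // false: duplicate is skipped
	s.Finish(key)
	fmt.Println(s.TryBegin(key)) // true: key is free again
}
```

A caller would typically pair TryBegin with a deferred Finish so that a failed query does not leave the key stuck in the map.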
2. DNS Core Optimizations (af2e2c6)
- perf(dns,udp): replace mutex+map with sync.Map in core components
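- Replace upstream2IndexMu + upstream2Index map with sync.Map in the DNS component, directly speeding up DNS upstream lookups
- Replace RWMutex + map with sync.Map in AnyfromPool, improving UDP DNS connection pool performance
- Eliminate lock contention bottlenecks in DNS upstream resolution and UDP pool access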
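As a rough illustration of the mutex+map to sync.Map replacement, the sketch below wraps a sync.Map behind typed accessors. The type upstreamIndex, its methods, and the upstream strings are assumptions made for this example; dae's real upstream2Index field may be shaped differently.

```go
package main

import (
	"fmt"
	"sync"
)

// upstreamIndex maps an upstream identifier to its slot index.
// Hypothetical type for illustration; not dae's actual structure.
type upstreamIndex struct {
	m sync.Map // upstream (string) -> index (int)
}

// Set records the index for an upstream.
func (u *upstreamIndex) Set(upstream string, idx int) {
	u.m.Store(upstream, idx)
}

// Get looks up the index without taking an explicit mutex.
func (u *upstreamIndex) Get(upstream string) (int, bool) {
	v, ok := u.m.Load(upstream)
	if !ok {
		return 0, false
	}
	return v.(int), true
}

func main() {
	var idx upstreamIndex
	idx.Set("udp://dns.example:53", 0)
	idx.Set("tcp://dns.example:53", 1)

	// Many goroutines can read concurrently.
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, ok := idx.Get("udp://dns.example:53"); !ok {
				panic("lookup failed")
			}
		}()
	}
	wg.Wait()

	v, ok := idx.Get("tcp://dns.example:53")
	fmt.Println(v, ok) // 1 true
}
```

The Go documentation notes that sync.Map is optimized for keys that are written once and read many times, which matches an upstream index that is populated early and then mostly read.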
3. UDP DNS Processing Enhancement (d96dc26)
- perf(udp): enhance UDP processing with lockless concurrency
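- Enhance the UDP endpoint pool with sync.Map and health-monitoring integration, improving DNS query connection quality
- Optimize the UDP task pool with increased capacity and lock-free state management, improving concurrent DNS query handling
- Add health-monitoring integration for DNS connection tracking and timeout handling
- Improve DNS task scheduling, error handling, and resource cleanup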
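One way to track endpoint health without a mutex is to keep per-endpoint atomic counters inside a sync.Map. The sketch below assumes a simple "consecutive failures" policy; the names healthMonitor, RecordFailure, RecordSuccess, and Healthy are hypothetical and do not reflect the PR's UdpHealthMonitor API.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// endpointHealth holds per-endpoint state.
type endpointHealth struct {
	consecutiveFailures int32 // accessed only through sync/atomic
}

// healthMonitor is a hypothetical lock-free health tracker.
type healthMonitor struct {
	endpoints   sync.Map // addr (string) -> *endpointHealth
	maxFailures int32    // assumed policy: unhealthy at this many failures
}

func newHealthMonitor(maxFailures int32) *healthMonitor {
	return &healthMonitor{maxFailures: maxFailures}
}

// entry returns the state for addr, creating it on first use.
func (h *healthMonitor) entry(addr string) *endpointHealth {
	if v, ok := h.endpoints.Load(addr); ok {
		return v.(*endpointHealth)
	}
	v, _ := h.endpoints.LoadOrStore(addr, &endpointHealth{})
	return v.(*endpointHealth)
}

// RecordFailure increments the consecutive-failure counter.
func (h *healthMonitor) RecordFailure(addr string) {
	atomic.AddInt32(&h.entry(addr).consecutiveFailures, 1)
}

// RecordSuccess resets the consecutive-failure counter.
func (h *healthMonitor) RecordSuccess(addr string) {
	atomic.StoreInt32(&h.entry(addr).consecutiveFailures, 0)
}

// Healthy reports whether addr is below the failure threshold.
func (h *healthMonitor) Healthy(addr string) bool {
	return atomic.LoadInt32(&h.entry(addr).consecutiveFailures) < h.maxFailures
}

func main() {
	h := newHealthMonitor(2)
	addr := "192.0.2.1:53" // documentation address, for illustration only

	h.RecordFailure(addr)
	fmt.Println(h.Healthy(addr)) // true: one failure, threshold is two
	h.RecordFailure(addr)
	fmt.Println(h.Healthy(addr)) // false: threshold reached
	h.RecordSuccess(addr)
	fmt.Println(h.Healthy(addr)) // true: counter reset
}
```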
4. DNS System Integration (6ff101c)
- feat(control): integrate lockless concurrency optimizations
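- Integrate all new DNS optimization components into the DNS control layer
- Coordinate the connection optimization components for system-level DNS handling
- Improve DNS error handling, resource cleanup, and connection management
- Simplify the DNS control logic while improving DNS performance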
Issue Reference
- Closes #589 - DNS performance issues
- Closes #767 - a similar DNS handling issue
Test Result
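This PR is fairly rough. I am trying to redesign the modules that were flagged as problematic in this PR, so there is no plan to merge it for now. If you are having DNS problems, you are welcome to try this PR.
I applied this PR as a patch. On ImmortalWrt, dae handles a large number of locally originated DNS requests resolving ...ip6.arpa names, and dae's CPU usage stays at 30% to 50%, although it remains usable (it recovered after service dae restart).
The log looks like this: time="Jul 17 02:44:25" level=info msg="....:46858 <-> 123.123.123.123:53" _qname=f.e.f.8.3.1....1.7.0.2.8.8.0.4.2.ip6.arpa. dialer=direct dscp=0 mac="00:00:00:00:00:00" network="udp4(DNS)" outbound=direct pid=1830 pname=rpcd policy=fixed qtype=PTR
I am not sure whether it will reproduce again.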
It reproduced.
I added must_direct for rpcd and tried again:
routing {
pname(dnsmasq, uwsgi, rpcd) -> must_direct
...
It reproduced again, so it does not seem to be related to rpcd.
After dropping this PR and switching to main, the problem no longer reproduces.
I may have broken it yesterday. Give me some time to fix it.
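"After dropping this PR and switching to main, the problem no longer reproduces."
I don't know why you can't reproduce it on main. I still see the same behavior on v1.0.0.
These reverse lookups originate from ImmortalWrt. Some are IPv4 PTR queries that reverse-resolve LAN hostnames, and the rest are PTR queries for LAN IPv6 addresses. I don't understand why you would not see them on main. Perhaps it depends on how your rules are written.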