FTLDNS Server not responding every minute
See https://discourse.pi-hole.net/t/fltdns-server-not-responding-every-minute/83484
Diagnosis summary from debugging:
- The Pi-hole FTL component periodically becomes unresponsive for several seconds (up to nearly 30 seconds), causing DNS requests to fail.
- System call tracing (strace) and thread inspection (gdb, pstack) showed that a critical thread (the housekeeper, running GC_thread in gc.c) frequently blocks on acquiring a shared memory mutex (_lock_shm in shmem.c), waiting for up to tens of seconds.
- This mutex only protects shared memory region access, but code holding the lock may accidentally include slow operations, making all other threads (such as those serving DNS) unable to proceed for prolonged periods.
- None of the other threads appear to be running inside the locked section, reinforcing the analysis that the lock is either held too long or not released promptly due to a code path in the maintenance or garbage collection routines.
- Recommendation: Review the code in GC_thread and shmem.c so that only essential, fast operations are done under the shared-memory lock. Move slow I/O, disk, and network operations outside the locked region. Add debug logging for lock durations to monitor (see the sketch after this summary).
TL;DR:
The issue is almost certainly caused by the garbage collection/housekeeper thread holding a shared memory lock for too long, starving other threads and causing server unresponsiveness. Refactoring the code to minimize time spent in the lock should resolve or greatly mitigate the issue.
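(For illustration of the "log lock durations" suggestion, here is a minimal sketch of a timing wrapper around a pthread mutex. The names timed_lock/timed_unlock and the 100 ms threshold are hypothetical; FTL's real shared-memory locking in shmem.c is more involved than a plain mutex.)

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t shm_lock = PTHREAD_MUTEX_INITIALIZER;
static struct timespec lock_acquired;

static double elapsed_ms(const struct timespec *a, const struct timespec *b)
{
    return (b->tv_sec - a->tv_sec) * 1e3 + (b->tv_nsec - a->tv_nsec) / 1e6;
}

/* Hypothetical wrapper: log when a caller had to wait long for the lock */
static void timed_lock(const char *caller)
{
    struct timespec before, after;
    clock_gettime(CLOCK_MONOTONIC, &before);
    pthread_mutex_lock(&shm_lock);
    clock_gettime(CLOCK_MONOTONIC, &after);

    const double waited = elapsed_ms(&before, &after);
    if(waited > 100.0) // arbitrary threshold for "suspiciously slow"
        fprintf(stderr, "%s waited %.1f ms for the SHM lock\n", caller, waited);
    lock_acquired = after;
}

/* Hypothetical wrapper: log when the lock was held for a long time */
static void timed_unlock(const char *caller)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);

    const double held = elapsed_ms(&lock_acquired, &now);
    if(held > 100.0)
        fprintf(stderr, "%s held the SHM lock for %.1f ms\n", caller, held);
    pthread_mutex_unlock(&shm_lock);
}
```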
No idea if this AI output really helps.
And what I noticed now is that /usr/bin/pihole-FTL is at 100% on one core every minute, probably blocking DNS!
Holding the SHM lock during disk/db operations is causing the block/high CPU. Split your logic so only in-memory ops are protected. Move disk/db work OUTSIDE the lock.
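(In code terms, that advice boils down to the pattern below: snapshot the shared data under the lock, then do the slow disk/database write unlocked. query_t, copy_new_queries and write_queries_to_db are hypothetical stand-ins, not FTL's actual API.)

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct { int id; /* ... further query fields ... */ } query_t;

extern pthread_mutex_t shm_lock;
extern size_t copy_new_queries(query_t **out);               /* fast, memory-only */
extern void write_queries_to_db(const query_t *q, size_t n); /* slow, disk I/O */

void export_queries(void)
{
    query_t *snapshot = NULL;

    /* Fast part: hold the lock only long enough to copy the
     * in-memory queries into a private buffer. */
    pthread_mutex_lock(&shm_lock);
    const size_t count = copy_new_queries(&snapshot);
    pthread_mutex_unlock(&shm_lock);

    /* Slow part: database I/O runs unlocked, so the DNS-serving
     * threads can still take the lock while we write. */
    write_queries_to_db(snapshot, count);
    free(snapshot);
}
```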
Can you share some excerpts from your pihole-FTL.log file while things are not responding? Maybe worth enabling debug.all so we can get a complete picture.
Is this a very busy server? Looking at your debug log from Discourse, it appears to be very highly spec'd, so I'm trying to imagine the volume of queries passing through it that could cause lockups.
One thing that does happen every minute (by default) is that FTL will store all in-memory queries to the disk database. This shouldn't take long, but I guess this depends on how many queries are going through the system...
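(As a rough sketch of why that per-minute export shouldn't take long: wrapping all inserts in a single transaction lets SQLite sync to disk once per batch instead of once per row. The queries(id) table here is a made-up example, not FTL's actual long-term database schema.)

```c
#include <sqlite3.h>
#include <stdio.h>

int flush_batch(sqlite3 *db, const int *ids, int n)
{
    sqlite3_stmt *stmt = NULL;

    /* One transaction for the whole batch: one fsync instead of n */
    sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
    sqlite3_prepare_v2(db, "INSERT INTO queries(id) VALUES (?)", -1, &stmt, NULL);

    for(int i = 0; i < n; i++)
    {
        sqlite3_bind_int(stmt, 1, ids[i]);
        if(sqlite3_step(stmt) != SQLITE_DONE)
        {
            fprintf(stderr, "insert failed: %s\n", sqlite3_errmsg(db));
            sqlite3_finalize(stmt);
            sqlite3_exec(db, "ROLLBACK", NULL, NULL, NULL);
            return -1;
        }
        sqlite3_reset(stmt);
    }

    sqlite3_finalize(stmt);
    sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);
    return 0;
}
```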
I have just tried to reproduce on my own machine with a script throwing some 250 q/s at its peak, but no lockups.
I attached the log file. I reduced the database size from ~8 GB to ~4 GB and 30 days of history. It's an improvement, but I still have some non-response seconds.
Di 18. Nov 00:50:56 CET 2025: UDP Port 53 on 192.168.1.100 responding
Di 18. Nov 00:50:57 CET 2025: UDP Port 53 on 192.168.1.100 responding
Di 18. Nov 00:50:59 CET 2025: UDP Port 53 on 192.168.1.100 responding
Di 18. Nov 00:51:02 CET 2025: UDP Port 53 on 192.168.1.100 not responding
Di 18. Nov 00:51:03 CET 2025: UDP Port 53 on 192.168.1.100 responding
Di 18. Nov 00:51:04 CET 2025: UDP Port 53 on 192.168.1.100 responding
Di 18. Nov 00:51:05 CET 2025: UDP Port 53 on 192.168.1.100 responding
So check out the log around 00:51:02 and let me know what you think. PS: Pi-hole runs in a Proxmox LXC container, but I don't think this is a big issue. The server is busy, but not THAT busy that it should block DNS, I would say.
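(For reference, the log above looks like the output of a once-per-second UDP probe; a minimal version of such a probe could look like the sketch below. The probed address 192.168.1.100 is from the log; the query name example.com and the one-second timeout are assumptions, since the actual script isn't shown in the thread.)

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    /* Minimal DNS query: header (id 0x1234, RD flag, 1 question),
     * QNAME example.com, QTYPE A, QCLASS IN */
    const unsigned char query[] = {
        0x12, 0x34, 0x01, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        7, 'e','x','a','m','p','l','e', 3, 'c','o','m', 0,
        0x00, 0x01, 0x00, 0x01
    };

    struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(53) };
    inet_pton(AF_INET, "192.168.1.100", &dst.sin_addr);

    for(;;)
    {
        /* Fresh socket per probe so a late reply can't satisfy the next one */
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };
        setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

        sendto(sock, query, sizeof(query), 0, (struct sockaddr *)&dst, sizeof(dst));

        unsigned char reply[512];
        printf("UDP Port 53 on 192.168.1.100 %s\n",
               recv(sock, reply, sizeof(reply), 0) > 0 ? "responding" : "not responding");

        close(sock);
        sleep(1);
    }
}
```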
Actually @DocMAX - I hadn't realised there was already some work towards fixing this on the branch tweak/dont-lock-on-export (see https://github.com/pi-hole/FTL/pull/2700 for details)
If you are running a native install - you can run pihole checkout ftl tweak/dont-lock-on-export to see if that fixes the issue you are seeing (pihole checkout master will bring you back to the released version)
Oh thanks. I hadn't realised either. Glad to see I am not alone.
Edit: It's "pihole checkout ftl tweak/dont-lock-on-export" by the way...
Thanks - I was confusing scripts!
Edit2: Looks good, no DNS hiccups anymore...
So the AI was wrong (here). The AI response in the Discourse thread was more spot-on:
Holding the SHM lock during disk/db operations is causing the block/high CPU. Split your logic so only in-memory ops are protected. Move disk/db work OUTSIDE the lock.
The issue is indeed with exporting queries to disk. The housekeeper is a rather performant part of FTL that has been optimized a lot, as it is known to be a "critical path".