FTLDNS server not responding every minute

Open DocMAX opened this issue 3 months ago • 10 comments

See https://discourse.pi-hole.net/t/fltdns-server-not-responding-every-minute/83484

DocMAX avatar Nov 17 '25 14:11 DocMAX

Diagnosis summary from debugging:

  • The Pi-hole FTL component periodically becomes unresponsive for several seconds (up to nearly 30 seconds), causing DNS requests to fail.
  • System call tracing (strace) and thread inspection (gdb, pstack) showed that a critical thread (housekeeper, running GC_thread in gc.c) frequently blocks on acquiring a shared memory mutex (_lock_shm in shmem.c), waiting for up to tens of seconds.
  • This mutex only protects shared memory region access, but code holding the lock may accidentally include slow operations, making all other threads (such as those serving DNS) unable to proceed for prolonged periods.
  • None of the other threads appear to be running inside the locked section, reinforcing the analysis that the lock is either held too long or not released promptly due to a code path in maintenance or garbage collection routines.
  • Recommendation: Review the code in GC_thread and shmem.c so that only essential, fast operations are done under the shared-memory lock. Move slow I/O, disk, and network operations outside the locked region. Add debug logging for lock durations to monitor.
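The last recommendation (logging how long the lock is waited for and held) could look roughly like the sketch below. It is a Python illustration only, not FTL code (FTL is written in C); `timed_lock`, the `warn_after` threshold, and the logger name are hypothetical.

```python
import logging
import threading
import time
from contextlib import contextmanager

log = logging.getLogger("lock_timing")  # hypothetical logger name


@contextmanager
def timed_lock(lock, name, warn_after=0.1):
    """Acquire `lock` and record how long we waited for it and held it.

    Yields a dict that is filled in with "waited" and "held" (seconds).
    A warning is logged if either value exceeds `warn_after`.
    """
    stats = {"waited": 0.0, "held": 0.0}
    start = time.monotonic()
    lock.acquire()
    acquired = time.monotonic()
    stats["waited"] = acquired - start
    try:
        yield stats
    finally:
        lock.release()
        stats["held"] = time.monotonic() - acquired
        if stats["waited"] > warn_after or stats["held"] > warn_after:
            log.warning("%s: waited %.3fs, held %.3fs",
                        name, stats["waited"], stats["held"])
```

Wrapping each critical section in `with timed_lock(shm_lock, "housekeeper"):` would show which caller holds the lock for seconds at a time, which is the evidence needed before refactoring anything.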

TL;DR:

The issue is almost certainly caused by the garbage collection/housekeeper thread holding a shared memory lock for too long, starving other threads and causing server unresponsiveness. Refactoring the code to minimize time spent in the lock should resolve or greatly mitigate the issue.

DocMAX avatar Nov 17 '25 15:11 DocMAX

No idea if this AI-generated analysis really helps.

DocMAX avatar Nov 17 '25 15:11 DocMAX

And what I noticed now is that /usr/bin/pihole-FTL is at 100% on one core every minute, probably blocking DNS!

DocMAX avatar Nov 17 '25 15:11 DocMAX

Holding the SHM lock during disk/db operations is causing the block/high CPU. Split your logic so only in-memory ops are protected. Move disk/db work OUTSIDE the lock.
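A minimal sketch of that advice, assuming the data to export can be detached from the shared state cheaply. This is Python for illustration, not FTL's C code; `QueryBuffer`, `export`, and `append_to_file` are hypothetical names, and a plain `threading.Lock` stands in for the SHM lock.

```python
import threading


class QueryBuffer:
    """Hypothetical in-memory query store guarded by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for the SHM lock
        self._pending = []

    def add(self, query):
        # Fast in-memory operation: fine to do under the lock.
        with self._lock:
            self._pending.append(query)

    def export(self, sink):
        # Critical section: swap the list out. No I/O happens here.
        with self._lock:
            batch, self._pending = self._pending, []
        # Slow part (disk/db) runs without the lock, so add() is not blocked.
        if batch:
            sink(batch)
        return len(batch)


def append_to_file(path):
    """Return a sink that appends one query per line to `path`."""
    def sink(batch):
        with open(path, "a", encoding="utf-8") as fh:
            fh.writelines(f"{q}\n" for q in batch)
    return sink
```

Calling `buf.export(append_to_file("queries.log"))` once a minute keeps the lock held only for the duration of the list swap. A real implementation would also have to decide what happens to the batch if the write fails.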

DocMAX avatar Nov 17 '25 16:11 DocMAX

Can you share some excerpts from your pihole-FTL.log file while things are not responding? Maybe worth enabling debug.all so we can get a complete picture.

Is this a very busy server? Looking at your debug log from discourse it appears to be very high spec'd, so trying to imagine the volume of queries that you have passing through it to cause lockups.

One thing that does happen every minute (by default) is that FTL will store all in-memory queries to the disk database. This shouldn't take long, but I guess this depends on how many queries are going through the system.

I have just tried to reproduce on my own machine with a script throwing some 250q/s at its peak, but no lockups.

PromoFaux avatar Nov 17 '25 21:11 PromoFaux

flt.log

I attached the log file. I reduced the database size from ~8GB to ~4GB and 30 days. It's an improvement, but I still have some seconds without a response.

Di 18. Nov 00:50:56 CET 2025: UDP Port 53 on 192.168.1.100 responding
Di 18. Nov 00:50:57 CET 2025: UDP Port 53 on 192.168.1.100 responding
Di 18. Nov 00:50:59 CET 2025: UDP Port 53 on 192.168.1.100 responding
Di 18. Nov 00:51:02 CET 2025: UDP Port 53 on 192.168.1.100 not responding
Di 18. Nov 00:51:03 CET 2025: UDP Port 53 on 192.168.1.100 responding
Di 18. Nov 00:51:04 CET 2025: UDP Port 53 on 192.168.1.100 responding
Di 18. Nov 00:51:05 CET 2025: UDP Port 53 on 192.168.1.100 responding
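For anyone who wants to reproduce this kind of once-per-second check, here is a rough Python sketch. It is not the script that produced the output above; `build_query`, `dns_responds`, and `monitor` are hypothetical names, and the default query name `pi.hole` is assumed to be answered locally by Pi-hole.

```python
import socket
import struct
import time


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.strip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def dns_responds(server, port=53, name="pi.hole", timeout=1.0):
    """Return True if `server` answers one UDP DNS query within `timeout`."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and ICMP "port unreachable"
            return False
    # A valid reply echoes the 16-bit query ID.
    return reply[:2] == query[:2]


def monitor(server, interval=1.0):
    """Print one status line per probe until interrupted."""
    while True:
        state = "responding" if dns_responds(server) else "not responding"
        print(f"{time.strftime('%c')}: UDP Port 53 on {server} {state}")
        time.sleep(interval)
```

Running `monitor("192.168.1.100")` prints one line per second; correlating the "not responding" timestamps with pihole-FTL.log narrows down what FTL was doing at that moment.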

So check out the log around 00:51:02 and let me know what you think. PS: Pi-hole runs in a Proxmox LXC container, but I don't think this is a big issue. The server is busy, but not THAT busy that it should block DNS, I would say.

DocMAX avatar Nov 17 '25 23:11 DocMAX

Actually @DocMAX - I hadn't realised there was already some work towards fixing this on the branch tweak/dont-lock-on-export (see https://github.com/pi-hole/FTL/pull/2700 for details)

If you are running a native install - you can run pihole checkout ftl tweak/dont-lock-on-export to see if that fixes the issue you are seeing (pihole checkout master will bring you back to the released version)

PromoFaux avatar Nov 19 '25 20:11 PromoFaux

Oh thanks. I hadn't realised either. Glad to see I am not alone. Edit: It's "pihole checkout ftl tweak/dont-lock-on-export" by the way... Edit2: Looks good, no DNS hiccups anymore...

DocMAX avatar Nov 19 '25 21:11 DocMAX

Edit: It's "pihole checkout ftl tweak/dont-lock-on-export" by the way...

Thanks - I was confusing scripts!

PromoFaux avatar Nov 19 '25 21:11 PromoFaux

Edit2: Looks good, no DNS hiccups anymore...

So the AI was wrong (here). The AI response in the Discourse thread was more spot-on:

Holding the SHM lock during disk/db operations is causing the block/high CPU. Split your logic so only in-memory ops are protected. Move disk/db work OUTSIDE the lock.

The issue is indeed with exporting queries onto the disk. The housekeeper is a rather performant part that has been optimized a lot as it is known to be a "critical path".

DL6ER avatar Nov 20 '25 18:11 DL6ER