netavark
Thread leak in netavark-dhcp-proxy
Using SuSE MicroOS with a bunch of containers on macvlan networks, I see netavark-dhcp-proxy hang every few days. From journalctl:
netavark[14606]: thread 'tokio-runtime-worker' panicked at 'failed to spawn thread: Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }', /home/abuild/rpmbuild/BUILD/rustc-1.71.1-src/library/std/src/thread/mod.rs:686:29
Even with RUST_BACKTRACE=1 set, it doesn't give a backtrace. Last time this happened, ps reported over 4000 threads for the PID.
How many macvlan containers are we talking about? Do you know how long your DHCP lease time is?
16 containers at the moment; the lease time is 10 minutes.
OK, I think that explains why it leaks so fast. I think we spawn a new thread for each lease, but the code somehow does not clean up the old one, so the old thread leaks. I'll take a look.
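To make the suspected pattern concrete, here is a minimal sketch of one way to keep a single worker thread per lease, assuming the hypothesis above is right. It is not netavark's code: `LeaseWorkers`, `Worker`, `renew`, and `live_threads` are hypothetical names, and the worker body is a stand-in that only waits for a stop signal. The point is that the old worker is signalled and joined before its replacement is spawned, so the thread count stays flat across renewals.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{self, Sender};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Hypothetical per-lease worker handle: a stop channel plus the join handle.
struct Worker {
    stop: Sender<()>,
    handle: JoinHandle<()>,
}

/// Hypothetical registry mapping a container ID to its single lease worker.
struct LeaseWorkers {
    workers: HashMap<String, Worker>,
    /// Number of worker threads currently running.
    live: Arc<AtomicUsize>,
}

impl LeaseWorkers {
    fn new() -> Self {
        LeaseWorkers {
            workers: HashMap::new(),
            live: Arc::new(AtomicUsize::new(0)),
        }
    }

    /// Start a worker for a new lease, stopping and joining any previous one.
    /// Skipping the stop/join step is what would leak one thread per renewal.
    fn renew(&mut self, id: &str) {
        if let Some(old) = self.workers.remove(id) {
            let _ = old.stop.send(());
            let _ = old.handle.join();
        }
        let (stop, rx) = mpsc::channel::<()>();
        let live = Arc::clone(&self.live);
        live.fetch_add(1, Ordering::SeqCst);
        let handle = thread::spawn(move || {
            // Stand-in for lease handling: block until told to stop
            // (or until the sender is dropped).
            let _ = rx.recv();
            live.fetch_sub(1, Ordering::SeqCst);
        });
        self.workers.insert(id.to_string(), Worker { stop, handle });
    }

    /// Number of worker threads that have started and not yet exited.
    fn live_threads(&self) -> usize {
        self.live.load(Ordering::SeqCst)
    }
}
```

With this shape, calling `renew` repeatedly for the same container leaves exactly one live worker for it, regardless of how short the lease time is.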
Any news?
No, I haven't found the time to reproduce this issue.
I can take a look at this issue. Can someone point me in the right direction to reproduce this?
Use macvlan and a DHCP server with as short a lease as reasonable, e.g. a minute. Observe the number of threads?
Yes, checking ls /proc/$pidOfProxy/task/ over time should show the leak, I guess.
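For anyone who wants to track this over time rather than eyeballing ls output, the check can be sketched as a small helper. This assumes the Linux procfs layout, where each entry under /proc/<pid>/task corresponds to one thread; `count_tasks` and `task_dir_for` are names made up for this example.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Count the entries in a task directory such as /proc/<pid>/task.
/// On Linux, each entry there corresponds to one thread of the process.
fn count_tasks(task_dir: &Path) -> io::Result<usize> {
    let mut n = 0;
    for entry in fs::read_dir(task_dir)? {
        entry?; // propagate per-entry I/O errors instead of ignoring them
        n += 1;
    }
    Ok(n)
}

/// Build the task directory path for a PID (Linux procfs layout).
fn task_dir_for(pid: u32) -> PathBuf {
    Path::new("/proc").join(pid.to_string()).join("task")
}
```

Calling `count_tasks(&task_dir_for(pid))` once a minute and logging the result should show a steadily growing number if the proxy is leaking threads.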
I am now able to reproduce this. I started 10 containers on a network where the lease is only 60 seconds. In my case, the netavark dhcp-proxy PID is 6808 and after a short while:
Threads: 552
Ah, just noticed this issue. Could this be related? My DHCP lease time is 30 mins.
https://github.com/containers/netavark/issues/1024
Thanks!
I definitely have this thread leak: there were 13708 threads for ~15 containers after 3 days of running, and I was also seeing #618 as a symptom (of thread starvation, I assume). I have the underlying pattern (IPv6 multicast on an IPv4 network).
I updated past the fix for that specific symptom and I'm watching how many threads it creates long-term.
My thread leak seems "better, but not totally fixed". I have 1497 threads after 6 days (post #1022) versus the 13708 after 3 days.
Importantly, the dhcp-proxy is not spinning the CPU right now, and my core symptom (restarting containers sometimes caused dhcp task aborts) is gone.