Thread leak in netavark-dhcp-proxy

Open jsonn opened this issue 2 years ago • 12 comments

Using SuSE MicroOS with a bunch of macvlan-using containers, I see netavark-dhcp-proxy hanging every few days. From journalctl:

netavark[14606]: thread 'tokio-runtime-worker' panicked at 'failed to spawn thread: Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }', /home/abuild/rpmbuild/BUILD/rustc-1.71.1-src/library/std/src/thread/mod.rs:686:29

Even with RUST_BACKTRACE=1 set, it doesn't give a backtrace. Last time this happened, ps reported over 4000 threads for the PID.

jsonn avatar Sep 18 '23 15:09 jsonn

How many macvlan containers are we talking about? Do you know how long your DHCP lease time is?

Luap99 avatar Sep 18 '23 15:09 Luap99

16 containers at the moment; the lease time is 10 minutes.

jsonn avatar Sep 18 '23 15:09 jsonn

OK, I think that explains why it leaks so fast. I think we spawn a new thread for each lease, but somehow the code does not clean up the old one, so we leak the old thread. I'll take a look.
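
To illustrate the suspected pattern, here is a minimal Python sketch. It is not netavark's code (netavark is written in Rust), and the class and method names are hypothetical. It only shows how starting one worker per lease without stopping the previous one makes the thread count grow with every renewal, and how tracking and stopping the old worker keeps it bounded.

```python
import threading


class LeakyLeaseManager:
    """Hypothetical: starts a worker per lease and never stops the old one."""

    def __init__(self):
        # Used only so the demo can release its threads afterwards.
        self.stop_all = threading.Event()

    def on_new_lease(self, container_id):
        # Suspected bug pattern: the previous worker for this container
        # is left running, so every renewal adds one more live thread.
        worker = threading.Thread(target=self.stop_all.wait, daemon=True)
        worker.start()


class FixedLeaseManager:
    """Hypothetical: keeps at most one worker per container."""

    def __init__(self):
        self.workers = {}  # container_id -> (thread, stop_event)

    def on_new_lease(self, container_id):
        old = self.workers.pop(container_id, None)
        if old is not None:
            old_thread, old_stop = old
            old_stop.set()      # ask the old worker to exit
            old_thread.join()   # wait until it is really gone
        stop = threading.Event()
        worker = threading.Thread(target=stop.wait, daemon=True)
        worker.start()
        self.workers[container_id] = (worker, stop)

    def shutdown(self):
        for worker, stop in self.workers.values():
            stop.set()
            worker.join()
        self.workers.clear()
```

With 16 containers and three renewals each, the leaky variant ends up with 48 live workers while the fixed variant stays at 16.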

Luap99 avatar Sep 18 '23 15:09 Luap99

Any news?

jsonn avatar Mar 25 '24 10:03 jsonn

No, I haven't found the time to reproduce this issue.

Luap99 avatar Apr 02 '24 17:04 Luap99

I can take a look at this issue. Can someone point me in the right direction to reproduce this?

Jackbaude avatar May 07 '24 20:05 Jackbaude

Use macvlan and a DHCP server with as short a lease as reasonable, e.g. a minute. Observe the number of threads?

jsonn avatar May 07 '24 20:05 jsonn

Yes, checking ls /proc/$pidOfProxy/task/ over time should show the leak, I guess.
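
A minimal sketch of that check in Python, assuming a Linux host with procfs mounted at /proc. The function names, sampling interval, and sample count are illustrative choices, not part of netavark or podman.

```python
import os
import time


def count_threads(pid):
    """Count the threads of `pid`: one entry per thread in /proc/<pid>/task."""
    return len(os.listdir(f"/proc/{pid}/task"))


def watch_threads(pid, interval_s=60.0, samples=10):
    """Sample the thread count `samples` times, `interval_s` seconds apart."""
    counts = []
    for i in range(samples):
        counts.append(count_threads(pid))
        print(f"{time.strftime('%H:%M:%S')} pid={pid} threads={counts[-1]}")
        if i < samples - 1:
            time.sleep(interval_s)
    return counts
```

Calling watch_threads with the proxy's PID and a steadily rising count across samples would point to the leak; a count that plateaus would not.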

Luap99 avatar May 08 '24 11:05 Luap99

I am now able to replicate this. I started 10 containers on a network where the lease is only 60 seconds. In my case, the netavark dhcp-proxy PID is 6808, and after a short while:

Threads:	552

baude avatar Jun 24 '24 19:06 baude

Ah, just noticed this issue. Could this be related? My DHCP lease time is 30 mins.

https://github.com/containers/netavark/issues/1024

Thanks!

jjzazuet avatar Jul 12 '24 04:07 jjzazuet

I definitely have this thread leak: there were 13708 threads for ~15 containers after 3 days of running, and I was also seeing #618 as a symptom (I assume, of thread starvation). I have the underlying pattern (IPv6 multicast on an IPv4 network).

I updated past the fix for that specific symptom, and I'm watching how many threads it creates long-term.

thecubic avatar Jul 13 '24 19:07 thecubic

My thread leak seems "better, but not totally fixed". I have 1497 threads after 6 days (post #1022) versus the 13708 after 3 days.

Importantly, the dhcp-proxy is not spinning CPU right now, and my core symptom (restarting containers sometimes had dhcp task aborts) is gone.

thecubic avatar Jul 19 '24 18:07 thecubic