High CPU usage - Socket never closed
Hello, it seems that the production version used in Rethink DNS (ee0a5ac71f) has an issue with sockets not being closed. After a day of use, the CPU usage reaches 50% even though no traffic is being generated. When we inspect the running processes, this usage appears to originate from the Go code in Firestack.
When we release the library, a large number (100+) of socket-closure logs are generated:
onSocketClosed: SocketSummary{Proto:udp,ID:13d815e967215ae5,PID:Base,UID:-1,Target:192.0.2.255,Rx:0,Tx:983224,Duration:5979,Rtt:0,Msg:read udp 192.0.2.1:53449->192.0.2.255:32761: use of closed network connection,Dup:true,}
The library is configured as follows in our project:
addDnsTransport(it, dnsServer)
Intra.setSystemDNS(it, systemDns.joinToString(","))
it.setRoute(InternetProtocol.IPv46.value())
it.setTunMode(
    Settings.DNSModePort,
    Settings.BlockModeFilter,
    Settings.PtModeAuto
)
it.resolver.translate(false)
Are we supposed to monitor and close some sockets ourselves? Is it a bug in the library? Thanks in advance for your support.
we release the library, a large number (100+) of socket-closure logs are generated
Netstack had a bug with forever-open UDP sockets; the fix was upstreamed and synced back to firestack (v055o; the next Rethink version will also carry this fix):
- https://github.com/celzero/firestack/issues/98
Unsure if ee0a5ac contains the tunnel.CloseConns("") API; empty string will close all open (TCP/UDP) conns.
There's a bug we haven't been able to fix yet involving forever-open TCP conns (not UDP, and only over WireGuard). Or it could be a bug in our connection tracker, which may be tracking an already-closed TCP connection.
After a day of use, the CPU usage reaches 50% even though no traffic is being generated
Strange. We never noticed this, but we do occasionally get reports from users complaining about battery drain.
If you can, please see if the CPU use goes away on the current tip (there's been some 600+ commits since).
Also, on the current tip, from client code, you can:
- intra.PrintStack() to dump all active goroutine stack traces to stdout / logcat.
- intra.Tunnel.Stat() to get back a full list of open sockets, their durations, and other metadata about the tunnel and the emulated tun device (GC stats, network stats, etc.) as a Java/Kotlin object (its .toString() will recursively print all its fields if called over gomobile-generated JNI/FFI) or a Go struct.
Are we supposed to monitor and close some sockets ourselves?
Not really.
Is it a bug in the library?
Possibly.
Thanks for your answer @ignoramous. I'm trying to upgrade the library to the latest version available to see if the leak is resolved, but I run into the following issue when a DNS request is made:
ipmap.go:369: W ipmap: Add: err resolving 1dot1dot1dot1.cloudflare-dns.com: lookup 1dot1dot1dot1.cloudflare-dns.com on localhost: no resolver
ipmap.go:213: W ipmap: Get: zero ips for 1dot1dot1dot1.cloudflare-dns.com:853
dot.go:300: W dot: (Preferred) no proxy for 1dot1dot1dot1.cloudflare-dns.com:853; choosing Base among [Base]
ipmap.go:369: W ipmap: Add: err resolving 1dot1dot1dot1.cloudflare-dns.com: lookup 1dot1dot1dot1.cloudflare-dns.com on localhost: no resolver
ipmap.go:213: W ipmap: Get: zero ips for 1dot1dot1dot1.cloudflare-dns.com
ipmap.go:369: W ipmap: Add: err resolving 1dot1dot1dot1.cloudflare-dns.com: lookup 1dot1dot1dot1.cloudflare-dns.com on localhost: no resolver
ipmap.go:203: W ipmap: Add: zero ips for 1dot1dot1dot1.cloudflare-dns.com
dot.go:188: W dot: tlsdial: (Preferred) nil conn/err for 1dot1dot1dot1.cloudflare-dns.com:853, ech? false; err? unknown network no ips
dot.go:326: W dot: ans? ;; opcode: QUERY, status: SERVFAIL, id: 49375
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;clients4.google.com. IN A
err(unknown network no ips) / ans(0)
cacher.go:370: W cache: barrier: hangover(k: clients4.google.com:1); discard ans (has? true)
My implementation is the same as before with https://github.com/celzero/firestack/commit/ee0a5ac71fd80625749268d0642ea2a8b1ff3194. Can I find an example anywhere to compare what I'm supposed to change to use a DoT/DoH server now? Thank you
@ignoramous
onSocketClosed: SocketSummary{Proto:udp,ID:13d815e967215ae5,PID:Base,UID:-1,Target:192.0.2.255,Rx:0,Tx:983224,Duration:5979,Rtt:0,Msg:read udp 192.0.2.1:53449->192.0.2.255:32761: use of closed network connection,Dup:true,}
Does firestack close a pair of net.Conn in one goroutine and guarantee that these two net.Conn are not closed until another forwarding goroutine stops?
lookup 1dot1dot1dot1.cloudflare-dns.com on localhost: no resolver
firestack needs a bootstrap DNS server if you don't pass IPs to AddDoTTransport()
lookup 1dot1dot1dot1.cloudflare-dns.com on localhost: no resolver
firestack needs a bootstrap DNS server if you don't pass IPs to AddDoTTransport()
How am I supposed to configure the bootstrap DNS server? I thought that was the point of calling Intra.setSystemDNS, but apparently it isn't anymore...
@ignoramous
onSocketClosed: SocketSummary{Proto:udp,ID:13d815e967215ae5,PID:Base,UID:-1,Target:192.0.2.255,Rx:0,Tx:983224,Duration:5979,Rtt:0,Msg:read udp 192.0.2.1:53449->192.0.2.255:32761: use of closed network connection,Dup:true,}
Does firestack close a pair of net.Conn in one goroutine and guarantee that these two net.Conn are not closed until another forwarding goroutine stops?
I'm not sure how to answer you here. All I have are the following logs and a CPU trace: logcat.txt, cpu-simpleperf-20241216T104527.trace.zip
Does firestack close a pair of net.Conn in one goroutine and guarantee that these two net.Conn are not closed until another forwarding goroutine stops
I think so. conntracker.Untrack should close the pair after both upload+download are done.
https://github.com/celzero/firestack/blob/dc54c93937dff172db60a2dfffb83f375b628835/intra/common.go#L229-L237
Hm, I think Go requires clients to set read/write deadlines on UDP sockets. Maybe that's keeping the conns up?
Create a new default (bootstrap) DNS: https://github.com/celzero/firestack/blob/dc54c93937dff172db60a2dfffb83f375b628835/intra/bootstrap.go#L63-L66
// dns53
intra.NewDefaultDNS(backend.DNS53, "1.1.1.2,8.8.4.4,9.9.9.10", "")
// doh
intra.NewDefaultDNS(backend.DOH, "https://dns.cloudflare-dns.com/dns-query", "1.0.0.1,1.1.1.1")
Pass its ref to intra.Connect:
https://github.com/celzero/firestack/blob/dc54c93937dff172db60a2dfffb83f375b628835/intra/tun2socks.go#L69
We aren't really paying much attention to the APIs, tbh. So, if you want us to simplify/stabilize them or document them extensively, let us know!
Hm, I think Go requires clients to set read/write deadlines on UDP sockets. Maybe that's keeping the conns up?
Deadlines are used to "generate" timeout errors to prevent infinite loops.
Does firestack close a pair of net.Conn in one goroutine and guarantee that these two net.Conn are not closed until another forwarding goroutine stops
I think so. conntracker.Untrack should close the pair after both upload+download are done.
It seems not. core.CloseOp() may close a net.UDPConn immediately.
https://github.com/celzero/firestack/blob/dc54c93937dff172db60a2dfffb83f375b628835/intra/core/closer.go#L147
core.CloseOp() may close a net.UDPConn immediately.
For full duplex TCP, it closes read on one end, write on the other. No way to do so for UDP?
Deadlines are used to "generate" timeout errors to prevent infinite loops
True. I believe io.CopyBuffer, which is used for upload+download, loops infinitely until an error occurs?
https://github.com/celzero/firestack/blob/dc54c93937dff172db60a2dfffb83f375b628835/intra/core/cp.go#L38
CPU usage seems to be under control with the latest version. I will keep an eye on it and let you know.
Deadlines are used to "generate" timeout errors to prevent infinite loops
True. I believe io.CopyBuffer, which is used for upload+download, loops infinitely until an error occurs? https://github.com/celzero/firestack/blob/dc54c93937dff172db60a2dfffb83f375b628835/intra/core/cp.go#L38
Yes, but rwext, one of the net.Conns used by firestack, has a special Read():
https://github.com/golang/go/blob/go1.23.4/src/io/io.go#L398-L456
https://github.com/celzero/firestack/blob/dc54c93937dff172db60a2dfffb83f375b628835/intra/udp.go#L76
core.CloseOp() may close a net.UDPConn immediately.
For full duplex TCP, it closes read on one end, write on the other. No way to do so for UDP?
In my gvisor-playground, one of two goroutines (there may be more goroutines in some cases) checks a shared variable in addition to setting deadlines, and another goroutine stores "the time of the last WriteTo()" in this shared variable.
Per https://github.com/google/gvisor/issues/10023, gonet.Conns do not error on writes (empty or otherwise) even if closed.
In my gvisor-playground, one of two goroutines (there may be more goroutines in some cases) checks a shared variable in addition to setting deadlines, and another goroutine stores "the time of the last WriteTo()" in this shared variable.
Interesting. Would such an impl (essentially, half-duplex?) make #77 work better?
Impl an "idle timeout" to let client code force-close tcp & udp conns after a preset timeout via DialerOpts in pkg settings.
https://github.com/celzero/firestack/blob/1497cdabc283cc632d084a49d8d89749f28f1025/intra/settings/dialeropts.go#L105
Which commit of hussainmohd-a/rethink-app can be built with the latest celzero/firestack? Rethink on my phone is outdated.
Interesting. Would such an impl (essentially, half-duplex?) make https://github.com/celzero/firestack/issues/77 work better?
I don't know, but the current firestack's Endpoint-Independent Mapping mode needs optimization. I tested rethink and my gvisor-playground with quic-go/perf; rethink's handling of UDP (single flow) seems slower. https://github.com/quic-go/perf
Commands:
quic-go-perf --server-address=<server ip>:<server port> --upload-bytes=1G
quic-go-perf --server-address=<server ip>:<server port> --download-bytes=1G
("upload" means from emulator to host)
- rethink 63868e7 + firestack aa87903, Endpoint-Independent Mapping disabled: upload 91.1 MiB/s, download 89.7 MiB/s
- rethink 63868e7 + firestack aa87903, Endpoint-Independent Mapping enabled: upload 67.3 MiB/s, download 60.8 MiB/s
- Lanius-collaris/gvisor-playground 7dce2ae: upload 95.6 MiB/s, download 89.8 MiB/s
Yes, but rwext, one of the net.Conns used by firestack, has a special Read():
https://github.com/celzero/firestack/commit/c5227f86394b57c06e7d2c73235d07499d9fb1ae
It seems not. core.CloseOp() may close a net.UDPConn immediately.
https://github.com/celzero/firestack/commit/1f31ed7cb94d86211f94790768b1e292d118282f
(Closing: egressing connections are now better tracked and closed, and APIs to close connections and to view open connections have been implemented, too. Feel free to re-open in case the originally reported issue isn't fixed.)