firestack icon indicating copy to clipboard operation
firestack copied to clipboard

Hight CPU usage - Socket never closed

Open paillardf opened this issue 1 year ago • 14 comments

Hello, It seems that the production version used in Rethink DNS (ee0a5ac71f) has an issue with sockets not being closed. After a day of use, the CPU usage reaches 50% even though no traffic is being generated. When we extract the processes, it appears that this usage originates from the GO code in Firestack.

If we release the library, a large number (+100) of socket closure logs are generated: onSocketClosed: SocketSummary{Proto:udp,ID:13d815e967215ae5,PID:Base,UID:-1,Target:192.0.2.255,Rx:0,Tx:983224,Duration:5979,Rtt:0,Msg:read udp 192.0.2.1:53449->192.0.2.255:32761: use of closed network connection,Dup:true,}

The library is configured as follows in our project:

addDnsTransport(it, dnsServer)  
Intra.setSystemDNS(it, systemDns.joinToString(","))  
it.setRoute(InternetProtocol.IPv46.value())  
it.setTunMode(  
    Settings.DNSModePort,  
    Settings.BlockModeFilter,  
    Settings.PtModeAuto  
)  
it.resolver.translate(false)  

Are we supposed to monitor and close some socket ourself? is it a bug in the library? Thanks in advance for your support.

paillardf avatar Dec 18 '24 12:12 paillardf

Thanks.

we release the library, a large number (+100) of socket closure logs are generated

Netstack had a bug re: forever UDP sockets which we upstreamed & sync'd back to firestack (v055o, the next Rethink version will also carry this fix):

  • https://github.com/celzero/firestack/issues/98

Unsure if ee0a5ac contains the tunnel.CloseConns("") API; empty string will close all open (TCP/UDP) conns.

There's a bug which we haven't been able to fix but it involves forever open TCP conns (and not UDP but over WireGuard). Or, it could also be a bug in our connection tracker that it may be tracking an already closed TCP connection.

After a day of use, the CPU usage reaches 50% even though no traffic is being generated

Strange. We never noticed this, but we do occassionally get reports from users complaining about battery drains.

If you can, please see if the CPU use goes away on the current tip (there's been some 600+ commits since).

Also, on the current tip, from client code, you can:

  1. intra.PrintStack() to dump all active goroutine stacktraces to stdout / logcat.
  2. intra.Tunnel.Stat() and get back a full list of open sockets, their duration, and other metadata about the tunnel & the emulated tun device, GC stats, network stats etc as a Java/Kotlin object (its .toString() will recursively print all its fields if called over gomobile generated JNI/FFI) or a Go struct.

Are we supposed to monitor and close some socket ourself?

Not really.

is it a bug in the library?

Possibly.

ignoramous avatar Dec 26 '24 06:12 ignoramous

Thanks for your answer @ignoramous I'm trying to upgrade the library to the latest version available to see if the leak is resolved but I run into the following issue when a DNS request is made:

ipmap.go:369: W ipmap: Add: err resolving 1dot1dot1dot1.cloudflare-dns.com: lookup 1dot1dot1dot1.cloudflare-dns.com on localhost: no resolver
ipmap.go:213: W ipmap: Get: zero ips for 1dot1dot1dot1.cloudflare-dns.com:853
dot.go:300: W dot: (Preferred) no proxy for 1dot1dot1dot1.cloudflare-dns.com:853; choosing Base among [Base]
ipmap.go:369: W ipmap: Add: err resolving 1dot1dot1dot1.cloudflare-dns.com: lookup 1dot1dot1dot1.cloudflare-dns.com on localhost: no resolver
ipmap.go:213: W ipmap: Get: zero ips for 1dot1dot1dot1.cloudflare-dns.com
ipmap.go:369: W ipmap: Add: err resolving 1dot1dot1dot1.cloudflare-dns.com: lookup 1dot1dot1dot1.cloudflare-dns.com on localhost: no resolver
ipmap.go:203: W ipmap: Add: zero ips for 1dot1dot1dot1.cloudflare-dns.com
dot.go:188: W dot: tlsdial: (Preferred) nil conn/err for 1dot1dot1dot1.cloudflare-dns.com:853, ech? false; err? unknown network no ips
dot.go:326: W dot: ans? ;; opcode: QUERY, status: SERVFAIL, id: 49375
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;clients4.google.com.	IN	 A
 err(unknown network no ips) / ans(0)
cacher.go:370: W cache: barrier: hangover(k: clients4.google.com:1); discard ans (has? true)

My implementation is the same as before with https://github.com/celzero/firestack/commit/ee0a5ac71fd80625749268d0642ea2a8b1ff3194. Can I find an example anywhere to compare what I'm supposed to change to use a DoT/DoH server now? Thank you

paillardf avatar Dec 31 '24 15:12 paillardf

@ignoramous

onSocketClosed: SocketSummary{Proto:udp,ID:13d815e967215ae5,PID:Base,UID:-1,Target:192.0.2.255,Rx:0,Tx:983224,Duration:5979,Rtt:0,Msg:read udp 192.0.2.1:53449->192.0.2.255:32761: use of closed network connection,Dup:true,}

Does firestack close a pair of net.Conn in one goroutine and guarantee that these two net.Conn are not closed until another forwarding goroutine stops?

Lanius-collaris avatar Dec 31 '24 20:12 Lanius-collaris

lookup 1dot1dot1dot1.cloudflare-dns.com on localhost: no resolver

firestack needs a bootstrap DNS server if you don't pass IPs to AddDoTTransport()

Lanius-collaris avatar Dec 31 '24 20:12 Lanius-collaris

lookup 1dot1dot1dot1.cloudflare-dns.com on localhost: no resolver

firestack needs a bootstrap DNS server if you don't pass IPs to AddDoTTransport()

How am I supposed to configure the bootstrap DNS server ? To me it was the point of calling Intra.setSystemDNS but it's not anymore...

paillardf avatar Jan 02 '25 09:01 paillardf

@ignoramous

onSocketClosed: SocketSummary{Proto:udp,ID:13d815e967215ae5,PID:Base,UID:-1,Target:192.0.2.255,Rx:0,Tx:983224,Duration:5979,Rtt:0,Msg:read udp 192.0.2.1:53449->192.0.2.255:32761: use of closed network connection,Dup:true,}

Does firestack close a pair of net.Conn in one goroutine and guarantee that these two net.Conn are not closed until another forwarding goroutine stops?

I'm not sure how to answer you here. All I have is the following logs and a CPU trace. logcat.txt cpu-simpleperf-20241216T104527.trace.zip

paillardf avatar Jan 02 '25 09:01 paillardf

Does firestack close a pair of net.Conn in one goroutine and guarantee that these two net.Conn are not closed until another forwarding goroutine stops

I think so. conntracker.Untrack should close the pair after both upload+download are done.

https://github.com/celzero/firestack/blob/dc54c93937dff172db60a2dfffb83f375b628835/intra/common.go#L229-L237

Hm, I think golang requires clients to set rw deadlines on UDP sockets. May be that's keeping the conns up?


Create a new default (bootstrap) DNS: https://github.com/celzero/firestack/blob/dc54c93937dff172db60a2dfffb83f375b628835/intra/bootstrap.go#L63-L66

//dns53
intra.NewDefaultDNS(backend.DNS53, "1.1.1.2,8.8.4.4,9.9.9.10", "")

// doh
intra.NewDefaultDNS(backend.DOH, "https://dns.cloudflare-dns.com/dns-query", "1.0.0.1,1.1.1.1")

Pass its ref to intra.Connect:

https://github.com/celzero/firestack/blob/dc54c93937dff172db60a2dfffb83f375b628835/intra/tun2socks.go#L69


We aren't really paying attention to the APIs tbh. And so, if you want us to simplify/stabilize it or extensively document them, let us know!

ignoramous avatar Jan 02 '25 21:01 ignoramous

Hm, I think golang requires clients to set rw deadlines on UDP sockets. May be that's keeping the conns up?

Deadlines are used to "generate" timeout errors to prevent infinite loops.

Does firestack close a pair of net.Conn in one goroutine and guarantee that these two net.Conn are not closed until another forwarding goroutine stops

I think so. conntracker.Untrack should close the pair after both upload+download are done.

It seems not. core.CloseOp() may close a net.UDPConn immediately. https://github.com/celzero/firestack/blob/dc54c93937dff172db60a2dfffb83f375b628835/intra/core/closer.go#L147

Lanius-collaris avatar Jan 02 '25 22:01 Lanius-collaris

core.CloseOp() may close a net.UDPConn immediately.

For full duplex TCP, it closes read on one end, write on the other. No way to do so for UDP?

Deadlines are used to "generate" timeout errors to prevent infinite loops

True. Believe io.CopyBuffer that's used for upload+download is infinitely looping until error?

https://github.com/celzero/firestack/blob/dc54c93937dff172db60a2dfffb83f375b628835/intra/core/cp.go#L38

ignoramous avatar Jan 03 '25 08:01 ignoramous

CPU usage seems to be under control with the latest version. I will keep a eye on it and let you know.

paillardf avatar Jan 03 '25 13:01 paillardf

Deadlines are used to "generate" timeout errors to prevent infinite loops

True. Believe io.CopyBuffer that's used for upload+download is infinitely looping until error?

https://github.com/celzero/firestack/blob/dc54c93937dff172db60a2dfffb83f375b628835/intra/core/cp.go#L38

Yes, but rwext, one of net.Conn used by firestack has a special Read() : https://github.com/golang/go/blob/go1.23.4/src/io/io.go#L398-L456 https://github.com/celzero/firestack/blob/dc54c93937dff172db60a2dfffb83f375b628835/intra/udp.go#L76

core.CloseOp() may close a net.UDPConn immediately.

For full duplex TCP, it closes read on one end, write on the other. No way to do so for UDP?

In my gvisor-playground, one of two goroutines ( there may be more goroutines in some cases ) checks a shared variable in addition to setting deadlines, and another goroutine stores "the time of last WriteTo()" in this shared variable.

Lanius-collaris avatar Jan 03 '25 15:01 Lanius-collaris

Per https://github.com/google/gvisor/issues/10023, gonet.Conns do not error on writes (empty or otherwise) even if closed.

In my gvisor-playground, one of two goroutines ( there may be more goroutines in some cases ) checks a shared variable in addition to setting deadlines, and another goroutine stores "the time of last WriteTo()" in this shared variable.

Interesting. Would such an impl (essentially, half-duplex?) make #77 work better?

ignoramous avatar Feb 06 '25 11:02 ignoramous

Impl an "idle timeout" to let client code force-close tcp & udp conns after a preset timeout via DialerOpts in pkg settings.

https://github.com/celzero/firestack/blob/1497cdabc283cc632d084a49d8d89749f28f1025/intra/settings/dialeropts.go#L105

ignoramous avatar Feb 07 '25 17:02 ignoramous

Which commit of hussainmohd-a/rethink-app can be built with the latest celzero/firestack? rethink on my phone is outdated.

Interesting. Would such an impl (essentially, half-duplex?) make https://github.com/celzero/firestack/issues/77 work better?

I don't know, but current firestack's Endpoint-Independent Mapping mode need optimization. I tested rethink and my gvisor-playground with quic-go/perf, rethink's handling of UDP ( single flow ) seems slower. https://github.com/quic-go/perf

Commands:

quic-go-perf --server-address=<server ip>:<server port> --upload-bytes=1G
quic-go-perf --server-address=<server ip>:<server port> --download-bytes=1G

( "upload" means from emulator to host )

rethink 63868e7 + firestack aa87903 + Endpoint-Independent Mapping disabled upload: 91.1 MiB/s download: 89.7 MiB/s

rethink 63868e7 + firestack aa87903 + Endpoint-Independent Mapping enabled upload: 67.3 MiB/s download: 60.8 MiB/s

Lanius-collaris/gvisor-playground 7dce2ae upload: 95.6 MiB/s download: 89.8 MiB/s

Lanius-collaris avatar Feb 08 '25 01:02 Lanius-collaris

Yes, but rwext, one of net.Conn used by firestack has a special Read() :

https://github.com/celzero/firestack/commit/c5227f86394b57c06e7d2c73235d07499d9fb1ae

It seems not. core.CloseOp() may close a net.UDPConn immediately.

https://github.com/celzero/firestack/commit/1f31ed7cb94d86211f94790768b1e292d118282f


(closing... egressing connections are now better tracked and closed + APIs to close connections and to view open connections have been implemented, too; feel free to re-open in case the originally reported issue isn't fixed)

ignoramous avatar Sep 03 '25 03:09 ignoramous