gvisor-tap-vsock
Memory leaks
Hi,
I noticed a serious memory leak while using the latest gvisor-tap-vsock master. My host ran out of memory after 2 minutes of iperf runs. I used pprof to root-cause the leak and found two issues:
- The UDP forwarder does not close its goroutines after the connection is closed, e.g. run iperf in a loop (with UDP) and the number of goroutines keeps growing.
Steps to reproduce
On Host:
iperf3 -s
On VM:
while true; do iperf3 -c 192.168.127.254 -t 1 -u; done
Open the pprof view and watch the number of goroutines; it keeps growing (a sketch for observing this follows the description below).
- The memory used for the packet slice leaks.
Steps to reproduce
On Host:
iperf3 -s
On VM:
iperf3 -c 192.168.127.254 -t 180
On another host terminal, run "htop" and monitor the memory usage.
The output from pprof indicates that the leak is coming from https://github.com/containers/gvisor-tap-vsock/blob/main/pkg/tap/switch.go#L264.
Stats from the profile:
File: gvproxy
Type: inuse_space
Time: Mar 2, 2022 at 8:44pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 459.35MB, 99.65% of 460.94MB total
Dropped 15 nodes (cum <= 2.30MB)
Showing top 10 nodes out of 38
      flat  flat%   sum%        cum   cum%
  299.94MB 65.07% 65.07%   409.03MB 88.74%  github.com/containers/gvisor-tap-vsock/pkg/tap.(*Switch).rxStream
  152.13MB 33.00% 98.07%   152.13MB 33.00%  gvisor.dev/gvisor/pkg/tcpip/stack.glob..func1
    4.78MB  1.04% 99.11%     4.78MB  1.04%  gvisor.dev/gvisor/pkg/refsvfs2.Register
    2.50MB  0.54% 99.65%     2.50MB  0.54%  runtime.allocm
         0     0% 99.65%   409.03MB 88.74%  github.com/containers/gvisor-tap-vsock/pkg/tap.(*Switch).Accept
         0     0% 99.65%   409.03MB 88.74%  github.com/containers/gvisor-tap-vsock/pkg/tap.(*Switch).rx
         0     0% 99.65%   109.09MB 23.67%  github.com/containers/gvisor-tap-vsock/pkg/tap.(*Switch).rxBuf
         0     0% 99.65%   409.03MB 88.74%  github.com/containers/gvisor-tap-vsock/pkg/virtualnetwork.(*VirtualNetwork).Mux.func4
         0     0% 99.65%        5MB  1.09%  github.com/google/tcpproxy.proxyCopy
         0     0% 99.65%        5MB  1.09%  gvisor.dev/gvisor/pkg/tcpip/adapters/gonet.(*TCPConn).Read
Any idea how to fix the leak? I have not looked deeply into gvisor yet to figure it out; any pointers are appreciated.
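The goroutine growth is easy to observe from outside the process. Below is a minimal sketch, assuming gvproxy exposes (or is patched to expose) the standard net/http/pprof handlers; the localhost:6060 address and the polling loop are illustrative, not gvproxy's actual debug setup.

package main

import (
	"fmt"
	"io"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers
	"time"
)

func main() {
	// Expose the pprof handlers (done manually here for illustration).
	go func() {
		_ = http.ListenAndServe("localhost:6060", nil)
	}()

	// Poll the goroutine profile; its first line reports the total count,
	// e.g. "goroutine profile: total 42".
	for {
		resp, err := http.Get("http://localhost:6060/debug/pprof/goroutine?debug=1")
		if err == nil {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			fmt.Printf("%.60s\n", body)
		}
		time.Sleep(5 * time.Second)
	}
}

While the iperf loop runs, the total climbs steadily; once the load stops, it should settle back down only if the forwarder actually releases its goroutines.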
- The UDP forwarder uses connection tracking. It keeps the forwarding open for 90s. If you run your loop and then wait 90s, you should see the goroutine count come back to normal (this is what I see on my computer at least).
- Yes, there is something here. Maybe we can try to use a pool of buffers like they do in stack.NewPacketBuffer.
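For reference, here is a minimal sketch of the pooled-buffer idea using Go's standard sync.Pool; the buffer size and function names are hypothetical, not the actual gvisor API. The subtlety on the rx path is that a buffer may only be returned to the pool once nothing downstream still references the slice.

package main

import (
	"fmt"
	"sync"
)

const bufSize = 65536 // illustrative: large enough for one frame

var bufPool = sync.Pool{
	New: func() any {
		b := make([]byte, bufSize)
		return &b
	},
}

// rxOne reads and processes a single packet using a pooled buffer
// instead of allocating a fresh slice per packet.
func rxOne(read func([]byte) int) {
	bp := bufPool.Get().(*[]byte)
	n := read(*bp)
	process((*bp)[:n])
	// Safe only because process() does not retain the slice.
	bufPool.Put(bp)
}

func process(pkt []byte) { fmt.Println("got", len(pkt), "bytes") }

func main() {
	rxOne(func(b []byte) int { return copy(b, "hello") })
}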
- Ah, I didn't realize there was a timeout. You are right that after 90s the goroutines disappear.
- I have tried this patch, which basically decrements the reference count after queuing the packet. I see that in the gvisor code, after allocating a new packet, they decrement the ref, so I tried the same. With this I am no longer seeing the leak. Does this make sense? I can submit a patch.
diff --git a/pkg/tap/switch.go b/pkg/tap/switch.go
index 39821ac..dbcd040 100644
--- a/pkg/tap/switch.go
+++ b/pkg/tap/switch.go
@@ -294,14 +294,16 @@ func (e *Switch) rxBuf(ctx context.Context, id int, buf []byte) {
}
if eth.DestinationAddress() == e.gateway.LinkAddress() || eth.DestinationAddress() == header.EthernetBroadcastAddress {
vv.TrimFront(header.EthernetMinimumSize)
+ pkt := stack.NewPacketBuffer(stack.PacketBufferOptions{
+ Data: vv,
+ })
e.gateway.DeliverNetworkPacket(
eth.SourceAddress(),
eth.DestinationAddress(),
eth.Type(),
- stack.NewPacketBuffer(stack.PacketBufferOptions{
- Data: vv,
- }),
+ pkt,
)
+ pkt.DecRef()
}
atomic.AddUint64(&e.Received, uint64(len(buf)))
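For anyone reading along, the patch works because of the ownership contract around packet buffers: the creator owns one reference and must drop it, while the delivery path takes and releases its own references as needed. Here is a self-contained sketch of that contract with hypothetical stand-in types, not gvisor's actual stack.PacketBuffer implementation.

package main

import (
	"fmt"
	"sync/atomic"
)

type packet struct {
	refs int32
	data []byte
}

// newPacket returns a packet whose single reference is owned by the caller.
func newPacket(data []byte) *packet { return &packet{refs: 1, data: data} }

func (p *packet) IncRef() { atomic.AddInt32(&p.refs, 1) }

func (p *packet) DecRef() {
	if atomic.AddInt32(&p.refs, -1) == 0 {
		p.data = nil // last owner gone: release the backing memory
		fmt.Println("packet freed")
	}
}

// deliver stands in for DeliverNetworkPacket: it holds its own
// reference for exactly as long as it needs the packet.
func deliver(p *packet) {
	p.IncRef()
	defer p.DecRef()
	fmt.Println("delivering", len(p.data), "bytes")
}

func main() {
	p := newPacket([]byte{1, 2, 3})
	deliver(p)
	p.DecRef() // omit this, and refs never reaches 0: the leak in the issue
}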
The UDP forwarder uses connection tracking. It keeps the forwarding open for 90s. If you run your loop and then wait 90s, you should see the goroutine count come back to normal (this is what I see on my computer at least).
Is the connection tracker protocol aware? It is common, for example, for UDP syslog messages to be sent at a constant high velocity. If the code is waiting for responses (which never arrive with syslog), it will create new goroutines faster than the old ones can time out.
@codomania it looks good, thanks!
@dimalinux I took the same code as dockerd, I bet it is ok. You can take a look at the code: https://github.com/containers/gvisor-tap-vsock/blob/main/pkg/services/forwarder/udp_proxy.go (taken from https://github.com/moby/moby/blob/master/cmd/docker-proxy/udp_proxy.go)
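For context, here is a minimal sketch of that connection-tracking pattern (modeled on the moby udp_proxy.go approach, with hypothetical names rather than the actual gvisor-tap-vsock symbols): each tracked flow gets one goroutine whose read deadline is refreshed before every read, so the goroutine exits only after 90s of silence.

package main

import (
	"net"
	"time"
)

const udpConnTrackTimeout = 90 * time.Second

// replyLoop forwards replies from the backend to the client until the
// flow has been idle for udpConnTrackTimeout.
func replyLoop(backend *net.UDPConn, client net.Addr, front net.PacketConn) {
	defer backend.Close()
	buf := make([]byte, 65536)
	for {
		// Refresh the idle deadline before every read; when no reply
		// arrives in time, the read fails and the goroutine exits.
		_ = backend.SetReadDeadline(time.Now().Add(udpConnTrackTimeout))
		n, err := backend.Read(buf)
		if err != nil {
			return // timeout or closed: the tracked flow is dropped
		}
		if _, err := front.WriteTo(buf[:n], client); err != nil {
			return
		}
	}
}

func main() {
	// Wiring this up needs a front-side listener and a per-flow map keyed
	// by client address; the sketch only shows one tracked flow's lifetime.
}

Regarding the one-way-traffic concern: in the moby code this follows, new datagrams from an already-tracked client reuse the existing backend connection, so a steady syslog stream should create one goroutine per flow, not one per packet.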
This fix appears to be working and takes the VM from several gigabytes of memory to less than 20MB. Is there anything preventing this fix from being applied, so that people don't have to patch the source manually?
I just had to kill off gvproxy; it was consuming 30GB of RAM on my M1 Mac.
@sc68cal Me too: it consumes 38GB of real memory on my M1 Ultra, and around 50GB of total memory.
I think the memory leak still exists
@gam-phon can you open a new issue with more details about your usage of gvisor-tap-vsock?