gvisor icon indicating copy to clipboard operation
gvisor copied to clipboard

segfault: buffer.View possibly released twice resulting in nil chunk

Open ignoramous opened this issue 1 year ago • 9 comments

Description

runtime error: invalid memory address or nil pointer dereference
	/home/jitpack/golang/go/src/runtime/panic.go:770 +0x124
gvisor.dev/gvisor/pkg/buffer.(*chunk).DecRef(...)
	/pkg/mod/gvisor.dev/[email protected]/pkg/buffer/chunk.go:106
gvisor.dev/gvisor/pkg/buffer.(*View).Release(0x4009490d20)
	/pkg/mod/gvisor.dev/[email protected]/pkg/buffer/view.go:107 +0x38
gvisor.dev/gvisor/pkg/buffer.(*Buffer).removeView(0x70980e91f0?, 0x4009490d20)
	/pkg/mod/gvisor.dev/[email protected]/pkg/buffer/buffer.go:37 +0x2c
gvisor.dev/gvisor/pkg/buffer.(*Buffer).Release(0x400a5957c8)
	/pkg/mod/gvisor.dev/[email protected]/pkg/buffer/buffer.go:73 +0x2c
gvisor.dev/gvisor/pkg/tcpip/transport/udp.(*endpoint).Read.deferwrap1.(*PacketBuffer).DecRef.1()
	/pkg/mod/gvisor.dev/[email protected]/pkg/tcpip/stack/packet_buffer.go:204 +0x3c
gvisor.dev/gvisor/pkg/tcpip/stack.(*packetBufferRefs).DecRef(0x400842f498?, 0x400842f4c8)
	/pkg/mod/gvisor.dev/[email protected]/pkg/tcpip/stack/packet_buffer_refs.go:133 +0x70
gvisor.dev/gvisor/pkg/tcpip/stack.(*PacketBuffer).DecRef(...)
	/pkg/mod/gvisor.dev/[email protected]/pkg/tcpip/stack/packet_buffer.go:199
gvisor.dev/gvisor/pkg/tcpip/transport/udp.(*endpoint).Read(_, {_, _}, {_, _, _})
	/pkg/mod/gvisor.dev/[email protected]/pkg/tcpip/transport/udp/endpoint.go:299 +0x49c
gvisor.dev/gvisor/pkg/tcpip/adapters/gonet.commonRead({0x4009f36000, 0x10000, 0x10000}, {0x7098661e48, 0x40088be008}, 0x4009198600, 0x400a1958c0, 0x400842fc98, {0x7098650000, 0x40083ece40})
	/pkg/mod/gvisor.dev/[email protected]/pkg/tcpip/adapters/gonet/gonet.go:311 +0xfc
gvisor.dev/gvisor/pkg/tcpip/adapters/gonet.(*UDPConn).ReadFrom(0x40083ece40, {0x4009f36000, 0x10000, 0x10000})
	/pkg/mod/gvisor.dev/[email protected]/pkg/tcpip/adapters/gonet/gonet.go:639 +0x74
gvisor.dev/gvisor/pkg/tcpip/adapters/gonet.(*UDPConn).Read(...)
	/pkg/mod/gvisor.dev/[email protected]/pkg/tcpip/adapters/gonet/gonet.go:630
github.com/celzero/firestack/intra/netstack.(*GUDPConn).Read(0x400a37ce30?, {0x4009f36000?, 0x4dc?, 0x10000?})
	/home/jitpack/build/intra/netstack/udp.go:188 +0x8c
# io.copyBuffer({0x7446d7c278, 0x400a37ce30}, {0x7446d7c298, 0x4008dcaea0}, {0x4009f36000, 0x10000, 0x10000})
#	/home/jitpack/golang/go/src/io/io.go:429 +0x18c
# io.CopyBuffer({0x7446d7c278?, 0x400a37ce30?}, {0x7446d7c298?, 0x4008dcaea0?}, {0x4009f36000?, 0x7098651b40?, 0x7098651b40?})
#	/home/jitpack/golang/go/src/io/io.go:402 +0x38
# github.com/celzero/firestack/intra/core.Pipe({0x7446d7c278, 0x400a37ce30}, {0x7446d7c298, 0x4008dcaea0})
#	/home/jitpack/build/intra/core/cp.go:36 +0x1cc
# github.com/celzero/firestack/intra.upload({0x4009966550, 0x10}, {0x709865baa0, 0x4008dcaea0}, {0x709865baf8, 0x400a37ce30}, 0x40092a20c0)
#	/home/jitpack/build/intra/common.go:39 +0x10c
# created by github.com/celzero/firestack/intra.forward in goroutine 189043
#	/home/jitpack/build/intra/common.go:67 +0xe4

buffer.View.chunk is nil'd in buffer.View.Release(): https://github.com/google/gvisor/blob/8db16e88598170344d74c4aca32b90e58b9f7c43/pkg/buffer/view.go#L108

And then transport.udp.endpoint.go possibly DecRef()s an already released buffer: https://github.com/google/gvisor/blob/8db16e88598170344d74c4aca32b90e58b9f7c43/pkg/tcpip/transport/udp/endpoint.go#L233

~~One possibility is, the buffer was racy released by transport.udp.Close():~~ rcvMu is held, and so this edge case is unlikely.

This crash was reported by our Android app (cgo): https://github.com/celzero/firestack/issues/74

I don't understand this code well enough to propose a fix (at a loss how pkg ref_template even works; for ex, where/how chunkRefs defined or init'd?):

https://github.com/google/gvisor/blob/8db16e88598170344d74c4aca32b90e58b9f7c43/pkg/buffer/chunk.go#L77

Steps to reproduce

with io.Copy(dst, gonet.UDPConn)

runsc version

Android

docker version (if using docker)

No response

uname

No response

kubectl (if using Kubernetes)

No response

repo state (if built from source)

No response

runsc debug logs (if available)

No response

ignoramous avatar Jul 27 '24 11:07 ignoramous

cc: @manninglucas

ignoramous avatar Jul 27 '24 11:07 ignoramous

Looks like a use-after-free to me. This could be coming from inside netstack or from your application which uses reference counted parts of the netstack API. Looking through the netstack code nothing sticks out to me. UDP isn't too complex and that part of the API is fuzzed by syzkaller so I would surprised if there was something obviously broken there. Do you know which net protocol this connection was using (ipv4 vs ipv6)?

Also, your "steps to reproduce" doesn't give me enough information on how to build and reproduce this issue myself. Adding some detail there will make it easier for me to help you debug this issue.

manninglucas avatar Jul 29 '24 15:07 manninglucas

Also, how frequently is this crash happening? Is it occasional or every time you try to run this code?

manninglucas avatar Jul 29 '24 15:07 manninglucas

This could be coming from your application which uses reference counted parts of the netstack API.

Quite possible. The two places we do use the netstack's refcounting APIs comes from code adopted from netstack's fdbased/endpoint.go (loc1) and fdbased/processors.go (loc1, loc2, loc3) for our repo (mostly to support swapping fds to avoid creating a new LinkEndpoint).

Do you know which net protocol this connection was using (ipv4 vs ipv6)

UDP IPv4 (it looks like a QUIC connection requested by 10268 which is Instagram).

udp: b85e6888d0a12280 (proxy? Exit) 192.168.0.144:40058 -> 157.240.23.128:443 for uid 10268

"steps to reproduce"

Apologies. What our Android app does:

  1. Create a (modified) fdbased endpoint with a TUN device (ref).
  2. Sockisfy incoming packets using netstack's (gonet) UDP (ref) & TCP handlers (ref).
  3. Pipe the socksified gonet.TCPConn / gonet.UDPConn to an actual egress (remote) connection to the same destination (upload: io.Copy(egressConn, gonetConn) and download: io.Copy(gonetConn, egressConn)) (ref).

The nil ptr is hit by gonet.UDPConn.Read() called by (upload) io.Copy.

Also, how frequently is this crash happening? Is it occasional or every time you try to run this code?

Rare. Around once a week (uptime).

Like you point out, this crash could totally be due to our app's incorrect use of ref-counting APIs.

ignoramous avatar Jul 30 '24 15:07 ignoramous

Thanks for the extra info. I wasn't able to determine the root cause after looking through both the gVisor UDP code and your code for some time. I will be AFK until next week, but will look at it again when I get back. In the meantime, @kevinGC could you take a look and see if you can find any ref counting issue here?

manninglucas avatar Jul 30 '24 19:07 manninglucas

Took a look and, while I don't have anything definitive, I wonder whether it could be related to the use of goroutines that starts in firestack/intra/netstack/udp.go:udpForwarder. For example, the function passed to NewForwarder:

  • Spawns a goroutine via h.Proxy or h.ProxyMux
  • Proxy spawns another goroutine via core.Go(..., forward, ...)
  • forward spawns another goroutine to call upload
  • ... which does some copying of bytes

I think that the first goroutine spawned here now has a pointer to a PacketBuffer that it never IncRefs or DecRefs, although I'm surprised this doesn't cause a memory leak; when a udp.ForwarderRequest is created in udp.Forwarder.HandlePacket, it calls pkt.IncRef and AFAICT that's never undone via DecRef.

Perhaps try DecRefing the packet after calling CreateEndpoint -- it would at least test whether memory is leaking.

kevinGC avatar Jul 31 '24 23:07 kevinGC

Thank you.

Perhaps try DecRefing the packet after calling CreateEndpoint -- it would at least test whether memory is leaking.

Looks like IncRef was introduced in ~Feb 2023 to resolve https://github.com/google/gvisor/issues/8448#issuecomment-1411148407 I couldn't find a way to DecRef ForwardRequest's PacketBuffer since it isn't exported.

https://github.com/google/gvisor/blob/7058ea8d3067b109e1e7fe4e04f17b667252539f/pkg/tcpip/transport/udp/forwarder.go#L62

iirc, none of the other FOSS projects (ex1, ex2) we looked at (at the time) DecRef'd in TCP/UDP Forwarders. gVisor's tests don't either:

https://github.com/google/gvisor/blob/e89e736f16fad4166393147b4af07a4efa23c74c/pkg/tcpip/adapters/gonet/gonet_test.go#L403-L424

when a udp.ForwarderRequest is created in udp.Forwarder.HandlePacket, it calls pkt.IncRef and AFAICT that's never undone via DecRef.

Could DecRef be defer'd in ForwarderRequest.CreateEndpoint instead (udp.Forwarder provides no other way to process the handled PacketBuffer anyway other than ForwarderRequest.CreateEndpoint)?

ignoramous avatar Aug 23 '24 00:08 ignoramous

Looking now, I may have been wrong in #8458. It should probably be up to the caller of NewForwarder to IncRef the packet. I think I looked at the TCP fowarder and naively copied it, but the TCP forwarder has to IncRef because it starts a new goroutine. That's not true for UDP.

In any case, a leak isn't causing the panic.

kevinGC avatar Aug 26 '24 23:08 kevinGC

@ignoramous #10958 may fix this issue. Let me know if you see any improvement once it's merged.

manninglucas avatar Sep 27 '24 21:09 manninglucas

We haven't seen our latest Android clients report this error after https://github.com/google/gvisor/pull/10958.

Absent reproducible tests, I am going to safely assume that this bug stands mitigated (to the extent that it doesn't happen as frequently as did before across all our user-base).

Thanks a lot.

(feel free to re-open if gvisor team is tracking this).

ignoramous avatar Nov 02 '24 19:11 ignoramous

Looking now, I may have been wrong in https://github.com/google/gvisor/pull/8458. It should probably be up to the caller of NewForwarder to IncRef the packet. I think I looked at the TCP fowarder and naively copied it, but the TCP forwarder has to IncRef because it starts a new goroutine. That's not true for UDP.

If this is indeed a leak, I can create a new issue to track it?

ignoramous avatar Feb 24 '25 16:02 ignoramous

I think it was a race condition, not a leak. It should be fixed by #10958 so no need to create a new issue.

manninglucas avatar Feb 24 '25 17:02 manninglucas

Thanks. There isn't a need by the client code to decref() the clone()?[^0]

[^0]: KevinGC wrote in one of the comments above: "when a udp.ForwarderRequest is created in udp.Forwarder.HandlePacket, it calls pkt.IncRef and AFAICT that's never undone via DecRef. Perhaps try DecRefing the packet after calling CreateEndpoint -- it would at least test whether memory is leaking."

ignoramous avatar Feb 24 '25 18:02 ignoramous

Kevin's hypothesis turned out to be incorrect. There was not any leak in Forwarder.HandlePacket. The clone() is already decref()'d properly in all the places it needs to be. That's what I mean by saying there was not a reference leak. Clone creates a new buffer object that increments the reference count of the underlying Buffer data structure, whereas IncRef just increments the reference count of the PacketBuffer. We need the underlying Buffer to have the proper reference count, otherwise it isn't copied on write.

manninglucas avatar Feb 24 '25 19:02 manninglucas