grpc-go
Add doc about `mem.*Unsafe` consuming the input buffer
I've been reading the new buffer-pool-related code due to recent OOM problems with other libraries misusing sync.Pool, and the varying behavior between Buffer and SliceBuffer had me confused for quite a while.
Buffer mutates itself when read, and implicitly calls Free when fully consumed:
https://github.com/grpc/grpc-go/blob/5edab9e55414068e74320716117a2659c5d2174e/mem/buffers.go#L194-L200
While SliceBuffer returns a sub-slice but does not modify itself: https://github.com/grpc/grpc-go/blob/5edab9e55414068e74320716117a2659c5d2174e/mem/buffers.go#L263-L267
The only way these *Unsafe funcs can be used correctly is by replacing the input-buffer with the result-buffer(s), as the current code does: https://github.com/grpc/grpc-go/blob/5edab9e55414068e74320716117a2659c5d2174e/internal/transport/transport.go#L139
... which seems worth documenting since it feels like a huge footgun otherwise: you only trigger the self-mutating-and-freeing behavior for relatively large data, which is less common in tests.
Granted, this isn't a technical barrier at all, but the behavior differences between implementations misled me for a while, and this "must not use the input Buffer" requirement doesn't seem obviously necessary when reading the code. Might as well save future people that effort.
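Concretely, the only safe call shape looks something like this (signatures as in buffers.go; the comments are my own reading, not anything the docs currently promise):
n, buf = mem.ReadUnsafe(dst, buf) // result replaces the input; the old buf must not be used or Freed again
left, right := mem.SplitUnsafe(buf, n) // same deal: buf is consumed, only left/right are valid from here on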
On a related note, Buffer feels... a bit concerningly strange in regards to concurrency, and I didn't see it called out in the PR that introduced it so I think it might not have been noticed.
E.g. Buffers are not concurrency-safe: https://github.com/grpc/grpc-go/blob/5edab9e55414068e74320716117a2659c5d2174e/mem/buffers.go#L39-L41
But it uses an atomic refcount: https://github.com/grpc/grpc-go/blob/5edab9e55414068e74320716117a2659c5d2174e/mem/buffers.go#L75-L80
... but it doesn't use it in an atomically-safe way, as it sets the value to nil when freeing: https://github.com/grpc/grpc-go/blob/5edab9e55414068e74320716117a2659c5d2174e/mem/buffers.go#L160
And if you include the fact that the ref came from a pool: https://github.com/grpc/grpc-go/blob/5edab9e55414068e74320716117a2659c5d2174e/mem/buffers.go#L157 it seems like this code cannot possibly be correct:
- non-racing code is paying for atomics it does not need
- racing code can avoid some checks by using a type that is race-safe
- this might not be reachable in practice though
- racing code can double-`Put` refcounts and buffers (with a racing `Ref`, `Free`, `Ref`, `Free` sequence; sketched below)
- racing code can leak refcount changes across Buffer instances due to pooled reuse
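To make the double-`Put` concern concrete, here's a sketch using a simplified pooled-refcount type I made up for illustration - it's a stand-in for the real buffer, and I'm not claiming this exact sequence is reachable in gRPC today:
import (
	"sync"
	"sync/atomic"
)

var refPool = sync.Pool{New: func() any { return new(atomic.Int32) }}

type pooledBuf struct{ refs *atomic.Int32 }

func (b *pooledBuf) Ref() { b.refs.Add(1) }

func (b *pooledBuf) Free() {
	if b.refs.Add(-1) == 0 {
		refPool.Put(b.refs) // counter returns to the pool...
		b.refs = nil        // ...via a plain, non-atomic write
	}
}

// If goroutine A runs Free() down to zero while goroutine B still holds the
// same *pooledBuf, B can observe the stale refs pointer (the nil write is not
// synchronized), Ref() and Free() the already-Put counter, and Put it a second
// time. Two Puts of one *atomic.Int32 means two later Gets can hand the same
// counter to two unrelated buffers, leaking refcount changes between them.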
So this is introducing difficult-to-investigate failure modes that non-atomic or non-pooled instances would not have. It seems like it'd be better to either just use a plain int field, or not reuse the atomic values across buffers (if the goal is to detect races).
Also I would be quite surprised if pooled *int values perform better than non-pooled, given the cheap initialization and sync.Pool's overhead, but I don't have any evidence either way.
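If anyone wants to measure that, the rough shape would be something like this (a hypothetical benchmark I have not run, so no results claimed):
import (
	"runtime"
	"sync"
	"sync/atomic"
	"testing"
)

func BenchmarkPooledRefcount(b *testing.B) {
	pool := sync.Pool{New: func() any { return new(atomic.Int32) }}
	for i := 0; i < b.N; i++ {
		r := pool.Get().(*atomic.Int32)
		r.Store(1) // pooled values must be re-initialized anyway
		pool.Put(r)
	}
}

func BenchmarkFreshRefcount(b *testing.B) {
	for i := 0; i < b.N; i++ {
		r := new(atomic.Int32)
		r.Store(1)
		runtime.KeepAlive(r) // try to keep the allocation from being optimized away
	}
}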
Is there some benefit to this I'm missing? Maybe something about it can catch tricky races or misuse that would otherwise be missed? It seems misleading and dangerous to me otherwise.
- :x: login: @Groxx / name: Steven L. The commit (0e114fe647006ee6829beb96dc96d95c58e289ec) is not authorized under a signed CLA. Please click here to be authorized. For further assistance with EasyCLA, please submit a support request ticket.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 82.01%. Comparing base (5edab9e) to head (0e114fe). Report is 71 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #8209 +/- ##
==========================================
- Coverage 82.17% 82.01% -0.17%
==========================================
Files 410 410
Lines 40236 40236
==========================================
- Hits 33065 32998 -67
- Misses 5822 5873 +51
- Partials 1349 1365 +16
| Files with missing lines | Coverage Δ |
|---|---|
| mem/buffers.go | 88.18% <ø> (ø) |
@Groxx Can you please sign the CLA?
@Groxx, the PR looks good to me, but for the other part, can you open an issue so we can have a discussion there? Also adding Purnesh for a review.
Deferring to @dfawley for the concurrency concern mentioned.
@PapaCharlie Do you have any response to the concurrency issues raised in the first comment? Do you think we could remove the atomics, or maybe something is needed to address the concerns? Thanks!
@Groxx we will need you to complete the CLA in order to merge this, thanks!
Regarding the atomics, each goroutine is intended to have its own Buffer. In other words, a single buffer is not meant to be accessed concurrently. The docs mention the following:
// Note that a Buffer is not safe for concurrent access and instead each
// goroutine should use its own reference to the data, which can be acquired via
// a call to Ref().
However, you are right that it doesn't explicitly mention that the Ref call should still be synchronous. Namely, the code should look like this:
ref := buf.Ref()
go func() {
// use ref here
}()
buf.Free()
Rather than like this:
go func() {
ref := buf.Ref()
// use ref here
}()
buf.Free()
Though I would argue that the second form counts as "concurrent access" :)
The atomic counters are used to determine when the buffer can be safely freed, as the references get counted down.
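For illustration, the intended use is a strictly synchronous count-up / count-down, something like this (the constructor name is made up, just showing the shape):
buf := newBuffer(data, pool) // refs == 1
ref := buf.Ref()             // refs == 2; hand ref to another goroutine
// ... each goroutine uses only its own reference ...
ref.Free()                   // refs == 1
buf.Free()                   // refs == 0 -> underlying memory returned to the pool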
As for the proposed change in general, the doc looks good to me. The original motivation for ReadUnsafe and SplitUnsafe was that they should only be used in the internal/transport package, where they would be used as expected, but it definitely pays to have better documentation around it!
since it feels like a huge footgun otherwise
Absolutely, hence calling them Unsafe in the first place. However, without that specific unsafe usage being made available by the mem package, it wasn't really possible to introduce this feature without tanking the performance.
If we don't want anyone else using it, we could have an internal package that exposes it to the rest of gRPC instead:
// package grpc/internal/mem
var ReadUnsafe func(dst []byte, buf Buffer) (int, Buffer)
--
// package grpc/mem
func init() {
	internalmem.ReadUnsafe = func(dst []byte, buf Buffer) (int, Buffer) {
		return buf.read(dst)
	}
}
But if this is turning out to be useful for others then I'm fine with leaving it exposed.
Regarding the atomics, each goroutine is intended to have its own Buffer. In other words, a single buffer is not meant to be accessed concurrently. [...] The atomic counters are used to determine when the buffer can be safely freed, as the references get counted down.
So that's actually a good example of why I think trying to be race-safer here with atomics is worse than not using them, even just from a safety standpoint.
(as an up-front note here: all of these can be made safe and can be made to race. I'm just pointing out that a plausible misuse exists)
go func() {
ref := buf.Ref()
// use ref here
}()
buf.Free()
^ this kind of construct is a common enough mistake that https://github.com/golang/go/issues/18022 has led to adding a go vet check to try to prevent it. So yea, agreed - it shouldn't be done.
Forgetting to wait for the goroutine is also fairly common, so that example should have been something like this:
done := make(chan struct{})
go func() {
ref := buf.Ref()
// use ref here
close(done)
}()
<-done
buf.Free()
(I fully realize your example was just sample code and this is reasonable to omit, I'm just showing it to be more concrete)
This is in principle safe with synchronous use, so I think we can ignore it. It's just dubious compared to the obviously-correct Ref() outside the goroutine.
One risk that atomics bring is that in a -race build, code that makes both mistakes will be considered safe:
done := make(chan struct{})
b := buffer()
go func() {
bb := b.Ref()
time.Sleep(time.Second)
// and then do not do anything with the value for some reason,
// maybe some other check no-opped it.
// or just don't use it in a racy way, e.g. maybe only read,
// or only racy-write to different locations
_ = bb
close(done)
}()
b.Free()
<-done // note incorrect waiting, `Ref` and `Free` race
while without atomics it'll be a race violation:
WARNING: DATA RACE
Read at 0x00c000090308 by goroutine 11:
scratch/demos.(*normalbuf).Ref()
.../go/src/scratch/demos/grpc_test.go:33 +0x3c
scratch/demos.TestNormal.func2.1()
.../go/src/scratch/demos/grpc_test.go:86 +0x38
Previous write at 0x00c000090308 by goroutine 10:
scratch/demos.(*normalbuf).Free()
.../go/src/scratch/demos/grpc_test.go:37 +0x108
scratch/demos.TestNormal.func2()
.../go/src/scratch/demos/grpc_test.go:92 +0xf0
The full test:
package demos
import (
"sync/atomic"
"testing"
"time"
)
type atomicbuf struct {
count int64 // must be used atomically
}
func (b *atomicbuf) Ref() *atomicbuf {
atomic.AddInt64(&b.count, 1)
return b
}
func (b *atomicbuf) Free() {
atomic.AddInt64(&b.count, -1)
}
type normalbuf struct {
count int64
}
func (b *normalbuf) Ref() *normalbuf {
b.count++
return b
}
func (b *normalbuf) Free() {
b.count--
}
func TestAtomic(t *testing.T) {
t.Run("safe in both constructs", func(t *testing.T) {
done := make(chan struct{})
b := &atomicbuf{}
bb := b.Ref()
go func() {
time.Sleep(time.Second)
// and then do not do anything with `bb` for some reason
_ = bb
close(done)
}()
b.Free()
// ensuring the goroutine finishes, rather than ending the process early
<-done
})
t.Run("unsafe but allowed with atomics", func(t *testing.T) {
done := make(chan struct{})
b := &atomicbuf{}
go func() {
bb := b.Ref() // moved inside the goroutine
time.Sleep(time.Second)
// and then do not do anything with the value for some reason
_ = bb
close(done)
}()
b.Free()
// ensuring the race occurs, rather than ending the process early
<-done
})
}
func TestNormal(t *testing.T) {
t.Run("safe in both constructs", func(t *testing.T) {
done := make(chan struct{})
b := &normalbuf{}
bb := b.Ref()
go func() {
_ = bb // capture it
time.Sleep(time.Second)
// and then do not do anything with `bb` for some reason
close(done)
}()
b.Free()
// ensuring the goroutine finishes, rather than ending the process early
<-done
})
t.Run("unsafe and blocked by race detector", func(t *testing.T) {
done := make(chan struct{})
b := &normalbuf{}
go func() {
bb := b.Ref() // moved inside the goroutine
time.Sleep(time.Second)
// and then do not do anything with the value for some reason
_ = bb
close(done)
}()
b.Free()
// ensuring the race occurs, rather than ending the process early
<-done
})
}
With racy atomic code like that, it's also possible to trigger double-frees due to the non-atomic b.refs = nil line: that write may not have propagated across goroutines, so stale buf references may Ref(); Free() without triggering the < 0 case with panic("Cannot free freed buffer").
So it's hiding some misuse, and not reliably preventing misuse. I'd argue it should either be safe (with added runtime cost) or not (without), not something in the middle.
re CLA...
... tbh I don't think I can agree with a license that requires someone to share their mailing address. That seems like an absurd overreach.
This is the standard Linux Foundation / CNCF CLA, and the repo requires signing the agreement in order to merge pull requests. If you don't want to sign it, that's fine - we will just close this PR.
There's also some chance you may have attempted to go through the "corporate" license workflow instead of the "individual" one. Asking around, nobody was aware that you'd need to provide a mailing address to agree to the CLA. Unfortunately, there's no easy way for me to test this for myself.
No, it's the individual one: Individual_Contributor_License_Agreement.pdf. And the signing UI has that field marked as required.
I see that a fair number of other CLAs also have that clause, but nothing seems to describe why it's necessary... and there are also a fair number that do not, e.g. Google's open source CLA needs just a name and username.
So yea, that seems like an absurd overreach. Surely only legal jurisdiction is relevant, it's not like CNCF is going to mail me a letter at an unverified address if they want to change a project's license or something... and I thought the whole point of a CLA was to remove that need, by retaining rights for the project.
This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.
Blocked on the CLA issue; being investigated in https://github.com/communitybridge/easycla/issues/4638
@Groxx can you try again? IIUC they removed the address requirement.
This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.
@Groxx Closing the PR as it has been 2 weeks since any activity. Feel free to re-open.