grpc-go
Add doc about `mem.*Unsafe` consuming the input buffer
I've been reading the new buffer-pool-related code due to recent OOM problems with other libraries misusing sync.Pool, and the varying behavior between Buffer and SliceBuffer had me confused for quite a while.
Buffer mutates itself when read, and implicitly calls Free when fully consumed:
https://github.com/grpc/grpc-go/blob/5edab9e55414068e74320716117a2659c5d2174e/mem/buffers.go#L194-L200
While SliceBuffer returns a sub-slice but does not modify itself: https://github.com/grpc/grpc-go/blob/5edab9e55414068e74320716117a2659c5d2174e/mem/buffers.go#L263-L267
The only way these *Unsafe funcs can be used correctly is by replacing the input-buffer with the result-buffer(s), as the current code does: https://github.com/grpc/grpc-go/blob/5edab9e55414068e74320716117a2659c5d2174e/internal/transport/transport.go#L139
... which seems worth documenting since it feels like a huge footgun otherwise: you only trigger the self-mutating-and-freeing behavior for relatively large data, which is less common in tests.
Granted, this isn't a technical barrier at all, but the behavior differences between implementations misled me for a while, and this "must not use the input Buffer" requirement doesn't seem obviously necessary when reading the code. Might as well save future people that effort.
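Concretely, the only safe call shape looks something like this (signatures as in buffers.go; the comments are my own reading, not anything the docs currently promise):
n, buf = mem.ReadUnsafe(dst, buf) // result replaces the input; the old buf must not be used or Freed again
left, right := mem.SplitUnsafe(buf, n) // same deal: buf is consumed, only left/right are valid from here on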
On a related note, Buffer feels... a bit concerningly strange in regards to concurrency, and I didn't see it called out in the PR that introduced it so I think it might not have been noticed.
E.g. Buffers are not concurrency-safe: https://github.com/grpc/grpc-go/blob/5edab9e55414068e74320716117a2659c5d2174e/mem/buffers.go#L39-L41
But it uses an atomic refcount: https://github.com/grpc/grpc-go/blob/5edab9e55414068e74320716117a2659c5d2174e/mem/buffers.go#L75-L80
... but it doesn't use it in an atomically-safe way, as it sets the value to nil when freeing: https://github.com/grpc/grpc-go/blob/5edab9e55414068e74320716117a2659c5d2174e/mem/buffers.go#L160
And if you include the fact that the ref came from a pool: https://github.com/grpc/grpc-go/blob/5edab9e55414068e74320716117a2659c5d2174e/mem/buffers.go#L157 it seems like this code cannot possibly be correct:
- non-racing code is paying for atomics it does not need
- racing code can avoid some checks by using a type that is race-safe
- this might not be reachable in practice though
- racing code can double-`Put` refcounts and buffers (with a racing `Ref`, `Free`, `Ref`, `Free` sequence; sketched below)
- racing code can leak refcount changes across Buffer instances due to pooled reuse
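To make the double-`Put` concern concrete, here's a sketch using a simplified pooled-refcount type I made up for illustration - it's a stand-in for the real buffer, and I'm not claiming this exact sequence is reachable in gRPC today:
import (
	"sync"
	"sync/atomic"
)

var refPool = sync.Pool{New: func() any { return new(atomic.Int32) }}

type pooledBuf struct{ refs *atomic.Int32 }

func (b *pooledBuf) Ref() { b.refs.Add(1) }

func (b *pooledBuf) Free() {
	if b.refs.Add(-1) == 0 {
		refPool.Put(b.refs) // counter returns to the pool...
		b.refs = nil        // ...via a plain, non-atomic write
	}
}

// If goroutine A runs Free() down to zero while goroutine B still holds the
// same *pooledBuf, B can observe the stale refs pointer (the nil write is not
// synchronized), Ref() and Free() the already-Put counter, and Put it a second
// time. Two Puts of one *atomic.Int32 means two later Gets can hand the same
// counter to two unrelated buffers, leaking refcount changes between them.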
So this is introducing difficult-to-investigate failure modes that non-atomic or non-pooled instances would not have. It seems like it'd be better to either just use a plain int field, or not reuse the atomic values across buffers (if the goal is to detect races).
Also I would be quite surprised if pooled *int values perform better than non-pooled, given the cheap initialization and sync.Pool's overhead, but I don't have any evidence either way.
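If anyone wants to measure that, the rough shape would be something like this (a hypothetical benchmark I have not run, so no results claimed):
import (
	"runtime"
	"sync"
	"sync/atomic"
	"testing"
)

func BenchmarkPooledRefcount(b *testing.B) {
	pool := sync.Pool{New: func() any { return new(atomic.Int32) }}
	for i := 0; i < b.N; i++ {
		r := pool.Get().(*atomic.Int32)
		r.Store(1) // pooled values must be re-initialized anyway
		pool.Put(r)
	}
}

func BenchmarkFreshRefcount(b *testing.B) {
	for i := 0; i < b.N; i++ {
		r := new(atomic.Int32)
		r.Store(1)
		runtime.KeepAlive(r) // try to keep the allocation from being optimized away
	}
}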
Is there some benefit to this I'm missing? Maybe something about it can catch tricky races or misuse that would otherwise be missed? It seems misleading and dangerous to me otherwise.
- :x: login: @Groxx / name: Steven L. The commit (0e114fe647006ee6829beb96dc96d95c58e289ec) is not authorized under a signed CLA. Please click here to be authorized. For further assistance with EasyCLA, please submit a support request ticket.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 82.01%. Comparing base (5edab9e) to head (0e114fe). Report is 71 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #8209 +/- ##
==========================================
- Coverage 82.17% 82.01% -0.17%
==========================================
Files 410 410
Lines 40236 40236
==========================================
- Hits 33065 32998 -67
- Misses 5822 5873 +51
- Partials 1349 1365 +16
| Files with missing lines | Coverage Δ |
|---|---|
| mem/buffers.go | 88.18% <ø> (ø) |
@Groxx Can you please sign the CLA?
@Groxx, the PR looks good to me, but for the other part, can you open an issue so we can have a discussion there? Also adding Purnesh for a review.
Deferring to @dfawley for the concurrency concern mentioned.
@PapaCharlie Do you have any response to the concurrency issues raised in the first comment? Do you think we could remove the atomics, or maybe something is needed to address the concerns? Thanks!
@Groxx we will need you to complete the CLA in order to merge this, thanks!
Regarding the atomics, each goroutine is intended to have its own Buffer. In other words, a single buffer is not meant to be accessed concurrently. The docs mention the following:
// Note that a Buffer is not safe for concurrent access and instead each
// goroutine should use its own reference to the data, which can be acquired via
// a call to Ref().
However, you are right that it doesn't explicitly mention that the Ref call should still be synchronous. Namely, the code should look like this:
ref := buf.Ref()
go func() {
// use ref here
}()
buf.Free()
Rather than like this:
go func() {
ref := buf.Ref()
// use ref here
}()
buf.Free()
Though I would argue that the second form counts as "concurrent access" :)
The atomic counters are used to determine when the buffer can be safely freed, as the references get counted down.
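For illustration, the intended use is a strictly synchronous count-up / count-down, something like this (the constructor name is made up, just showing the shape):
buf := newBuffer(data, pool) // refs == 1
ref := buf.Ref()             // refs == 2; hand ref to another goroutine
// ... each goroutine uses only its own reference ...
ref.Free()                   // refs == 1
buf.Free()                   // refs == 0 -> underlying memory returned to the pool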
As for the proposed change in general, the doc looks good to me. The original motivation for ReadUnsafe and SplitUnsafe was that they should only be used in the internal/transport package, where they would be used as expected, but it definitely pays to have better documentation around it!
since it feels like a huge footgun otherwise
Absolutely, hence calling them Unsafe in the first place. However, without that specific unsafe usage being made available by the mem package, it wasn't really possible to introduce this feature without tanking the performance.
If we don't want anyone else using it, we could have an internal package that exposes it to the rest of gRPC instead:
// package grpc/internal/mem
var ReadUnsafe func(dst []byte, buf Buffer) (int, Buffer)
--
// package grpc/mem
func init() {
	internalmem.ReadUnsafe = func(dst []byte, buf Buffer) (int, Buffer) {
		return buf.read(dst)
	}
}
But if this is turning out to be useful for others then I'm fine with leaving it exposed.
Regarding the atomics, each goroutine is intended to have its own Buffer. In other words, a single buffer is not meant to be accessed concurrently. [...] The atomic counters are used to determine when the buffer can be safely freed, as the references get counted down.
So that's actually a good example of why I think trying to be race-safer here with atomics is worse than not using them, even just from a safety standpoint.
(as an up-front note here: all of these can be made safe and can be made to race. I'm just pointing out that a plausible misuse exists)
go func() {
ref := buf.Ref()
// use ref here
}()
buf.Free()
^ this kind of construct is a common enough mistake that https://github.com/golang/go/issues/18022 has led to adding a go vet check to try to prevent it. So yea, agreed - it shouldn't be done.
Forgetting to wait for the goroutine is also fairly common, so that example should have been something like this:
done := make(chan struct{})
go func() {
ref := buf.Ref()
// use ref here
close(done)
}()
<-done
buf.Free()
(I fully realize your example was just sample code and this is reasonable to omit, I'm just showing it to be more concrete)
This is in principle safe with synchronous use, so I think we can ignore it. It's just dubious compared to the obviously-correct Ref() outside the goroutine.
One risk that atomics bring is that in a -race build, code that makes both mistakes will be considered safe:
done := make(chan struct{})
b := buffer()
go func() {
bb := b.Ref()
time.Sleep(time.Second)
// and then do not do anything with the value for some reason,
// maybe some other check no-opped it.
// or just don't use it in a racy way, e.g. maybe only read,
// or only racy-write to different locations
_ = bb
close(done)
}()
b.Free()
<-done // note incorrect waiting, `Ref` and `Free` race
while without atomics it'll be a race violation:
WARNING: DATA RACE
Read at 0x00c000090308 by goroutine 11:
scratch/demos.(*normalbuf).Ref()
.../go/src/scratch/demos/grpc_test.go:33 +0x3c
scratch/demos.TestNormal.func2.1()
.../go/src/scratch/demos/grpc_test.go:86 +0x38
Previous write at 0x00c000090308 by goroutine 10:
scratch/demos.(*normalbuf).Free()
.../go/src/scratch/demos/grpc_test.go:37 +0x108
scratch/demos.TestNormal.func2()
.../go/src/scratch/demos/grpc_test.go:92 +0xf0
The full test:
package demos
import (
"sync/atomic"
"testing"
"time"
)
type atomicbuf struct {
count int64 // must be used atomically
}
func (b *atomicbuf) Ref() *atomicbuf {
atomic.AddInt64(&b.count, 1)
return b
}
func (b *atomicbuf) Free() {
atomic.AddInt64(&b.count, -1)
}
type normalbuf struct {
count int64
}
func (b *normalbuf) Ref() *normalbuf {
b.count++
return b
}
func (b *normalbuf) Free() {
b.count--
}
func TestAtomic(t *testing.T) {
t.Run("safe in both constructs", func(t *testing.T) {
done := make(chan struct{})
b := &atomicbuf{}
bb := b.Ref()
go func() {
time.Sleep(time.Second)
// and then do not do anything with `bb` for some reason
_ = bb
close(done)
}()
b.Free()
// ensuring the goroutine finishes, rather than ending the process early
<-done
})
t.Run("unsafe but allowed with atomics", func(t *testing.T) {
done := make(chan struct{})
b := &atomicbuf{}
go func() {
bb := b.Ref() // moved inside the goroutine
time.Sleep(time.Second)
// and then do not do anything with the value for some reason
_ = bb
close(done)
}()
b.Free()
// ensuring the race occurs, rather than ending the process early
<-done
})
}
func TestNormal(t *testing.T) {
t.Run("safe in both constructs", func(t *testing.T) {
done := make(chan struct{})
b := &normalbuf{}
bb := b.Ref()
go func() {
_ = bb // capture it
time.Sleep(time.Second)
// and then do not do anything with `bb` for some reason
close(done)
}()
b.Free()
// ensuring the goroutine finishes, rather than ending the process early
<-done
})
t.Run("unsafe and blocked by race detector", func(t *testing.T) {
done := make(chan struct{})
b := &normalbuf{}
go func() {
bb := b.Ref() // moved inside the goroutine
time.Sleep(time.Second)
// and then do not do anything with the value for some reason
_ = bb
close(done)
}()
b.Free()
// ensuring the race occurs, rather than ending the process early
<-done
})
}
With racy atomic code like that, it's also possible to trigger double-frees due to the non-atomic b.refs = nil line: that write may not have propagated across goroutines, so stale buf references may Ref(); Free() without triggering the < 0 case with panic("Cannot free freed buffer").
So it's hiding some misuse, and not reliably preventing misuse. I'd argue it should either be safe (with added runtime cost) or not (without), not something in the middle.
re CLA...
... tbh I don't think I can agree with a license that requires someone to share their mailing address. That seems like an absurd overreach.
This is the standard Linux Foundation / CNCF CLA, and the repo requires signing the agreement in order to merge pull requests. If you don't want to sign it, that's fine - we will just close this PR.
There's also some chance you may have attempted to go through the "corporate" license workflow instead of the "individual" one. Asking around, nobody was aware that you'd need to provide a mailing address to agree to the CLA. Unfortunately, there's no easy way for me to test this for myself.
No, it's the individual one: Individual_Contributor_License_Agreement.pdf. And the signing UI has that field marked as required.
I see that a fair number of other CLAs also have that clause, but nothing seems to describe why it's necessary... and there are also a fair number that do not, e.g. Google's open source CLA needs just a name and username.
So yea, that seems like an absurd overreach. Surely only legal jurisdiction is relevant, it's not like CNCF is going to mail me a letter at an unverified address if they want to change a project's license or something... and I thought the whole point of a CLA was to remove that need, by retaining rights for the project.
This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.
Blocked on the CLA issue; being investigated in https://github.com/communitybridge/easycla/issues/4638
@Groxx can you try again? IIUC they removed the address requirement.
This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.
@Groxx Closing the PR as it has been 2 weeks since any activity. Feel free to re-open.