
storage: panic: slice bounds out of range in gRPCWriter.uploadBuffer

Open · winterjung opened this issue 7 months ago

Client

Storage gRPC Client

Description

While uploading a file with the cloud.google.com/go/storage SDK, a "runtime error: slice bounds out of range" panic occurred inside the gRPCWriter.uploadBuffer function. The panic originates from a goroutine created internally by the SDK, so the application has no way to recover from it.

Error Log

panic: runtime error: slice bounds out of range [:-199229440]

goroutine 29284526 [running]:
cloud.google.com/go/storage.(*gRPCWriter).uploadBuffer(0xc1e9b00240, 0x856f5a, 0xc000000, 0x1)
	/go/pkg/mod/cloud.google.com/go/[email protected]/grpc_client.go:2123 +0xbcd
cloud.google.com/go/storage.(*grpcStorageClient).OpenWriter.func1()
	/go/pkg/mod/cloud.google.com/go/[email protected]/grpc_client.go:1223 +0x130
created by cloud.google.com/go/storage.(*grpcStorageClient).OpenWriter in goroutine 150
	/go/pkg/mod/cloud.google.com/go/[email protected]/grpc_client.go:1185 +0x42e

Steps to Reproduce

The exact steps to reproduce are difficult to pinpoint. This is the first occurrence of the error in roughly six months of running this workload, which handles an average of 450 MB/s of uploads per day. The issue might be related to large file uploads or unstable network conditions.

Potential Problem Area and Hypothesis

According to the error log, the panic occurred at line 2123 in the cloud.google.com/go/[email protected]/grpc_client.go file:

// ...
		// Prepare chunk section for upload.
		data := toWrite[sent : sent+bytesToSendInCurrReq] // grpc_client.go:2123
// ...

It appears that the bounds sent : sent+bytesToSendInCurrReq for the toWrite slice went out of range; in this case the upper bound was negative ([:-199229440]). This points to an abnormal value in the bytesToSendInCurrReq or sent variables.

Matching the stack-trace arguments to the uploadBuffer signature:

func (w *gRPCWriter) uploadBuffer(recvd int, start int64, doneReading bool) (*storagepb.Object, int64, error) {

goroutine 29284526 [running]:
cloud.google.com/go/storage.(*gRPCWriter).uploadBuffer(0xc1e9b00240, 0x856f5a, 0xc000000, 0x1)
	/go/pkg/mod/cloud.google.com/go/[email protected]/grpc_client.go:2123 +0xbcd

It seems like recvd was 0x856f5a = 8,744,794, start was 0xc000000 = 201,326,592 (192 MiB), and doneReading was true.
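A quick check of that decoding (the first trace argument, 0xc1e9b00240, is the receiver pointer; the remaining three map to recvd, start, and doneReading):

package main

import "fmt"

func main() {
	recvd := 0x856f5a
	start := int64(0xc000000)
	fmt.Println(recvd)            // 8744794
	fmt.Println(start, start>>20) // 201326592 192, i.e. 192 MiB
	// the final argument 0x1 corresponds to doneReading == true
}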

Relevant code:

// ...
sendBytes: // label this loop so that we can use a continue statement from a nested block
	for {
		bytesNotYetSent := recvd - sent
		remainingDataFitsInSingleReq := bytesNotYetSent <= maxPerMessageWriteSize

		if remainingDataFitsInSingleReq && doneReading {
			lastWriteOfEntireObject = true
		}

		// Send the maximum amount of bytes we can, unless we don't have that many.
		bytesToSendInCurrReq := maxPerMessageWriteSize
		if remainingDataFitsInSingleReq {
			bytesToSendInCurrReq = bytesNotYetSent
		}

		// Prepare chunk section for upload.
		data := toWrite[sent : sent+bytesToSendInCurrReq] // panic occurred here
// ...

Hypothesis:

  1. The recvd (received bytes) or sent (sent bytes) values might have been miscalculated for some reason, causing bytesNotYetSent to become negative. Consequently, bytesToSendInCurrReq could also become negative, leading to a panic when accessing the slice.
  2. The sent value is calculated as writeOffset - start. The writeOffset is updated within the determineOffset function via queryProgress. During this process, writeOffset might be incorrectly set to a value greater than start + recvd. This would cause sent to exceed recvd, eventually making bytesNotYetSent and bytesToSendInCurrReq negative (see the sketch below).
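For illustration, this is how a negative upper bound in that slice expression produces exactly the reported panic message. The buffer size and the negative value below are made up for the sketch and are not the SDK's actual state:

package main

import "fmt"

func main() {
	toWrite := make([]byte, 8744794) // stands in for the upload buffer (recvd bytes)
	sent := 0
	bytesToSendInCurrReq := -199229440 // hypothetical result of a bad sent/recvd calculation

	defer func() {
		// Prints: recovered: runtime error: slice bounds out of range [:-199229440]
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()

	_ = toWrite[sent : sent+bytesToSendInCurrReq]
}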

Regarding Panic Recovery

As seen in lines 1183-1185 of grpc_client.go, the SDK creates its own goroutine for the write operation:

// ...
	// This function reads the data sent to the pipe and sends sets of messages
	// on the gRPC client-stream as the buffer is filled.
	go func() { // grpc_client.go:1185
		defer close(params.donec)
// ...

Because the panic happens on this internal goroutine, the package caller cannot handle it by wrapping its own calls in a deferred recover. Is there any recommended way to recover from this type of panic when it originates inside a goroutine the SDK manages itself?
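For context, this is why a caller-side recover cannot help here: recover only catches a panic raised on the goroutine whose deferred function calls it, so deferring a recover around Writer.Write or Writer.Close does nothing when the panic happens on the SDK's internal goroutine. A minimal illustration (not the SDK's code):

package main

import "fmt"

func main() {
	defer func() {
		// This never fires for the panic below: recover only catches panics
		// raised on the goroutine whose deferred function calls it.
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()

	done := make(chan struct{})
	go func() {
		defer close(done)
		// A panic on this goroutine terminates the whole process; the deferred
		// recover in main is on a different goroutine and cannot intercept it.
		panic("panic inside an internally created goroutine")
	}()
	<-done
}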

Environment Information

  • Docker (multi stage) on AWS EKS
    • build image: golang:1.23-bookworm
    • runtime image: gcr.io/distroless/base-debian12
  • Go version: 1.23
  • go.mod
    • google.golang.org/api v0.210.0
    • google.golang.org/grpc v1.67.1

winterjung · May 09 '25 10:05

Hi @winterjung, thank you for the detailed issue!

We have done a major refactoring of this code since 1.47.0, including handling some edge cases on retries. I would suggest updating to the latest release of cloud.google.com/go/storage and seeing if that resolves the issue.

I don't believe there is a way to recover from this type of panic - it really shouldn't be happening at all and we have not encountered this before.

Do you have an idea of the size of the object for which you got this issue? How often were you seeing this issue?

BrennaEpp · May 13 '25 06:05

@BrennaEpp Thanks for responding. We'll update the google-cloud-go SDK to the latest version and monitor the outcome. As mentioned earlier, this is the first time we've encountered this issue in over 6 months of running in production, so it may take a long time before we can be confident that the root cause has been addressed.

We're using the gRPC client for Google Cloud Storage with a chunk size of 64 MiB. We close the object when its uncompressed size exceeds 1 GiB.

e.g.

// ...
var cli *storage.Client // initialized in main.go
wc := cli.Bucket(bucket).Object(objName).NewWriter(ctx)
wc.ChunkSize = 64 * 1024 * 1024 // 64MiB

// called in another goroutine
if writtenSize > 1 * 1024 * 1024 * 1024 { // 1GiB
	if err := wc.Close(); err != nil {
		// handle/log the error returned when finalizing the upload
	}
}

winterjung · May 13 '25 06:05

Hi @winterjung, once again, thanks for opening the issue. I am closing this as not reproducible. If you encounter this again don't hesitate to re-open or open a new issue.

BrennaEpp · Jul 01 '25 06:07

Hello @BrennaEpp

After updating to Go 1.25.3 and cloud.google.com/go/storage v1.56.3, we started seeing intermittent panics occurring inside the SDK itself. These panics cannot be recovered from within our application.

Has this issue been reported before, or are there any known workarounds or fixes?

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x138beca]

goroutine 10308988 [running]:
cloud.google.com/go/storage.(*gRPCResumableBidiWriteBufferSender).sendBuffer(0x0, {0x26456f8?, 0xc50263cd80?}, {0xc53b7e0000?, 0xc000341740?, 0x261a200?}, 0xc000341740?, 0x0, 0x0)
	/go/pkg/mod/cloud.google.com/go/[email protected]/grpc_writer.go:514 +0x4a
cloud.google.com/go/storage.(*gRPCWriter).uploadBuffer(0xc445ba45b0, {0x26456f8, 0xc50263cd80}, 0xc2fae0d500?, 0x24?, 0x0)
	/go/pkg/mod/cloud.google.com/go/[email protected]/grpc_writer.go:626 +0x209
cloud.google.com/go/storage.(*grpcStorageClient).OpenWriter.func1.(*grpcStorageClient).OpenWriter.func1.1.2({0x26456f8?, 0xc50263cd80?})
	/go/pkg/mod/cloud.google.com/go/[email protected]/grpc_writer.go:185 +0x50
cloud.google.com/go/storage.run.func1()
	/go/pkg/mod/cloud.google.com/go/[email protected]/invoke.go:104 +0x1f0
cloud.google.com/go/internal.retry({0x26456f8, 0xc398bda4b0}, {0x3b9aca00, 0x6fc23ac00, 0x4000000000000000, 0x77359400}, 0xc0a0a35e78, 0x23c92f8)
	/go/pkg/mod/cloud.google.com/[email protected]/internal/retry.go:39 +0x74
cloud.google.com/go/internal.Retry(...)
	/go/pkg/mod/cloud.google.com/[email protected]/internal/retry.go:32
cloud.google.com/go/storage.run({0x26456f8, 0xc398bda4b0}, 0xc12caa5f70, 0xc445e02870, 0x0)
	/go/pkg/mod/cloud.google.com/go/[email protected]/invoke.go:91 +0x317
cloud.google.com/go/storage.(*grpcStorageClient).OpenWriter.func1.1(...)
	/go/pkg/mod/cloud.google.com/go/[email protected]/grpc_writer.go:200
cloud.google.com/go/storage.(*grpcStorageClient).OpenWriter.func1()
	/go/pkg/mod/cloud.google.com/go/[email protected]/grpc_writer.go:220 +0x1f6
created by cloud.google.com/go/storage.(*grpcStorageClient).OpenWriter in goroutine 10309099
	/go/pkg/mod/cloud.google.com/go/[email protected]/grpc_writer.go:168 +0x4ff

This occurred in the same environment as the previously reported issue, with the following dependencies:

  • cloud.google.com/go/storage v1.56.3
  • google.golang.org/api v0.252.0
  • google.golang.org/grpc v1.76.0
  • google.golang.org/protobuf v1.36.10

Please let me know if you need additional information — I’ll be happy to provide more details.

Thank you.

winterjung · Oct 21 '25 05:10

My read of that stack trace is that w.streamSender is a typed nil for the interface gRPCBidiWriteBufferSender, with type *gRPCResumableBidiWriteBufferSender.

I think it happens when we return an error here: https://github.com/googleapis/google-cloud-go/blob/storage/v1.56.3/storage/grpc_writer.go#L475. In that case, we assign a typed nil to the stream sender interface here: https://github.com/googleapis/google-cloud-go/blob/storage/v1.56.3/storage/grpc_writer.go#L605 and we will skip this initialization on the next run through uploadBuffer.

This can only happen for resumable uploads, since the other buffer sender init functions cannot return an error.
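For anyone following along, here is a minimal sketch of that typed-nil pitfall. The type and function names are illustrative stand-ins, not the actual SDK identifiers:

package main

import "errors"

// bufferSender stands in for the gRPCBidiWriteBufferSender interface.
type bufferSender interface {
	sendBuffer() error
}

// resumableSender stands in for *gRPCResumableBidiWriteBufferSender.
type resumableSender struct {
	stream *struct{}
}

func (s *resumableSender) sendBuffer() error {
	if s.stream == nil { // with a nil receiver, dereferencing s here panics
		return errors.New("no stream")
	}
	return nil
}

// newResumableSender models the resumable-upload init path, which can fail.
func newResumableSender(fail bool) (*resumableSender, error) {
	if fail {
		return nil, errors.New("starting the resumable upload failed")
	}
	return &resumableSender{stream: &struct{}{}}, nil
}

func main() {
	var sender bufferSender

	for attempt := 0; attempt < 2; attempt++ {
		if sender == nil {
			s, err := newResumableSender(attempt == 0) // the first attempt fails
			if err != nil {
				sender = s // BUG: stores a typed nil, so sender == nil is false from now on
				continue   // retried, as the SDK's retry loop would do
			}
			sender = s
		}
		// On the retry the initialization above is skipped, and this call runs on a
		// nil *resumableSender: invalid memory address or nil pointer dereference.
		_ = sender.sendBuffer()
	}
}

Leaving the interface unset on the error path (or checking for the typed nil before reuse) lets the retry go back through initialization instead of calling a method on a nil receiver.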

I will check if this is still present in the v1.57.1 refactor.

cjc25 · Oct 31 '25 15:10

This specific issue is fixed in v1.57.1 because https://github.com/googleapis/google-cloud-go/blob/storage/v1.57.1/storage/grpc_writer.go#L210 cannot set the stream sender interface to nil.

I think the patch to fix v1.56.3 is relatively straightforward and probably worthwhile. The issue has been present since https://github.com/googleapis/google-cloud-go/commit/b4d86a52bd319a602115cdb710a743c71494a88b. The prior iteration of the code didn't have an abstraction which encapsulated oneshot vs. resumable uploads, so it didn't have this precise issue.

cjc25 · Oct 31 '25 15:10

https://github.com/googleapis/google-cloud-go/pull/13278 would fix this, I think. @BrennaEpp

cjc25 · Oct 31 '25 15:10