proxy.golang.org: Intermittent TLS/Network errors with Google's Module Proxy

Open aaomidi opened this issue 1 year ago • 12 comments

Go version

go version go1.23.1 darwin/arm64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/amir/Library/Caches/go-build'
GOENV='/Users/amir/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/amir/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='darwin'
GOPATH='/Users/amir/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/private/var/tmp/_bazel_amir/3dbd0b78d662a8a6e641b2d6e1f7442e/external/rules_go~~go_sdk~go_sdk'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/private/var/tmp/_bazel_amir/3dbd0b78d662a8a6e641b2d6e1f7442e/external/rules_go~~go_sdk~go_sdk/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.23.1'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/Users/amir/Library/Application Support/go/telemetry'
GCCGO='gccgo'
GOARM64='v8.0'
AR='ar'
CC='clang'
CXX='clang++'
CGO_ENABLED='1'
GOMOD='/private/var/tmp/_bazel_amir/3dbd0b78d662a8a6e641b2d6e1f7442e/execroot/_main/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/l8/54vw83s15sn1h80rkn3xxl1c0000gn/T/go-build1795123535=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

Environment:

  • Bazel with latest rules_go.
  • GitHub Actions

Ran a bazel build ${some_target}

Note that I'm not sure if this problem is limited to bazel, or happens with normal go invocations as well.

What did you see happen?

Intermittently, some modules fail to download from Google's Go Module Proxy:

(22:42:44) ERROR: /home/runner/.bazel/external/gazelle~~go_deps~com_github_aws_aws_sdk_go_v2_service_s3/BUILD.bazel:5:11: @@gazelle~~go_deps~com_github_aws_aws_sdk_go_v2_service_s3//:s3 depends on @@gazelle~~go_deps~com_github_aws_aws_sdk_go_v2_service_internal_checksum//:checksum in repository @@gazelle~~go_deps~com_github_aws_aws_sdk_go_v2_service_internal_checksum which failed to fetch. no such package '@@gazelle~~go_deps~com_github_aws_aws_sdk_go_v2_service_internal_checksum//': gazelle~~go_deps~com_github_aws_aws_sdk_go_v2_service_internal_checksum: fetch_repo: github.com/aws/aws-sdk-go-v2/service/internal/checksum@v1.4.1: Get "https://proxy.golang.org/github.com/aws/aws-sdk-go-v2/service/internal/checksum/@v/v1.4.1.info": net/http: TLS handshake timeout

What did you expect to see?

No errors.

aaomidi avatar Oct 09 '24 16:10 aaomidi

Just had it happen again:

/home/runner/work/spirl/spirl/spirlctl/BUILD.bazel:7:11: //spirlctl:spirlctl_lib depends on @@gazelle~~go_deps~com_github_spf13_cobra//:cobra in repository @@gazelle~~go_deps~com_github_spf13_cobra which failed to fetch. no such package '@@gazelle~~go_deps~com_github_spf13_cobra//': gazelle~~go_deps~com_github_spf13_cobra: fetch_repo: github.com/spf13/cobra@v1.8.1: Get "https://proxy.golang.org/github.com/spf13/cobra/@v/v1.8.1.info": dial tcp 142.250.176.17:443: i/o timeout

aaomidi avatar Oct 09 '24 16:10 aaomidi

isn't this more likely to be a network issue in your local network?

seankhliao avatar Oct 09 '24 17:10 seankhliao

> isn't this more likely to be a network issue in your local network?

This is happening primarily in GitHub Actions. I've also confirmed it's happening to someone who is using self-hosted AWS runners.

This issue seems similar to https://github.com/golang/go/issues/63562

aaomidi avatar Oct 09 '24 21:10 aaomidi

I've now added

common '--repo_env=GOPROXY=https://goproxy.io,https://proxy.golang.org,direct'

to my bazelrc, which effectively just changes GOPROXY. So far we have not hit these errors. I'll keep an eye on this and see if it fixes it. If it does, I suspect some load balancer at Google is struggling.

aaomidi avatar Oct 09 '24 22:10 aaomidi

I have not seen this issue happen since we’ve changed GOPROXY.

I’ll still keep an eye out on it.

aaomidi avatar Oct 11 '24 03:10 aaomidi

We experienced similar errors at Figma with rules_go until we set up a mirror to use as GOPROXY. https://github.com/golang/go/issues/63244

jfirebaugh avatar Oct 11 '24 03:10 jfirebaugh

We're hitting this issue today, from AWS EC2 hosted CI runners (not from github actions).

voxeljorge avatar Oct 14 '24 17:10 voxeljorge

I've switched us over to the following bazelrc incantation:

common '--repo_env=GOPROXY=https://proxy.golang.org|https://goproxy.io|direct'
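The two GOPROXY lists tried in this thread differ in their separator, and that changes fallback behavior. Per the Go modules reference, with a comma the go command moves on to the next entry only after an HTTP 404 or 410 response, while with a pipe it moves on after any error, including timeouts. A minimal Go sketch of that rule follows. `parseGOPROXY` and `triesNext` are hypothetical helpers written for illustration, not the go command's own code.

```go
package main

import (
	"fmt"
	"strings"
)

// proxyEntry is one element of a GOPROXY list (hypothetical type for this sketch).
type proxyEntry struct {
	URL string
	// FallBackOnAnyError is true when the entry is followed by "|",
	// false when it is followed by "," or is the last entry.
	FallBackOnAnyError bool
}

// parseGOPROXY splits a GOPROXY value on "," and "|" and remembers
// which separator followed each entry.
func parseGOPROXY(s string) []proxyEntry {
	var out []proxyEntry
	for s != "" {
		i := strings.IndexAny(s, ",|")
		if i < 0 {
			out = append(out, proxyEntry{URL: s})
			break
		}
		if i > 0 {
			out = append(out, proxyEntry{URL: s[:i], FallBackOnAnyError: s[i] == '|'})
		}
		s = s[i+1:]
	}
	return out
}

// triesNext reports whether the entry after e is consulted once e has failed.
// notFound means the failure was an HTTP 404 or 410; any other failure
// (for example a TLS handshake timeout) is passed as notFound == false.
func triesNext(e proxyEntry, notFound bool) bool {
	return notFound || e.FallBackOnAnyError
}

func main() {
	// Both lists are the ones quoted in this thread.
	lists := []string{
		"https://goproxy.io,https://proxy.golang.org,direct",
		"https://proxy.golang.org|https://goproxy.io|direct",
	}
	for _, l := range lists {
		first := parseGOPROXY(l)[0]
		fmt.Printf("%s: timeout falls through to next entry: %v\n",
			first.URL, triesNext(first, false))
	}
}
```

Under this rule, the pipe-separated list lets a timeout at proxy.golang.org fall through to goproxy.io, whereas a comma-separated list would stop at the first entry that times out.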

aaomidi avatar Oct 14 '24 17:10 aaomidi

CC @golang/tools-team.

dmitshur avatar Oct 14 '24 20:10 dmitshur

We suspect this is a transient network issue. Are folks still experiencing this?

findleyr avatar Oct 16 '24 19:10 findleyr

For now I've switched to using multiple proxies, so I don't think I'll be able to know for certain.

aaomidi avatar Oct 19 '24 00:10 aaomidi

I just ran into this locally again when using only proxy.golang.org.

loading failure: com.google.devtools.build.lib.rules.repository.RepositoryFunction$AlreadyReportedRepositoryAccessException: gazelle~~go_deps~com_github_aws_aws_sdk_go_v2_service_internal_presigned_url: fetch_repo: github.com/aws/aws-sdk-go-v2/service/internal/presigned-url@v1.12.2: Get "https://proxy.golang.org/github.com/aws/aws-sdk-go-v2/service/internal/presigned-url/@v/v1.12.2.info": dial tcp 142.250.81.241:443: i/o timeout

aaomidi avatar Oct 22 '24 21:10 aaomidi

Thanks @aaomidi. How frequently do you see this failure? If you retry, does it succeed?

hyangah avatar Oct 23 '24 15:10 hyangah

If I retry it does succeed. I don't see it that often, but that's partially because of heavy caching on my end. I no longer see this in CI after adding goproxy.io to the list of proxies it can retrieve data from.

I do think this is specifically an issue with how gazelle interacts with the go module proxy, on a subset of the Google LBs.

aaomidi avatar Oct 23 '24 15:10 aaomidi

I frequently see this on CI runners hosted on AWS EC2: multiple times per day this past week. A retry usually succeeds, but sometimes the retry fails as well.

Ryang20718 avatar Oct 23 '24 20:10 Ryang20718

cc: @samthanawalla

ansaba avatar Nov 07 '24 16:11 ansaba

Since I don't think we'll be able to resolve network issues between AWS and Google frontends, it seems like the most tractable solution to this problem would be to incorporate retries into the Go command (#28194). From skimming that issue, it is not clear whether we want to support this.

findleyr avatar Jan 16 '25 18:01 findleyr

This has nothing to do with AWS, for what it’s worth. I’ve had this issue on GitHub Actions, which would be Azure.

I don’t think it’s the network either. This issue is really mainly prominent in Bazel tooling.

aaomidi avatar Jan 16 '25 20:01 aaomidi

@aaomidi got it.

Absent more detailed traces indicating a problem with the go command or module proxy, this doesn't seem actionable from the perspective of the proxy.

findleyr avatar Jan 16 '25 21:01 findleyr

I’m not sure that’s true though, especially given that Bazel and Go are both Google projects. Realistically, I think this is only actionable by Google, by involving the Bazel team in this bug.

Would a repro CI run help with this?

aaomidi avatar Jan 16 '25 23:01 aaomidi

> I don’t think it’s the network either. This issue is really mainly prominent in Bazel tooling.

The errors reported in this issue all seem to be network timeouts. The fact that it mainly shows up in Bazel may just mean that Bazel is the main program contacting the Go proxy. After all, if you are using Bazel there isn't much reason for anything else to contact the Go proxy.

And that suggests that there are network problems between wherever people are running Bazel and the Go proxy.

I don't see how fixes to either Bazel or the Go proxy can affect that.

What might conceivably help is to collect the IP addresses that are failing, both the IP address where Bazel is running and the IP address that it is using to contact the module proxy.

Or, yes, if there is a way to reproduce this reliably that would be helpful.

ianlancetaylor avatar Jan 17 '25 00:01 ianlancetaylor

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

gopherbot avatar Mar 05 '25 19:03 gopherbot