/api/checks/azure failed with 500 Internal Server Error

Open foolip opened this issue 6 years ago • 9 comments

In https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=25580 (for https://github.com/web-platform-tests/wpt/pull/18162) the wpt.fyi hook step failed like this:

curl -f -s -S -d "artifact=safari-preview-affected-tests" -X POST https://wpt.fyi/api/checks/azure/25580
========================== Starting Command Output ===========================
[command]/bin/bash --noprofile --norc /home/vsts/work/_temp/5adb1782-5cfa-49c8-8bcb-4788be714def.sh
curl: (22) The requested URL returned error: 500 Internal Server Error

Finding the logs for this in GCP, I see:

2019-07-30 13:14:29.488 ICT Source branch: refs/pull/18162/merge
2019-07-30 13:14:29.488 ICT Trigger PR branch: reffy-reports/payment-request
2019-07-30 13:14:29.488 ICT Fetching https://dev.azure.com/web-platform-tests/wpt/_apis/build/builds/25580/artifacts
2019-07-30 13:14:29.504 ICT Failed to fetch artifacts for web-platform-tests/wpt build 25580
2019-07-30 13:14:29.504 ICT Get https://dev.azure.com/web-platform-tests/wpt/_apis/build/builds/25580/artifacts: API error 8 (urlfetch: CLOSED)

In other words, fetching the artifacts failed at the time of the request, although the artifacts are there now.

Maybe we need to retry? @Hexcles how does this fit into your life cycle work? Is this an error that would be surfaced somewhere that CI maintainers would see it?

foolip, Jul 30 '19 11:07

I've invoked curl -f -s -S -d "artifact=safari-preview-affected-tests" -X POST https://wpt.fyi/api/checks/azure/25580 again manually just now, with success. So the problem was a transient one, where a retry would probably have worked.

foolip, Jul 30 '19 11:07

This happened in https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=30018 too, losing the results of a whole run.

foolip, Sep 11 '19 13:09

An earlier case was in https://github.com/web-platform-tests/wpt.fyi/issues/1288.

foolip, Oct 23 '19 19:10

This happened in https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=35048 too.

foolip, Oct 23 '19 19:10

It happened in https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=40414 too.

Ping @Hexcles for triage.

foolip, Jan 20 '20 10:01

Stack is:

runtime error: invalid memory address or nil pointer dereference
at google.golang.org/appengine/panic (panic.go:513)
at github.com/web-platform-tests/wpt.fyi/api/azure.processBuild (webhook.go:74)
at github.com/web-platform-tests/wpt.fyi/api/azure.notifyHandler (notify.go:31)
at net/http.HandlerFunc.ServeHTTP (server.go:1964)
at github.com/web-platform-tests/wpt.fyi/shared.WrapHSTS.func1 (routing.go:38)
at net/http.HandlerFunc.ServeHTTP (server.go:1964)
at github.com/gorilla/mux.(*Router).ServeHTTP (mux.go:212)
at net/http.(*ServeMux).ServeHTTP (server.go:2361)
at google.golang.org/appengine/internal.executeRequestSafely (api.go:162)
at google.golang.org/appengine/internal.handleHTTP (api.go:121)
at net/http.HandlerFunc.ServeHTTP (server.go:1964)
at net/http.serverHandler.ServeHTTP (server.go:2741)
at net/http.(*conn).serve (server.go:1847)

stephenmcgruer, Jan 20 '20 17:01

That would be https://github.com/web-platform-tests/wpt.fyi/blob/7fc66c6794278b302e289071eb0d56c968beb7ec/api/azure/webhook.go#L74

Based on https://github.com/web-platform-tests/wpt.fyi/blob/7fc66c6794278b302e289071eb0d56c968beb7ec/api/azure/webhook.go#L33, it looks like build could be nil. IsMasterBranch does check for that:

func (a *Build) IsMasterBranch() bool {
	return a != nil && a.SourceBranch == "refs/heads/master"
}

But then epochBranchesRegex.MatchString(build.SourceBranch) accesses SourceBranch directly, which likely panics.
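
To illustrate the difference, here is a minimal, self-contained sketch. The Build type is reduced to the one field mentioned above, and the regex pattern, isEpochBranch, and derefPanics are hypothetical names made up for this example, not the actual wpt.fyi code. It shows that calling a pointer-receiver method on a nil *Build is safe when the method checks for nil, while reading the field directly panics:

```go
package main

import (
	"fmt"
	"regexp"
)

// Build is a reduced stand-in for the Azure build type; only the
// SourceBranch field discussed in this thread is modelled.
type Build struct {
	SourceBranch string
}

// IsMasterBranch is copied from the snippet above: the nil check makes
// it safe to call on a nil *Build.
func (a *Build) IsMasterBranch() bool {
	return a != nil && a.SourceBranch == "refs/heads/master"
}

// epochBranchesRegex uses an assumed pattern for illustration only.
var epochBranchesRegex = regexp.MustCompile(`^refs/heads/epochs/`)

// isEpochBranch is a hypothetical nil-safe wrapper around the regex check.
func isEpochBranch(b *Build) bool {
	return b != nil && epochBranchesRegex.MatchString(b.SourceBranch)
}

// derefPanics reports whether accessing b.SourceBranch directly panics.
func derefPanics(b *Build) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	_ = epochBranchesRegex.MatchString(b.SourceBranch)
	return false
}

func main() {
	var build *Build // nil, as when fetching the build failed

	fmt.Println("IsMasterBranch on nil:", build.IsMasterBranch())
	fmt.Println("isEpochBranch on nil:", isEpochBranch(build))
	fmt.Println("direct field access panics:", derefPanics(build))
}
```

A nil check inside the helper only avoids the panic. The caller would still have to decide how to report the missing build.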

stephenmcgruer, Jan 20 '20 17:01

@Hexcles your PR changes the 500s into handled cases, but it still doesn't explain why they are happening, right? We are failing to get a build, but why? @foolip mentions that maybe we need retry logic above.

stephenmcgruer, Jan 22 '20 15:01

@stephenmcgruer we know it's the Azure API that fails occasionally (potentially due to consistency issues / race conditions: we are trying to get the build right after it finishes).

And yes it's a good idea to have a retry here. I'm reopening this issue.

Hexcles, Jan 22 '20 16:01