/api/checks/azure failed with 500 Internal Server Error
In https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=25580 (for https://github.com/web-platform-tests/wpt/pull/18162) the wpt.fyi hook step failed like this:
curl -f -s -S -d "artifact=safari-preview-affected-tests" -X POST https://wpt.fyi/api/checks/azure/25580
========================== Starting Command Output ===========================
[command]/bin/bash --noprofile --norc /home/vsts/work/_temp/5adb1782-5cfa-49c8-8bcb-4788be714def.sh
curl: (22) The requested URL returned error: 500 Internal Server Error
Finding the logs for this in GCP, I see:
2019-07-30 13:14:29.488 ICT Source branch: refs/pull/18162/merge
2019-07-30 13:14:29.488 ICT Trigger PR branch: reffy-reports/payment-request
2019-07-30 13:14:29.488 ICT Fetching https://dev.azure.com/web-platform-tests/wpt/_apis/build/builds/25580/artifacts
2019-07-30 13:14:29.504 ICT Failed to fetch artifacts for web-platform-tests/wpt build 25580
2019-07-30 13:14:29.504 ICT Get https://dev.azure.com/web-platform-tests/wpt/_apis/build/builds/25580/artifacts: API error 8 (urlfetch: CLOSED)
In other words, fetching the artifacts failed. Checking now, the artifacts are there.
Maybe we need to retry? @Hexcles how does this fit into your life cycle work? Is this an error that would be surfaced somewhere that CI maintainers would see it?
I've invoked curl -f -s -S -d "artifact=safari-preview-affected-tests" -X POST https://wpt.fyi/api/checks/azure/25580 again manually just now, with success. So the problem was a transient one, where a retry would probably have worked.
This happened in https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=30018 too, losing the results of a whole run.
An earlier case was in https://github.com/web-platform-tests/wpt.fyi/issues/1288.
This happened in https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=35048 too.
It happened in https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=40414 too.
Ping @Hexcles for triage.
Stack is:
runtime error: invalid memory address or nil pointer dereference
at google.golang.org/appengine/panic (panic.go:513)
at github.com/web-platform-tests/wpt.fyi/api/azure.processBuild (webhook.go:74)
at github.com/web-platform-tests/wpt.fyi/api/azure.notifyHandler (notify.go:31)
at net/http.HandlerFunc.ServeHTTP (server.go:1964)
at github.com/web-platform-tests/wpt.fyi/shared.WrapHSTS.func1 (routing.go:38)
at net/http.HandlerFunc.ServeHTTP (server.go:1964)
at github.com/gorilla/mux.(*Router).ServeHTTP (mux.go:212)
at net/http.(*ServeMux).ServeHTTP (server.go:2361)
at google.golang.org/appengine/internal.executeRequestSafely (api.go:162)
at google.golang.org/appengine/internal.handleHTTP (api.go:121)
at net/http.HandlerFunc.ServeHTTP (server.go:1964)
at net/http.serverHandler.ServeHTTP (server.go:2741)
at net/http.(*conn).serve (server.go:1847)
That would be https://github.com/web-platform-tests/wpt.fyi/blob/7fc66c6794278b302e289071eb0d56c968beb7ec/api/azure/webhook.go#L74
Based on https://github.com/web-platform-tests/wpt.fyi/blob/7fc66c6794278b302e289071eb0d56c968beb7ec/api/azure/webhook.go#L33, looks like build could be nil. IsMasterBranch does seem to check for that:
func (a *Build) IsMasterBranch() bool {
	return a != nil && a.SourceBranch == "refs/heads/master"
}
But then epochBranchesRegex.MatchString(build.SourceBranch) accesses SourceBranch directly, which panics with a nil pointer dereference if build is nil.
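To illustrate the difference, here is a minimal sketch, not the actual wpt.fyi code. It assumes a stripped-down Build struct with only SourceBranch, a stand-in regex pattern, and a hypothetical helper named shouldProcess that checks for nil once before any field access:

```go
package main

import (
	"fmt"
	"regexp"
)

// Build mirrors only the field discussed here; the real struct has more.
type Build struct {
	SourceBranch string
}

// IsMasterBranch is safe to call on a nil *Build, as in the snippet above.
func (a *Build) IsMasterBranch() bool {
	return a != nil && a.SourceBranch == "refs/heads/master"
}

// epochBranchesRegex uses a stand-in pattern (assumption), not the real one.
var epochBranchesRegex = regexp.MustCompile(`^refs/heads/epochs/.*$`)

// shouldProcess is a hypothetical helper: it rejects a nil build up front,
// so the direct access to build.SourceBranch below cannot panic.
func shouldProcess(build *Build) (bool, error) {
	if build == nil {
		return false, fmt.Errorf("build is nil; fetching it probably failed")
	}
	return build.IsMasterBranch() || epochBranchesRegex.MatchString(build.SourceBranch), nil
}

func main() {
	var missing *Build
	fmt.Println(missing.IsMasterBranch()) // false: the method guards against nil

	_, err := shouldProcess(missing)
	fmt.Println(err != nil) // true: an error is returned instead of a panic

	ok, _ := shouldProcess(&Build{SourceBranch: "refs/heads/master"})
	fmt.Println(ok) // true
}
```

Returning an error lets the handler respond with a handled failure rather than a 500 from a recovered panic, though it does not address why the build was nil in the first place.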
@Hexcles your PR changes the 500s into handled cases, but it still doesn't explain why they are happening, right? We are failing to get a build, but why? @foolip suggested above that we may need retry logic.
@stephenmcgruer we know it's the Azure API that fails occasionally (potentially due to consistency issues / race conditions: we are trying to get the build right after it finishes).
And yes it's a good idea to have a retry here. I'm reopening this issue.