github-act-runner

"BrokerMigration message received" is rolling out, which might cause downtime until it is implemented

Open ChristopherHX opened this issue 1 year ago • 12 comments

To reliably implement this, I need access to the rollout of the service update.

Found in this behavior change issue: https://github.com/actions/runner/issues/3366#issuecomment-2197183513

ChristopherHX avatar Jun 29 '24 11:06 ChristopherHX

Since yesterday evening, my self-hosted runners no longer pick up jobs; see e.g. https://github.com/cppfw/opros/actions/runs/9785008298

Could this be related to this breaking change?

igagis avatar Jul 04 '24 08:07 igagis

Yes, this is indeed possible, but I need the runner log to be certain.

If this is the case, access to a repo of your org would help me; otherwise I'm still waiting to be affected myself.

I assume all repos of your org have that feature enabled on the backend, even a temporary one that you would create for me.

ChristopherHX avatar Jul 04 '24 08:07 ChristopherHX

The log might contain an "Ignoring incoming message of type:" line, which would confirm that this is preventing jobs from being run.
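
A minimal sketch of how such a check could be scripted, assuming the runner log is available as plain text lines (for example from a file or from journalctl output). Only the marker string comes from this comment; the function name and the sample lines are made up for illustration.

```python
from typing import Iterable, List

# Substring quoted in the comment above; matching is a plain substring test.
MARKER = "Ignoring incoming message of type:"


def find_ignored_messages(lines: Iterable[str], marker: str = MARKER) -> List[str]:
    """Return every log line that contains the marker, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if marker in line]


# Made-up sample lines (NOT real runner output), only to show the call shape.
sample = [
    "runner: listening for jobs\n",
    "runner: Ignoring incoming message of type: ExampleType\n",
]
hits = find_ignored_messages(sample)
print(f"{len(hits)} matching line(s)")
for hit in hits:
    print(hit)
```

To scan a real log, an open file object can be passed in place of `sample`, since the function accepts any iterable of lines.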

ChristopherHX avatar Jul 04 '24 08:07 ChristopherHX

My minecraft-linux org also got the update: [2024-07-04 09:06:11Z INFO MessageListener] BrokerMigration message received. Polling Broker for messages...

However, not my self-hosted runners.

ChristopherHX avatar Jul 04 '24 09:07 ChristopherHX

For one of my runners, these are the only logs I have for today:

Jul 04 09:07:12 stahl runner[2790276]: Failed to get message, waiting 10 sec before retry: http failure: Http GET Request finished 503 https://pipelines.actions.githubusercontent.com/eMc1GyGdYp1Pn3AIiLBt1AQy39VBhA0ak6WnGv2vWqR163Rx43/_apis/distributedtask/pools/1/messages?api-version=5.1-preview&sessionId=542f6715-7fe1-4f62-90b9-e0b3aff904d1
Jul 04 09:07:12 stahl runner[2790276]: Headers:
Jul 04 09:07:12 stahl runner[2790276]: Cache-Control: no-store
Jul 04 09:07:12 stahl runner[2790276]: Content-Length: 231
Jul 04 09:07:12 stahl runner[2790276]: Content-Type: text/html
Jul 04 09:07:12 stahl runner[2790276]: Date: Thu, 04 Jul 2024 09:07:12 GMT
Jul 04 09:07:12 stahl runner[2790276]: X-Msedge-Ref: Ref A: C7C25762D71C45169960299F1093BFA5 Ref B: STOEDGE1707 Ref C: 2024-07-04T09:07:12Z
Jul 04 09:07:12 stahl runner[2790276]: Body: `{ "message": "GitHub Actions is temporarily unavailable. Please visit https://www.githubstatus.com/ for the status of our services.", "ref": "Ref A: C7C25762D71C45169960299F1093BFA5 Ref B: STOEDGE1707 Ref C: 2024-07-04T09:07:12Z" }`
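
For anyone triaging many lines like these, here is a rough sketch that pulls the HTTP method, status code and URL out of a failure line, and the "message" field out of a Body line. The regular expressions are modelled only on the lines quoted above, so other runner versions may log differently; the function names are hypothetical.

```python
import json
import re
from typing import Optional, Tuple

# Patterns modelled on the log lines quoted above (an assumption, not a documented format).
FAILURE_RE = re.compile(
    r"Http (?P<method>[A-Z]+) Request finished (?P<status>\d{3}) (?P<url>\S+)"
)
BODY_RE = re.compile(r"Body: `(?P<body>.*)`\s*$")


def parse_failure(line: str) -> Optional[Tuple[str, int, str]]:
    """Return (method, status, url) if the line reports a finished HTTP request."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    return match["method"], int(match["status"]), match["url"]


def parse_body_message(line: str) -> Optional[str]:
    """Return the "message" field of a backtick-quoted JSON body, if present."""
    match = BODY_RE.search(line)
    if match is None:
        return None
    try:
        payload = json.loads(match["body"])
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    return payload.get("message")
```

A 503 together with a body pointing at githubstatus.com, as in the log above, suggests a service-side outage rather than a protocol change on the runner side.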

igagis avatar Jul 04 '24 09:07 igagis

Recovering from a GitHub outage might require a runner service restart; I'm not sure otherwise.

ChristopherHX avatar Jul 04 '24 09:07 ChristopherHX

I restarted the runner service, but it didn't help.

igagis avatar Jul 04 '24 09:07 igagis

I have no idea what happened on your end.

Are newly registered runners also broken for you?

The runner has a --trace flag for the run command that enables verbose logging of almost all HTTP traffic, but these logs contain credentials that need to be removed manually.
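
A rough sketch of a helper that could take some of the manual work out of scrubbing such a trace before posting it. The patterns are assumptions about what credentials typically look like in HTTP traces (Authorization headers, token-like query parameters, JSON fields with "token" in the name); they are not derived from the runner's actual trace format, so the result still needs a manual review.

```python
import re

# ASSUMPTION: guessed credential shapes, not taken from the runner's trace output.
_PATTERNS = [
    # "Authorization: Bearer abc" -> "Authorization: <redacted>"
    (re.compile(r"(?i)(authorization:\s*).*"), r"\1<redacted>"),
    # Query parameters such as ?sessionId=... or &token=...
    (re.compile(r"(?i)([?&](?:sessionid|token|access_token)=)[^&\s]+"), r"\1<redacted>"),
    # JSON string fields whose name contains "token"
    (re.compile(r'(?i)("[^"]*token[^"]*"\s*:\s*)"[^"]*"'), r'\1"<redacted>"'),
]


def redact(line: str) -> str:
    """Apply every redaction pattern to one log line and return the result."""
    for pattern, replacement in _PATTERNS:
        line = pattern.sub(replacement, line)
    return line


# Made-up input, only to show the call shape.
print(redact("GET https://example.invalid/x?api-version=5.1&sessionId=542f"))
```

Anything the patterns do not cover (cookies, signed URLs, registration tokens in request bodies) would pass through unchanged, which is why this is a first pass rather than a replacement for reading the log before sharing it.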

Registering runners and running jobs still work for my user and org, so the change mentioned in my original post has not been fully rolled out to me yet.

ChristopherHX avatar Jul 04 '24 10:07 ChristopherHX

Maybe the breaking change is not involved here and the problem is something else... I'll try to observe more.

igagis avatar Jul 04 '24 10:07 igagis

@igagis FWIW, we're seeing the same thing. We only see "BrokerMigration message received. Polling Broker for messages..." and the symptoms (runners not picking up any jobs). This is, however, with actions/runner. We'll see if we can enable some debugging flags too.

martijnbastiaan avatar Jul 04 '24 11:07 martijnbastiaan

My problem is gone now; perhaps it was a temporary outage of something, unrelated to this breaking change.

igagis avatar Jul 04 '24 13:07 igagis

No such luck here. Fingers crossed it fixes itself for us too.

martijnbastiaan avatar Jul 04 '24 14:07 martijnbastiaan

Thanks to @igagis, this could be solved just before my cron job alert system was impacted and sent an alert.

As of one hour ago, I see these log entries from v0.8.0, which means the change rolled out to my private test repo a few hours after I finished the update:

golang_2c7158bb-9e5a-4316-b6fb-0f5e3b5550ec ( https://github.com/ChristopherHX/github-act-runner-test ): Warning: TaskAgentMessage.MessageType is RunnerJobRequest, which has not been properly tested due to missing access to test servers of the new protocol before rollout. Please report any failures to https://github.com/ChristopherHX/github-act-runner/issues.

ChristopherHX avatar Jul 25 '24 20:07 ChristopherHX