
Long prepare phase, refresh brings the server to a halt

Open knownasilya opened this issue 8 years ago • 7 comments

I've noticed this happening sometimes:

  1. Multiple commits come in
  2. I cancel the first ones so I don't have to wait twice as long
  3. The remaining job sometimes gets stuck in "prepare" phase.
  4. If I refresh the page at this point, it never loads; I get a gateway timeout and have to restart the server.

I'm not sure why it halts in "prepare", but that seems to be the culprit here; maybe an error isn't being handled correctly.

knownasilya avatar Feb 26 '16 20:02 knownasilya

My Strider instance's HTTP server also starts timing out if multiple commit hooks come in at the same time and multiple projects are built. After a couple of minutes it eventually comes back online (without a restart), but in the meantime I see several 502s from my reverse proxy if I try to load the Strider dashboard or send another commit webhook.

I'm not sure where the delay is, but I'm especially curious what's blocking the HTTP thread. I thought most of the prepare phase would be delegated to the workers?

SimonKaluza avatar Sep 13 '17 18:09 SimonKaluza

Did you enable concurrent builds? That should help with multiple projects.

knownasilya avatar Sep 13 '17 18:09 knownasilya

@knownasilya yeah I'm at CONCURRENT_JOBS=4. Would that affect the Strider HTTP server though? I don't mind waiting for the jobs to complete, the problem is that some of the GitHub webhooks are being dropped due to timeouts.

SimonKaluza avatar Sep 13 '17 18:09 SimonKaluza

That's weird. Maybe the timeout isn't sufficient for your proxy? The webhooks respond to GitHub almost instantly, once the job has been scheduled.

knownasilya avatar Sep 13 '17 19:09 knownasilya

I verified it's not a problem with my reverse proxy by running curl localhost:3000 immediately after a project begins the test/deploy cycle. I can actually reproduce it just by manually triggering one job via Test and Deploy in the UI and then immediately running curl localhost:3000. The request takes considerably longer if even one job is being prepared: requests to the Strider index usually take approximately 1-2 seconds, but while a job is being prepared the request takes approximately 30 seconds.

The curl localhost:3000 request takes 3-4 minutes if 3-4 jobs are being started (even with 8 concurrent workers), which is too long for GitHub/BitBucket webhooks.
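
For anyone wondering how a job in "prepare" could slow down unrelated HTTP requests: one possible explanation (an assumption, nothing in this thread confirms it) is that some synchronous work runs on the main Node.js process and stalls the event loop. The sketch below is not Strider code. `blockFor` and `measureLag` are hypothetical helpers that only illustrate the effect: a timer (standing in for an incoming request) cannot fire until the synchronous work finishes.

```typescript
// Hypothetical stand-in for synchronous work done on the main process.
// This is NOT taken from Strider; it only simulates a blocked event loop.
function blockFor(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // Busy-wait: no timers, sockets, or HTTP handlers can run meanwhile.
  }
}

// Schedules a timer for `expectedMs`, runs `work` synchronously, and
// resolves with how many milliseconds late the timer actually fired.
function measureLag(expectedMs: number, work: () => void): Promise<number> {
  return new Promise((resolve) => {
    const start = Date.now();
    setTimeout(() => resolve(Date.now() - start - expectedMs), expectedMs);
    work(); // runs before the timer callback gets a chance to fire
  });
}

async function main(): Promise<void> {
  const idle = await measureLag(10, () => {});
  const busy = await measureLag(10, () => blockFor(300));
  console.log(`timer lag with idle event loop: ${idle} ms`);
  console.log(`timer lag with blocked event loop: ${busy} ms`);
}

main();
```

If the lag measured inside a running Strider process grew in the same way while a job was being prepared, that would point at the main process rather than the workers or the proxy.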

SimonKaluza avatar Sep 13 '17 20:09 SimonKaluza

I downgraded our server to a much older version of Strider, and the problem is resolved. I'm not sure which Strider commits introduced this problem, but the old version we're running again now ( https://github.com/Strider-CD/strider/commit/84a6b878f0b1b3d3528d3f5f19251353f07b4ea7 ) works great.

SimonKaluza avatar Oct 02 '17 15:10 SimonKaluza

I've updated the simple-runner with additional debug statements. If you have time to investigate in the future, please run with DEBUG=strider* to see whether there is a runner error. You'll have to update the simple-runner in the plugins first.

knownasilya avatar Oct 02 '17 15:10 knownasilya