
Bad concurrency for prod deploys

Open janbrasna opened this issue 10 months ago • 5 comments

When many PRs are merged in quick succession, several CI jobs end up running at the same time. Because each job spends a while between checking out and pushing to gh-pages after the build, you run into this issue:

[gh-pages d8c30f64] Deployed with mkdocs, version 1.1.2 from /home/circleci/.local/share/virtualenvs/code-6yRgnUSz/lib/python3.8/site-packages/mkdocs (Python 3.8)
 552 files changed, 859 insertions(+), 859 deletions(-)
To github.com:fastlane/docs.git
 ! [rejected]          gh-pages -> gh-pages (fetch first)
error: failed to push some refs to 'git@github.com:fastlane/docs.git'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

Exited with code exit status 1

So you effectively end up not having the merged changeset published at that point. You can only hope the next push to master won't take too long to happen, to incorporate all the previous (failed) deploys to prod with it… 🤷

I'm not a CircleCI expert so take this with a pinch of salt, but… it seems the earliest commit "wins" the deploy to prod here. Normally you'd have the most recent one cancel the previous ones and eventually "win" the priority to deploy, rather than being blocked by earlier jobs that run concurrently and cause conflicts at the end.

janbrasna avatar Apr 20 '24 18:04 janbrasna

Interesting issue! Despite being less than ideal, in this case I think we could fix it by having the newest commits cancel previous ongoing builds 👀 That'd effectively solve the problem and I don't see significant drawbacks.

Not sure how to achieve this with CircleCI though, and I won't have time to investigate this any time soon 😥 happy to review PRs or other changes in the meantime though!

rogerluan avatar Jun 04 '24 03:06 rogerluan

I'm used to the behaviour needed in GHA but it seems it's not exactly that straightforward in CircleCI:

  • https://discuss.circleci.com/t/limiting-concurrent-workflows-and-jobs/44699
  • https://discuss.circleci.com/t/avoid-concurrent-workflows-on-main-branch/46148/2
  • https://discuss.circleci.com/t/workaround-auto-cancel-redundant-builds-on-the-default-branch/39468/19 🤦‍♂️

So my take would simply be: https://github.com/fastlane/docs/blob/54969f497ed79d396434ffd2e4a77bb21dcce8a6/scripts/ci/deploy.sh#L48-L49 --force

but only in master context / publish CI, not when run otherwise (manually, locally, etc.), as there might be more users of the script, so I'm not confident enough to just propose -f there and call it a day. Leaving that to others to come up with something maybe more sophisticated;]

(This would still be far from perfect, as it doesn't prefer the build that starts last but the one that finishes last, and that's a huge difference ;) Throw in some timeout, connection/performance or cache woes like lately, and you can have an older commit overwriting the output of a newer one just by getting stuck for a bit longer in there…) 🤷‍♂️
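For illustration, a minimal sketch of limiting the force flag to the master publish context. It assumes CircleCI's built-in `CI` and `CIRCLE_BRANCH` environment variables; the function name `push_flags` and the remote name `origin` are hypothetical and not taken from the existing deploy.sh.

```shell
# Hypothetical helper, not part of the existing deploy.sh.
# Prints "--force" only when running in CI on the master branch,
# and prints nothing for manual or local runs.
# Assumes CircleCI's built-in CI and CIRCLE_BRANCH environment variables.
push_flags() {
  if [ "${CI:-}" = "true" ] && [ "${CIRCLE_BRANCH:-}" = "master" ]; then
    echo "--force"
  fi
}

# Intended use in the deploy step (kept as a comment so this file has no side effects):
#   git push $(push_flags) origin gh-pages
```

Outside CI the function prints nothing, so the push stays a regular non-forced push for anyone else running the script.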

janbrasna avatar Jun 06 '24 23:06 janbrasna

Thanks for digging up that info on CircleCI. It seems like they don't offer "auto cancel builds", which is kinda underwhelming 🤕 I wouldn't have expected that.

Some alternative solutions:

  • Do the deploy in a different CI (probably possible in their free tier, given that we barely deploy), even e.g. GHA.
  • Use -f but then also have a cron job that re-deploys once a day just in case 🤷
  • Restart the deployment in case it fails during that step? Basically catch the error, and treat it by retrying. Retry a given amount of times, e.g. 3, 5…
  • Only deploy when creating tags (I dislike this option as it actually decreases the deployment frequency and adds an extra step for us maintainers to deploy changes 🙈 )
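The retry option could look roughly like this. It's a minimal sketch, and the `retry` helper is hypothetical (nothing like it is confirmed to exist in the current scripts).

```shell
# Hypothetical helper: run a command up to MAX_ATTEMPTS times,
# stopping at the first success.
# Usage: retry MAX_ATTEMPTS command [args...]
retry() {
  retry_max=$1
  shift
  retry_attempt=1
  while [ "$retry_attempt" -le "$retry_max" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $retry_attempt of $retry_max failed" >&2
    retry_attempt=$((retry_attempt + 1))
  done
  return 1
}
```

In the deploy script this would wrap the whole "fetch gh-pages, commit, push" sequence, e.g. `retry 3 deploy_once`, where `deploy_once` is a placeholder for those steps. Retrying the push alone wouldn't help, since the local branch would still be behind the remote.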

Thoughts?

rogerluan avatar Jun 08 '24 20:06 rogerluan

Yea, we've had race conditions, e.g. where a workflow would need a Docker image built from the same SHA that might not have been published to the registry yet, so the cron fallback for failed pipelines sounds uncomfortably familiar;]

The build is simple enough to be pushed straight to a deployment environment via GHA, getting rid of the gh-pages branch and its underlying git tree completely, and I'd welcome that, but I don't think you can make a GHA run depend on previous checks, i.e. the CircleCI build & test, passing. The containerised fastlanetools/ci test image is just Docker anyway, so it shouldn't be too prohibitive to move that to GHA as well, keeping the whole CI just here… but it would mean disjoining the pipelines from fastlane/fastlane, which is kinda 💩…

janbrasna avatar Jun 10 '24 17:06 janbrasna

But the problem is pretty trivial in this case. The bundler woes slowed down the CI, and it took 10+ mins from the initial checkout to the actual switch & commit step. So before resorting to bigger changes or force pushing, I'd just try #1250, which adds an extra fetch to check out the fresh gh-pages tip instead of the head that's been lying around for minutes already… (At the same time, bundler's version resolution currently takes only seconds, so that should help avoid conflicts too…)
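Roughly, the extra fetch idea could be sketched like this. This is not the actual diff from #1250; the function name, the `GIT` override, and the default remote `origin` are assumptions for illustration, and where it would slot into deploy.sh is a guess.

```shell
# Hypothetical helper, not the actual change in #1250.
# Refreshes the local gh-pages branch from the remote right before committing,
# so the deploy commit is based on the current remote tip rather than a stale one.
# GIT can be overridden (e.g. GIT=echo) to dry-run the commands.
checkout_fresh_gh_pages() {
  remote=${1:-origin}
  "${GIT:-git}" fetch "$remote" gh-pages &&
    "${GIT:-git}" checkout -B gh-pages FETCH_HEAD
}
```

Using `FETCH_HEAD` avoids relying on a remote-tracking ref being configured in the CI clone. This only narrows the window between fetch and push; two jobs that fetch at nearly the same moment can still conflict.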

janbrasna avatar Jun 10 '24 18:06 janbrasna