ClusterRunner

Git fetches should be retried on failure

Open josephharrington opened this issue 8 years ago • 3 comments

Occasionally a git fetch will fail due to network flakiness. It's appropriate to do retries with backoff on this type of failure, similar to how we already retry direct network requests.
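For the record, here's a minimal standalone sketch of the retry-with-backoff behavior I have in mind (the `fetch_with_retries` helper, its parameters, and the direct `subprocess` call are all hypothetical for illustration; the real change would wrap ClusterRunner's existing git command execution):

```python
import subprocess
import time


def fetch_with_retries(repo_dir, remote, ref, max_attempts=4, base_delay=2.0):
    """Run `git fetch` for a single ref, retrying with exponential backoff."""
    command = ['git', 'fetch', '--update-head-ok', remote, ref]
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(command, cwd=repo_dir, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        if attempt == max_attempts:
            raise RuntimeError('Could not fetch "{}" from remote "{}" after {} attempts.\n'
                               'Output: {}'.format(ref, remote, max_attempts, result.stderr))
        # Back off before the next attempt: 2s, 4s, 8s, ... with the defaults above.
        time.sleep(base_delay * 2 ** (attempt - 1))
```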

Example failure log is below:

[2016-03-22 14:48:46.944] 6008 NOTICE  Bld411-Preparat git             Command exited with non-zero exit code.
Command: export GIT_ASKPASS="/home/jenkins/.clusterrunner/dist/bin/git_askpass.sh"; export GIT_SSH="/home/jenkins/.clusterrunner/dist/bin/git_ssh.sh"; export PROJECT_DIR="/tmp/clusterrunner_build_symlinks/ff0bffbd-b207-4c83-92d7-42ffad47a4dd"; export GIT_SSH_ARGS="-o BatchMode=yes -o StrictHostKeyChecking=no"; git fetch --update-head-ok origin refs/merge-queue/scm/621dc194bfb4e8a923067a8ecb50e2da465ff7d1
Exit code: 128
Console output: fatal: Couldn't find remote ref refs/merge-queue/scm/621dc194bfb4e8a923067a8ecb50e2da465ff7d1
fatal: The remote end hung up unexpectedly


[2016-03-22 14:48:46.944] 6008 WARNING Bld411-Preparat build           Build 411 failed: Could not fetch specified branch "refs/merge-queue/scm/621dc194bfb4e8a923067a8ecb50e2da465ff7d1" from remote "origin". Command: "git fetch --update-head-ok origin refs/merge-queue/scm/621dc194bfb4e8a923067a8ecb50e2da465ff7d1"
Output: "fatal: Couldn't find remote ref refs/merge-queue/scm/621dc194bfb4e8a923067a8ecb50e2da465ff7d1
fatal: The remote end hung up unexpectedly
"
[2016-03-22 14:48:46.944] 6008 DEBUG   Bld411-Preparat build_fsm       Build 411 transitioned from PREPARING to ERROR
[2016-03-22 14:48:46.944] 6008 ERROR   Bld411-Preparat build_request_h Could not handle build request for build 411.
Traceback (most recent call last):
  File "/home/jenkins/ClusterRunnerBuild/app/master/build_request_handler.py", line 103, in _prepare_build_async
    build.prepare(self._subjob_calculator)
  File "/home/jenkins/ClusterRunnerBuild/app/master/build.py", line 137, in prepare
    self.project_type.fetch_project()
  File "/home/jenkins/ClusterRunnerBuild/app/project_type/project_type.py", line 105, in fetch_project
    self._fetch_project()
  File "/home/jenkins/ClusterRunnerBuild/app/project_type/git.py", line 179, in _fetch_project
    error_msg='Could not fetch specified branch "{}" from remote "{}".'.format(self._branch, self._remote)
  File "/home/jenkins/ClusterRunnerBuild/app/project_type/git.py", line 231, in _execute_git_command_in_repo_and_raise_on_failure
    return self._execute_and_raise_on_failure(command, error_msg, cwd=self._repo_directory, env_vars=env_vars)
  File "/home/jenkins/ClusterRunnerBuild/app/project_type/project_type.py", line 121, in _execute_and_raise_on_failure
    raise RuntimeError('{} Command: "{}"\nOutput: "{}"'.format(message, command, output))
RuntimeError: Could not fetch specified branch "refs/merge-queue/scm/621dc194bfb4e8a923067a8ecb50e2da465ff7d1" from remote "origin". Command: "git fetch --update-head-ok origin refs/merge-queue/scm/621dc194bfb4e8a923067a8ecb50e2da465ff7d1"
Output: "fatal: Couldn't find remote ref refs/merge-queue/scm/621dc194bfb4e8a923067a8ecb50e2da465ff7d1
fatal: The remote end hung up unexpectedly
"

josephharrington avatar Mar 22 '16 22:03 josephharrington

Actually, this may not be network flakiness, since the output specifically says it "Could not fetch specified branch". Retries may still help, since executing this fetch command manually a little while later did work. The root cause seems like some weirdness on the git server side, though.

Just for reference on the necessary span of retries -- there were four builds started on this same hash, and they all failed for the same reason over the course of about 12 seconds. Any retries would need to occur over a longer period than 12 seconds.
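Purely illustrative numbers: with a 2-second base delay that doubles each time, four retries would wait 2 + 4 + 8 + 16 = 30 seconds in total, which comfortably covers a 12-second window like this one.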

I'd still rather find out the root cause on why this ref is not available in the first place.

josephharrington avatar Mar 22 '16 23:03 josephharrington

Have we seen this lately?

cmcginty avatar Jan 11 '18 23:01 cmcginty

Not that I know of. This falls under general network robustness improvements.

josephharrington avatar Jan 12 '18 00:01 josephharrington