ClusterRunner
Git fetches should be retried on failure
Occasionally a git fetch fails due to network flakiness. It would be appropriate to retry with backoff on this type of failure, similar to how we already retry direct network requests.
Example failure log is below:
```
[2016-03-22 14:48:46.944] 6008 NOTICE Bld411-Preparat git Command exited with non-zero exit code.
Command: export GIT_ASKPASS="/home/jenkins/.clusterrunner/dist/bin/git_askpass.sh"; export GIT_SSH="/home/jenkins/.clusterrunner/dist/bin/git_ssh.sh"; export PROJECT_DIR="/tmp/clusterrunner_build_symlinks/ff0bffbd-b207-4c83-92d7-42ffad47a4dd"; export GIT_SSH_ARGS="-o BatchMode=yes -o StrictHostKeyChecking=no"; git fetch --update-head-ok origin refs/merge-queue/scm/621dc194bfb4e8a923067a8ecb50e2da465ff7d1
Exit code: 128
Console output: fatal: Couldn't find remote ref refs/merge-queue/scm/621dc194bfb4e8a923067a8ecb50e2da465ff7d1
fatal: The remote end hung up unexpectedly
[2016-03-22 14:48:46.944] 6008 WARNING Bld411-Preparat build Build 411 failed: Could not fetch specified branch "refs/merge-queue/scm/621dc194bfb4e8a923067a8ecb50e2da465ff7d1" from remote "origin". Command: "git fetch --update-head-ok origin refs/merge-queue/scm/621dc194bfb4e8a923067a8ecb50e2da465ff7d1"
Output: "fatal: Couldn't find remote ref refs/merge-queue/scm/621dc194bfb4e8a923067a8ecb50e2da465ff7d1
fatal: The remote end hung up unexpectedly
"
[2016-03-22 14:48:46.944] 6008 DEBUG Bld411-Preparat build_fsm Build 411 transitioned from PREPARING to ERROR
[2016-03-22 14:48:46.944] 6008 ERROR Bld411-Preparat build_request_h Could not handle build request for build 411.
Traceback (most recent call last):
  File "/home/jenkins/ClusterRunnerBuild/app/master/build_request_handler.py", line 103, in _prepare_build_async
    build.prepare(self._subjob_calculator)
  File "/home/jenkins/ClusterRunnerBuild/app/master/build.py", line 137, in prepare
    self.project_type.fetch_project()
  File "/home/jenkins/ClusterRunnerBuild/app/project_type/project_type.py", line 105, in fetch_project
    self._fetch_project()
  File "/home/jenkins/ClusterRunnerBuild/app/project_type/git.py", line 179, in _fetch_project
    error_msg='Could not fetch specified branch "{}" from remote "{}".'.format(self._branch, self._remote)
  File "/home/jenkins/ClusterRunnerBuild/app/project_type/git.py", line 231, in _execute_git_command_in_repo_and_raise_on_failure
    return self._execute_and_raise_on_failure(command, error_msg, cwd=self._repo_directory, env_vars=env_vars)
  File "/home/jenkins/ClusterRunnerBuild/app/project_type/project_type.py", line 121, in _execute_and_raise_on_failure
    raise RuntimeError('{} Command: "{}"\nOutput: "{}"'.format(message, command, output))
RuntimeError: Could not fetch specified branch "refs/merge-queue/scm/621dc194bfb4e8a923067a8ecb50e2da465ff7d1" from remote "origin". Command: "git fetch --update-head-ok origin refs/merge-queue/scm/621dc194bfb4e8a923067a8ecb50e2da465ff7d1"
Output: "fatal: Couldn't find remote ref refs/merge-queue/scm/621dc194bfb4e8a923067a8ecb50e2da465ff7d1
fatal: The remote end hung up unexpectedly
"
```
Actually, this may not be network flakiness, since the output specifically says it "Could not fetch specified branch". Retries may still help, since running this fetch command manually a little while later did work. The root cause looks like some weirdness on the git server side, though.
Just for reference on the necessary span of retries: four builds were started on this same hash, and they all failed for the same reason over the course of about 12 seconds. Any retries would need to occur over a longer period than 12 seconds.
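To make the timing concrete, here is a rough sketch of what retry-with-backoff around a git command could look like. This is not ClusterRunner's actual code: `backoff_delays`, `run_with_retries`, and the default values are all hypothetical names and numbers picked for illustration. It also assumes that any non-zero exit code is worth retrying, which may be too broad for a genuinely missing ref.

```python
import subprocess
import time


def backoff_delays(max_attempts, initial_delay=2.0, factor=2.0):
    """Return the sleep durations between attempts (one fewer than attempts)."""
    return [initial_delay * factor ** i for i in range(max_attempts - 1)]


def run_with_retries(command, max_attempts=5, initial_delay=2.0, factor=2.0,
                     run=subprocess.run, sleep=time.sleep):
    """Run a command, retrying with exponential backoff on a non-zero exit code.

    Hypothetical helper: `run` and `sleep` are injectable so the retry logic
    can be exercised without touching the network or actually waiting.
    Returns the result of the last attempt; the caller decides whether to raise.
    """
    if max_attempts < 1:
        raise ValueError('max_attempts must be at least 1')
    delays = backoff_delays(max_attempts, initial_delay, factor)
    for attempt in range(max_attempts):
        result = run(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        # Stop on success, or when there are no attempts left.
        if result.returncode == 0 or attempt == max_attempts - 1:
            return result
        sleep(delays[attempt])
```

With these assumed defaults the waits are 2, 4, 8 and 16 seconds, so five attempts span about 30 seconds, comfortably longer than the 12-second window in which all four builds failed. A call might look like `run_with_retries(['git', 'fetch', '--update-head-ok', 'origin', ref])`, where `ref` is whatever branch or ref the build requested.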
I'd still rather find out the root cause of why this ref is not available in the first place.
Have we seen this lately?
Not that I know of. This falls under general network robustness improvements.