TLS handshake timeout when listing bazel versions from GCS

Open rickeylev opened this issue 2 years ago • 3 comments

I've been regularly seeing an error that looks like some issue with the CI scripts talking to GCS.

The error is a bit confusing because it looks like some problem uploading the test logs ("can't find file"), but also looks like some problem "listing bazel versions in GCS" (whatever that means).

Pressing retry on Buildkite almost always fixes this, so it's some sort of flake.

Agent: bk-windows-bt5g

run: https://buildkite.com/bazel/rules-python-python/builds/4779#_

bazel --output_user_root=C:/b test --flaky_test_attempts=3 --build_tests_only --local_test_jobs=8 --show_progress_rate_limit=5 --curses=yes --color=yes --terminal_columns=143 --show_timestamps --verbose_failures --jobs=30 --announce_rc --experimental_repository_cache_hardlinks --disk_cache= --experimental_build_event_json_file_path_conversion=false --build_event_json_file=C:\temp\tmpikfc45yc\test_bep.json --google_default_credentials --remote_cache=remotebuildexecution.googleapis.com --remote_instance_name=projects/bazel-untrusted/instances/default_instance --remote_timeout=60 --remote_max_connections=200 --remote_default_platform_properties=properties:{name:"cache-silo-key" value:"6a21cacbec775043b8cb5b49849575502cf8f7a8f5d7f28ce34e6c5d2982f753"} --remote_download_toplevel --test_env=LocalAppData --test_env=BAZELISK_USER_AGENT -- ...
--
  | C:\temp\tmpikfc45yc\bazelci-agent.exe artifact upload --delay=5 --mode=buildkite --build_event_json_file=C:\temp\tmpikfc45yc\test_bep.json
  | May 04 17:16:00.138 ERROR bazelci_agent::artifact::upload: The system cannot find the file specified. (os error 2)
  | May 04 17:16:00.138 ERROR bazelci_agent::artifact::upload: The system cannot find the file specified. (os error 2)
  | May 04 17:16:00.138 ERROR bazelci_agent::artifact::upload: The system cannot find the file specified. (os error 2)
  | May 04 17:16:00.138 ERROR bazelci_agent::artifact::upload: The system cannot find the file specified. (os error 2)
  | May 04 17:16:00.138 ERROR bazelci_agent::artifact::upload: The system cannot find the file specified. (os error 2)
  | Error: The system cannot find the file specified. (os error 2)
  | Exception in thread Thread-1:
  | Traceback (most recent call last):
  | File "C:\python3\lib\threading.py", line 973, in _bootstrap_inner
  | self.run()
  | File "C:\python3\lib\threading.py", line 910, in run
  | self._target(*self._args, **self._kwargs)
  | File "c:\b\bk-windows-bt5g\bazel\rules-python-python\bazelci.py", line 2424, in upload_test_logs_from_bep
  | execute_command(
  | File "c:\b\bk-windows-bt5g\bazel\rules-python-python\bazelci.py", line 2474, in execute_command
  | return subprocess.run(
  | File "C:\python3\lib\subprocess.py", line 528, in run
  | raise CalledProcessError(retcode, process.args,
  | subprocess.CalledProcessError: Command '['C:\\temp\\tmpikfc45yc\\bazelci-agent.exe', 'artifact', 'upload', '--delay=5', '--mode=buildkite', '--build_event_json_file=C:\\temp\\tmpikfc45yc\\test_bep.json']' returned non-zero exit status 1.
  | 2023/05/04 17:16:04 could not resolve the version 'latest' to an actual version number: unable to determine latest version: could not list Bazel versions in GCS bucket: could not list GCS objects at https://www.googleapis.com/storage/v1/b/bazel/o?delimiter=/: could not fetch https://www.googleapis.com/storage/v1/b/bazel/o?delimiter=/: Get "https://www.googleapis.com/storage/v1/b/bazel/o?delimiter=/": net/http: TLS handshake timeout
  | bazel test failed with exit code 1

rickeylev avatar May 04 '23 17:05 rickeylev

The underlying "could not resolve the version " issue is from Bazelisk. I'm surprised that bazelci-agent.exe fails, too.

fweikert avatar May 04 '23 18:05 fweikert

"could not resolve the version " issue is from Bazelisk

Can we retry in Bazelisk for such errors?
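To make the idea concrete, here is a minimal sketch of retrying a flaky network call with exponential backoff and jitter. It is written in Python (the language of bazelci.py in the traceback above) and is not Bazelisk's own code; `retry_with_backoff` and all of its parameters are hypothetical names chosen for illustration, and which errors count as retriable is an assumption.

```python
import random
import time


def retry_with_backoff(operation, attempts=4, base_delay=1.0, max_delay=30.0,
                       retriable=(OSError,), sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    Hypothetical helper: only exceptions listed in `retriable` trigger a
    retry; anything else propagates immediately.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retriable:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error.
            # Exponential backoff capped at max_delay, scaled by random
            # jitter so that many agents do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
```

A caller would wrap only the network step, for example `retry_with_backoff(lambda: list_versions())`, where `list_versions` stands in for whatever function performs the GCS listing. In Python, `TimeoutError` and `ConnectionError` are subclasses of `OSError`, so the default `retriable` tuple covers both.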

meteorcloudy avatar May 09 '23 12:05 meteorcloudy

It looks like a variation of this same problem occurs when Bazelisk downloads Bazel, too: https://buildkite.com/bazel/rules-python-python/builds/5493#018a1f23-af2d-4943-b2db-3013e7c3391f

Using Bazel version | 26m 56s
-- | --
  | bazel info output_base
  | 2023/08/22 21:26:02 Downloading https://releases.bazel.build/6.3.2/release/bazel-6.3.2-linux-x86_64...
  | 2023/08/22 21:52:58 could not download Bazel: could not copy from https://releases.bazel.build/6.3.2/release/bazel-6.3.2-linux-x86_64 to /var/lib/buildkite-agent/.cache/bazelisk/downloads/bazelbuild/bazel-6.3.2-linux-x86_64/bin/download450623830: stream error: stream ID 1; INTERNAL_ERROR

It indicates it took 26 minutes to execute that. Quite the grace period! That might be good (because it's doing retries and download resumption), or it might be bad (because it's just trying once and simply timing out after 26m).
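As a quick check of that duration, the gap between the two log timestamps above can be computed directly. This is only a small sketch; `LOG_TIME_FORMAT` and `elapsed` are names made up for illustration.

```python
from datetime import datetime

# Timestamp layout used by the two log lines quoted above.
LOG_TIME_FORMAT = "%Y/%m/%d %H:%M:%S"


def elapsed(start, end):
    """Return the time between two log timestamps as a timedelta."""
    return (datetime.strptime(end, LOG_TIME_FORMAT)
            - datetime.strptime(start, LOG_TIME_FORMAT))


gap = elapsed("2023/08/22 21:26:02", "2023/08/22 21:52:58")
print(gap)  # 0:26:56
```

That matches the 26m 56s shown for the step, so essentially the whole step was spent between the "Downloading" line and the failure.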

I wonder if it's possible to pre-populate the Bazelisk cache? I don't know how these VMs (or whatever they are) are set up, but if they had the Bazelisk cache pre-populated with the commonly used Bazel versions, then no download would be necessary, largely avoiding the issue (at the cost of potentially slower VM setup, I guess?)
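A rough sketch of what such a pre-warm step could look like at image build time, assuming Bazelisk is on the PATH and that running `bazelisk version` with `USE_BAZEL_VERSION` set makes it download that release into its cache. `prewarm_bazelisk` and the version list are hypothetical, and this lets Bazelisk populate its own cache instead of writing into the cache directory by hand.

```python
import os
import subprocess


def prewarm_bazelisk(versions, bazelisk="bazelisk", run=subprocess.run):
    """Ask Bazelisk to fetch each version once; return the ones that failed.

    Hypothetical helper: `run` is injectable so the function can be tried
    out without network access.
    """
    failed = []
    for version in versions:
        # USE_BAZEL_VERSION tells Bazelisk which Bazel release to use.
        env = dict(os.environ, USE_BAZEL_VERSION=version)
        result = run([bazelisk, "version"], env=env, check=False)
        if result.returncode != 0:
            failed.append(version)
    return failed
```

For example, `prewarm_bazelisk(["6.3.2"])` would fetch the version from the log above. Any version it returns was not cached and would still be downloaded at job time.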

FWIW, these sorts of network issues aren't too uncommon. Internally, we'd see Chocolatey installs regularly fail because of all sorts of network issues.

rickeylev avatar Aug 23 '23 16:08 rickeylev