bazel-buildfarm
When a blob file is missing, the RBE client gets stuck
buildfarm version: 2.4.0
android rbe: 0.57.0.4865132
buildfarm configuration:
- server: 3 k8s pods
- shard workers: more than 10 k8s pods; the execute workers also act as CAS workers
One of the worker pods was faulty, so I had to delete it and re-create it with an empty cache dir. But the ContentAddressableStorage: and ActionCache: data in Redis was not deleted. Now the android rbe client gets stuck when it uses the buildfarm server for a remote build, and the client error log is:
cas.go:1399] Error downloading {blob file hash}/{blob file size}: rpc error: code = NotFound desc = No workers found.
cas.go:1408] Internal tool error - matching map entry
I found that remote-api only handles a possible NOT_FOUND status returned by the GetActionResult, GetTree, and WaitExecution interfaces, but does not handle a NOT_FOUND status from the download interface.
Question 1: Shouldn't the server ensure that the result of GetActionResult is consistent with the result of the download?
I then set ensureOutputsPresent: true and tested by deleting blob files separately. The first build after deleting a blob file still got stuck, and GetActionResult still returned 200. The second build succeeded, and GetActionResult returned NotFound.
Question 2: Why does the first build still get stuck when ensureOutputsPresent: true is set?
Question 3: Could the CAS data held by the workers also be stored in Redis in the reverse direction, similar to a bidirectional linked list, with the worker address as the key and the list of CAS entries on that worker as the value?
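To make the idea in Question 3 concrete, here is a minimal sketch of a two-way index. It is not how buildfarm stores its data: the class name CasIndex and its methods are hypothetical, and plain Python dictionaries of sets stand in for Redis sets so the example runs on its own. The point is only that a worker-to-blobs mapping lets the entries of a deleted worker be cleaned up in one pass.

```python
from collections import defaultdict


class CasIndex:
    """Hypothetical two-way index; dicts of sets stand in for Redis sets."""

    def __init__(self):
        # digest -> set of worker addresses holding that blob
        self.workers_by_blob = defaultdict(set)
        # worker address -> set of digests stored on that worker
        self.blobs_by_worker = defaultdict(set)

    def add(self, worker, digest):
        """Record that `worker` holds the blob identified by `digest`."""
        self.workers_by_blob[digest].add(worker)
        self.blobs_by_worker[worker].add(digest)

    def has_blob(self, digest):
        """True if at least one known worker still holds the blob."""
        return digest in self.workers_by_blob

    def remove_worker(self, worker):
        """Drop every entry of a worker; return digests left with no holder."""
        orphaned = []
        for digest in self.blobs_by_worker.pop(worker, set()):
            holders = self.workers_by_blob.get(digest)
            if holders is None:
                continue
            holders.discard(worker)
            if not holders:
                # No worker holds this blob any more, so forget it entirely.
                del self.workers_by_blob[digest]
                orphaned.append(digest)
        return sorted(orphaned)
```

With such an index, deleting a worker pod would call remove_worker with that pod's address, and any action cache entry that references one of the returned digests could then be invalidated.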
Supplement to Question 1: Why doesn't the GetActionResult interface verify that the output blob files exist? Maybe it could just pick a random worker from the CAS worker list for each blob and check whether that worker has the blob file.
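As a rough illustration of the check proposed here, the sketch below treats a cached result as a miss when any of its output blobs has no live holder. Everything in it is an assumption for illustration: ActionResult, get_action_result, and the dictionary arguments are hypothetical stand-ins, not buildfarm or remote-api types, and it checks all recorded holders rather than a random one.

```python
from dataclasses import dataclass, field


@dataclass
class ActionResult:
    """Hypothetical stand-in for a cached action result."""
    output_digests: list = field(default_factory=list)


def get_action_result(action_key, action_cache, blob_locations, live_workers):
    """Return the cached result only if every output blob is still reachable.

    action_cache:   dict mapping action key -> ActionResult
    blob_locations: dict mapping digest -> set of worker addresses
    live_workers:   set of worker addresses currently registered
    Returns None (a cache miss) when the entry is absent or stale.
    """
    result = action_cache.get(action_key)
    if result is None:
        return None
    for digest in result.output_digests:
        holders = blob_locations.get(digest, set())
        if not (holders & live_workers):
            # An output is gone: evict the stale entry and report a miss
            # so the client re-executes instead of failing on download.
            del action_cache[action_key]
            return None
    return result
```

Under these assumptions the client would see a cache miss on the first build after a blob is lost, rather than a hit followed by a failed download.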
I've experienced this issue with autoscaling - worker would scale down and bazel client would get stuck waiting forever.
Hi, 80degreeswest. Do your execute workers and CAS workers run together (which would indicate that the blob file is lost but the Redis data is not synchronized)? Which version of bazel do you use?
I previously tested this scenario with bazel 6.1 and the build succeeded. The android rbe client we use, however, got stuck.
In addition, I'd like to ask you how you deal with this problem.
Hi, @80degreeswest. I noticed that https://github.com/bazelbuild/bazel-buildfarm/pull/976 could solve my problem, but it has not been merged. I compared build times before and after adding the disk storage check. It doesn't seem to add much time at the moment.
Yes, I use cas+execute workers. I see this problem when my workers scale down. We use bazel 5.3.1. To work around it you can enable graceful shutdown, which is available in v2.6.1. This config will wait x seconds for any executions in progress to finish before shutting down the worker. Obviously that is not going to help if your worker is already broken, but it will solve the issue in the case of a normal shutdown. https://github.com/bazelbuild/bazel-buildfarm/blob/main/examples/config.yml#L128.
I'm not sure what the state of that PR is. @luxe may be able to provide some more detail on whether it would make sense to revisit it. @shirchen do you have that change deployed in your cluster?