bazel-buildfarm

RBE client is stuck when a blob file is missing

Open DarkMatterV opened this issue 1 year ago • 6 comments

buildfarm version: 2.4.0
android rbe: 0.57.0.4865132
buildfarm configuration:

  • server: 3 k8s pods
  • shard workers: more than 10 k8s pods, and execute workers act as CAS workers

One of the worker pods became faulty, so I had to delete it and re-create it with an empty cache directory. However, the ContentAddressableStorage: and ActionCache: entries in Redis were not deleted. Now the Android RBE client gets stuck when it uses the buildfarm server for remote builds, and the client error log shows:

cas.go:1399] Error downloading {blob file hash}/{blob file size}: rpc error: code = NotFound desc = No workers found.
cas.go:1408] Internal tool error - matching map entry

I found that remote-api only considers the possible NOT_FOUND status returned by the GetActionResult, GetTree, and WaitExecution interfaces, but doesn't consider the NOT_FOUND status of the download interface.
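To illustrate the gap described above, here is a minimal client-side sketch in Python. It assumes a client that treats NOT_FOUND on download the same way as an action cache miss and falls back to re-execution. All names (`fetch_outputs`, `BlobNotFound`, and the three callables) are hypothetical stand-ins, not the Android RBE client's or Buildfarm's real API.

```python
class BlobNotFound(Exception):
    """Hypothetical stand-in for a gRPC NOT_FOUND returned by a blob download."""


def fetch_outputs(action_digest, get_action_result, download_blob, execute):
    """Return output blobs, re-executing if a cached result points at missing blobs.

    get_action_result(digest) -> list of output digests, or None on a cache miss
    download_blob(digest)     -> bytes, raises BlobNotFound if no worker has it
    execute(digest)           -> list of output digests from a fresh execution
    """
    cached = get_action_result(action_digest)
    if cached is not None:
        try:
            return [download_blob(d) for d in cached]
        except BlobNotFound:
            # The cached result is stale: handle it like a cache miss
            # instead of failing or waiting indefinitely.
            pass
    fresh = execute(action_digest)
    return [download_blob(d) for d in fresh]
```

With this shape, a stale action cache entry costs one extra execution rather than a stuck build.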

Question 1: Should the server ensure that the result of GetActionResult is consistent with what can actually be downloaded?

I then set ensureOutputsPresent: true and tested by deleting blob files individually. The first build after deleting a blob file was still stuck, and GetActionResult still returned 200. The second build succeeded, and GetActionResult returned NotFound.
Question 2: Why is the first build still stuck when ensureOutputsPresent: true is set?

Question 3: Could the CAS data held by each worker also be indexed in Redis in the reverse direction, similar to a bidirectional linked list, with the worker address as the key and the list of CAS entries as the value?
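A rough sketch of the two-way index proposed in Question 3, written in Python with in-memory dictionaries standing in for Redis sets. This is not Buildfarm's actual Redis schema; `CasIndex` and its methods are hypothetical names used only to show how a reverse mapping makes cleanup cheap when a worker disappears.

```python
from collections import defaultdict


class CasIndex:
    """Two-way index: digest -> workers holding it, and worker -> digests it holds."""

    def __init__(self):
        self.workers_by_blob = defaultdict(set)  # existing direction
        self.blobs_by_worker = defaultdict(set)  # proposed reverse direction

    def add(self, worker, digest):
        self.workers_by_blob[digest].add(worker)
        self.blobs_by_worker[worker].add(digest)

    def remove_worker(self, worker):
        """Drop a worker and return the digests that no longer have any holder."""
        orphaned = []
        for digest in self.blobs_by_worker.pop(worker, set()):
            holders = self.workers_by_blob[digest]
            holders.discard(worker)
            if not holders:
                del self.workers_by_blob[digest]
                orphaned.append(digest)
        return sorted(orphaned)
```

Removing a worker then touches only the digests that worker held, and the returned list identifies blobs whose action cache entries can no longer be served.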

DarkMatterV avatar Oct 16 '23 13:10 DarkMatterV

Supplement to Question 1: Why doesn't the GetActionResult interface verify that the output blob files exist? It might be enough to randomly select one worker from the blob's CAS worker list and check whether that worker still has the blob file.
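The check suggested here could look roughly like the following Python sketch. It assumes the action cache is a plain dictionary and that `blob_exists` is some way of asking a worker whether it still holds a digest. Both are hypothetical simplifications, not Buildfarm's implementation.

```python
def get_action_result_checked(action_digest, action_cache, blob_exists):
    """Return a cached result only if every output blob is still retrievable.

    action_cache: dict mapping action digest -> {"output_digests": [...]}
    blob_exists:  callable(digest) -> bool (hypothetical worker lookup)
    """
    result = action_cache.get(action_digest)
    if result is None:
        return None
    if all(blob_exists(d) for d in result["output_digests"]):
        return result
    # Stale entry: remove it so this and later lookups report a cache miss.
    action_cache.pop(action_digest, None)
    return None
```

Returning a miss on the first lookup, rather than only invalidating for the next one, is what would keep the first build from getting stuck.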

DarkMatterV avatar Oct 18 '23 01:10 DarkMatterV

I've experienced this issue with autoscaling - worker would scale down and bazel client would get stuck waiting forever.

80degreeswest avatar Oct 18 '23 12:10 80degreeswest

> I've experienced this issue with autoscaling - worker would scale down and bazel client would get stuck waiting forever.

Hi, 80degreeswest. Do your execute workers and CAS workers run together (which would mean the blob file is lost but the Redis data is not synchronized)? Which version of Bazel do you use?

I previously tested this scenario with Bazel 6.1 and the build succeeded. The Android RBE client we use, however, got stuck.

In addition, I'd like to ask how you deal with this problem.

DarkMatterV avatar Oct 19 '23 02:10 DarkMatterV

Hi @80degreeswest, I noticed that https://github.com/bazelbuild/bazel-buildfarm/pull/976 could solve my problem, but it has not been merged. I compared build times before and after adding the disk storage check, and so far it doesn't seem to add much time.

DarkMatterV avatar Oct 25 '23 09:10 DarkMatterV

Yes I use cas+execute workers. I see this problem when my workers scale down. We use bazel 5.3.1. To work around it you can enable graceful shutdown, which is available in v2.6.1. This config will wait x seconds for any executions in progress to finish before shutting down the worker. Obviously not going to help if your worker is already broken but it will solve the issue in case of normal shutdown. https://github.com/bazelbuild/bazel-buildfarm/blob/main/examples/config.yml#L128.

80degreeswest avatar Oct 25 '23 12:10 80degreeswest

I'm not sure what the state of that PR is. @luxe may be able to provide more detail on whether it would make sense to revisit it. @shirchen, do you have that change deployed in your cluster?

80degreeswest avatar Oct 25 '23 13:10 80degreeswest