bazel-buildfarm
bazel-buildfarm copied to clipboard
Build is stuck infinitely, operation polled
Client Log:
-> % bazel build ray_pkg -c dbg --jobs=32
INFO: Invocation ID: 8efd49d5-9adb-48c0-a3b3-3cee1e6ee547
DEBUG: /home/admin/lianxin.wlx/X/ray/bazel/ray_deps_setup.bzl:81:14: No implicit mirrors used because urls were explicitly provided, name = com_google_protobuf
DEBUG: /home/admin/lianxin.wlx/X/ray/bazel/ray_deps_setup.bzl:81:14: No implicit mirrors used because urls were explicitly provided, name = com_github_spdlog
DEBUG: /home/admin/lianxin.wlx/X/ray/bazel/ray_deps_setup.bzl:81:14: No implicit mirrors used because urls were explicitly provided, name = io_opencensus_proto
WARNING: /home/admin/.cache/bazel/_bazel_admin/65c94b62544ddd3fd6adce5fe1096d8b/external/com_github_brpc_brpc/BUILD:458:19: in cc_library rule @com_github_brpc_brpc//:cc_brpc_idl_options_proto: target '@com_github_brpc_brpc//:cc_brpc_idl_options_proto' depends on deprecated target '@com_google_protobuf//:cc_wkt_protos': Only for backward compatibility. Do not use.
WARNING: /home/admin/.cache/bazel/_bazel_admin/65c94b62544ddd3fd6adce5fe1096d8b/external/com_github_brpc_brpc/BUILD:469:19: in cc_library rule @com_github_brpc_brpc//:cc_brpc_internal_proto: target '@com_github_brpc_brpc//:cc_brpc_internal_proto' depends on deprecated target '@com_google_protobuf//:cc_wkt_protos': Only for backward compatibility. Do not use.
INFO: Analyzed target //:ray_pkg (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
[1 / 8] 3 actions, 0 running
[Sched] Linking python/ray/_raylet.so; 380s
[Sched] Linking raylet; 380s
[Sched] Linking gcs_server; 380s
Server Log:
May 06, 2022 7:38:49 PM build.buildfarm.instance.memory.MemoryInstance lambda$onDispatched$4
INFO: REQUEUEING memory/operations/fe4636fb-aebf-455d-b571-3ff0a98bd253
May 06, 2022 7:40:20 PM build.buildfarm.instance.server.AbstractServerInstance requeueOperation
INFO: memory::requeueOperation(7d16eac7afd5e5288365f493d9d6d5e0d7b01cbf533a0e8e350311a9c722bb9e/144): memory/operations/a6e9a5ff-29a1-4069-bda3-ae968e338cd3
May 06, 2022 7:40:25 PM build.buildfarm.instance.server.AbstractServerInstance requeueOperation
INFO: memory::requeueOperation(d94f1057912560f91b9e8edfd816696180cdb776867699ecb77fac7919fda1a3/144): memory/operations/2a9ad9a7-fb2a-4eb1-ace0-9e9b4bfbb172
May 06, 2022 7:40:30 PM build.buildfarm.instance.server.AbstractServerInstance requeueOperation
INFO: memory::requeueOperation(502c05c7540d7748f3ee4299dcb68171476eff6d23c9b902f0cf1af6ce706546/144): memory/operations/0e355d8e-cbc1-4ace-9ac0-5de73f400520
May 06, 2022 7:40:30 PM build.buildfarm.instance.server.AbstractServerInstance requeueOperation
INFO: memory::requeueOperation(d94f1057912560f91b9e8edfd816696180cdb776867699ecb77fac7919fda1a3/144): memory/operations/8ef6be8e-d28e-4eaa-a385-badee014b154
May 06, 2022 7:41:00 PM build.buildfarm.instance.server.AbstractServerInstance requeueOperation
INFO: memory::requeueOperation(7d16eac7afd5e5288365f493d9d6d5e0d7b01cbf533a0e8e350311a9c722bb9e/144): memory/operations/f25914d5-5f47-41ac-ad1d-c2c687ef01f5
May 06, 2022 7:41:16 PM build.buildfarm.instance.server.AbstractServerInstance requeueOperation
INFO: memory::requeueOperation(502c05c7540d7748f3ee4299dcb68171476eff6d23c9b902f0cf1af6ce706546/144): memory/operations/20764363-b5ed-4504-a9af-e6bb95f0178b
One of the worker Log:
May 06, 2022 7:41:11 PM build.buildfarm.worker.memory.OperationQueueWorkerContext lambda$resumePoller$1
INFO: InputFetcher: poller: Completed Poll for memory/operations/20764363-b5ed-4504-a9af-e6bb95f0178b: OK
May 06, 2022 7:41:12 PM build.buildfarm.worker.memory.OperationQueueWorkerContext lambda$resumePoller$1
INFO: InputFetcher: poller: Completed Poll for memory/operations/20764363-b5ed-4504-a9af-e6bb95f0178b: OK
May 06, 2022 7:41:13 PM build.buildfarm.worker.memory.OperationQueueWorkerContext lambda$resumePoller$1
INFO: InputFetcher: poller: Completed Poll for memory/operations/20764363-b5ed-4504-a9af-e6bb95f0178b: OK
My Observation
502c05c7540d7748f3ee4299dcb68171476eff6d23c9b902f0cf1af6ce706546/144
kept requeueing again and again.
The operation was fetched by the worker, e.g. memory/operations/20764363-b5ed-4504-a9af-e6bb95f0178b
Then I don't know how to debug further. Could anyone help me?
Your worker is attempting to fetch inputs for that operation, due to the InputFetcher context for those polls. Can you run jstack on the worker process and report the results here?
@werkt Sure, here is the jstack output: jstack.log
Your local cas file cache is too small to hold your largest action: try increasing the cas storage on the worker.
It is stuck waiting for an entry expiration. I've been thinking about a way to detect this, but it is actually quite difficult to determine/resolve a deadlock on total file storage. At a minimum, I should have a deadline on the wait.
@werkt Thanks, I changed cas_cache_max_size_bytes
and everything goes well!
@werkt BTW, just curious, would you please say more about why it was stuck? I want to know more about buildfarm so next time I can debug by myself further.
- what is
entry expiration
, why it was waiting for it? - how was deadlock happen?
- what is
entry expiration
, why it was waiting for it?
Waiting on entry expiration is a poor name for the file cache waiting for any blob (content) to become unreferenced, i.e. unused, specifically when trying to perform executions. It waits for this in order to satisfy size constraints from the config. When the file cache would exceed the configured size requirement by inserting a new piece of content, it must wait for space to become available, in the form of an unreferenced entry that it can expire. When one appears (at the end of a sorted list of most to least recently used entries), it is immediately expired and discharged from the size of the file cache, to be replaced by your new content.
- how was deadlock happen?
Your deadlock occurred because within the scope of a set of actions, you created more demand for referenced file cache entries than would be available with your config. That meant that all items were referenced, and no single action execution would ever be letting go of their references to create available capacity for an action to proceed, hence deadlock. This is trivial to account for with a single action fetch, but becomes more complex in resolution with multiple fetchers. We need resource arbitration to account for this properly, but regardless of the arbitration, we will have behavior differences that can only resolve themselves with size increases, which will negate the purpose of the proper solution: to increase capacity. Admittedly, getting to the point of blaming that is too difficult, so it might make the arbitration necessary simply as a debugging tool.