nativelink icon indicating copy to clipboard operation
nativelink copied to clipboard

Upload never finishes (or takes a **very** long time)

Open avdv opened this issue 1 year ago • 18 comments

I am currently trying to use nativelink 0.5.1 with buck2. It works fine when building our backend, but building the frontend on MacOS aarch64 never finishes since it is stuck in the re_upload phase:

[2024-08-30T09:32:37.761+00:00] Waiting on root//frontend:tsc_generated (prelude//platforms:default#200212f73efcd57d) -- action (genrule) [re_upload]
...

[2024-08-30T09:59:15.457+00:00] Waiting on root//frontend:tsc_generated (prelude//platforms:default#200212f73efcd57d) -- action (genrule) [re_upload]
...
[2024-08-30T10:05:15.465+00:00] Waiting on root//frontend:tsc_generated (prelude//platforms:default#200212f73efcd57d) -- action (genrule) [re_upload]

(it's still running and I'll see if it finishes at all; I'll report back what happened)

I have started nativelink locally, using the basic config, just setting additional_environment.

In the build, there are these genrules:

genrule(
    name = "node_modules",
    srcs = [
        "package.json",
        "yarn.lock",
    ],
    out = "node_modules",
    cmd = "$(exe toolchains//:yarn) install && mv node_modules ../out",
    labels = [
        "yarn_install",  # causes this genrule to run locally (needs network access)
    ],
)

genrule(
    name = "tsc_generated",
    srcs = [
        ":generate_api_types",
        ":node_modules",
        "package.json",
        "tsconfig.json",
    ],
    out = "dist",
    cmd = """\
$(exe toolchains//:yarn) run tsc --outDir ${OUT} --rootDir . --declaration true
""",
    default_outs = ["dist"],
)

So, yes the node_modules output is indeed large:

$ fd . -tf buck-out/v2/gen/root/904931f735703749/frontend/__node_modules__/out/node_modules | wc -l
92822

node_modules is heavier than the universe

avdv avatar Aug 30 '24 10:08 avdv

(it's still running and I'll see if it finishes at all; I'll report back what happened)

It just failed because it ran out of disk space. This was on a Github Actions macos-latest-xlarge runner which, according to the docs, should have 14 GiB of SSD storage available. So it seems it used up all the space when trying to upload the node_modules directory.

avdv avatar Aug 30 '24 10:08 avdv

@avdv is it possible to create an isolated project setup that fails or point to something open source that uses buck2 / node that exhibits this failure?

adam-singer avatar Aug 30 '24 16:08 adam-singer

Hi @adam-singer, here's a repository which contains a reproducer: https://github.com/avdv/nl-repro/

I cancelled the workflow run after ~1h, see https://github.com/avdv/nl-repro/actions/runs/10664811435/job/29556846974

So, yes the node_modules output is indeed large:

$ fd . -tf buck-out/v2/gen/root/904931f735703749/frontend/__node_modules__/out/node_modules | wc -l
92822

BTW, these numbers are from my local Linux system. There were no symlinks involved, but maybe that is different on MacOS?

avdv avatar Sep 02 '24 11:09 avdv

Note, I also created a workflow running on Linux. It shows the same symptom: https://github.com/avdv/nl-repro/actions/runs/10679208408/job/29597917819

avdv avatar Sep 03 '24 09:09 avdv

Are we sure this is a NativeLink issue?

MarcusSorealheis avatar Sep 03 '24 21:09 MarcusSorealheis

Are we sure this is a NativeLink issue?

We are currently using bazel-remote-worker and BuildBuddy RE with our project. Both work fine for us (although bazel-remote-worker is very slow). That leads me to think the problem is related to NativeLink...

avdv avatar Sep 04 '24 06:09 avdv

@avdv I'm not totally sure about what happened here because we haven't seen it. My best guess would be a regression that we saw in 0.5.1 is the culprit, but I am not totally sure about the details here.

Can you attempt the same operation with 0.5.3 and let me know if the system behaves as expected?

If it works, feel free to close this issue.

MarcusSorealheis avatar Sep 05 '24 20:09 MarcusSorealheis

On second thought, we'll attempt to run the reproducer on 0.5.3 and report back. Stay tuned!

MarcusSorealheis avatar Sep 05 '24 20:09 MarcusSorealheis

On second thought, we'll attempt to run the reproducer on 0.5.3 and report back. Stay tuned!

Already did that today, I cancelled it after ~50 mins: https://github.com/avdv/nl-repro/actions/runs/10723077069/job/29735546354

avdv avatar Sep 05 '24 21:09 avdv

I started looking into this a couple days ago, but then had a conference to attend. I'll try to get some time tomorrow/this-weekend to deep-dive this. These kind of things usually end up being a config problem somewhere, but I'll keep you tuned.

allada avatar Sep 06 '24 06:09 allada

It appears the issue is that you are using the default nativelink container to run the jobs in, but this container is bare-bones and has pretty much nothing installed in it, not even bash. So buck2 tries to run:

/usr/bin/env bash -e buck-out/v2/gen/root/904931f735703749/__hello_world__/sh/genrule.sh

It ends up failing because bash is not a program known to where nativelink is deploying the command. What i would suggest doing is to instead use nativelink as a layer in your image, or ... since our published nativelink binaries have no dependencies are statically compiled, you can copy it from a host machine (ie: another container) and mount it into the container at runtime.

Buck2 did get a "I am not able to run this command" response, but seems to keep waiting anyway.

allada avatar Sep 06 '24 16:09 allada

Thank you for looking into it, @allada

It appears the issue is that you are using the default nativelink container to run the jobs in, but this container is bare-bones and has pretty much nothing installed in it, not even bash. So buck2 tries to run:

/usr/bin/env bash -e buck-out/v2/gen/root/904931f735703749/__hello_world__/sh/genrule.sh

It ends up failing because bash is not a program known to where nativelink is deploying the command.

OK, good point. I have now removed the hello_world target (which was just a left-over from buck2 init).

Also, I am now extracting the nativelink binary from the docker image, see https://github.com/avdv/nl-repro/actions/runs/10744420726 (BTW, it reports version 0.5.1 although I am using the 0.5.3 Docker image). There are no errors reported, so I suspect that the action is just not run but waiting for the upload to finish...

Note, in my actual project we were running nativelink on MacOS and in my initial repro on Linux (https://github.com/avdv/nl-repro/actions/runs/10679208408) I was building nativelink with nix (https://github.com/avdv/nl-repro/commit/fef57740c45b211d10ab6c66e084b11c587179b1#diff-fde0e5d64aae13964fdda6d47af304cf1a7015cbc17e440ac4a5e662ee1d875eR25)

avdv avatar Sep 06 '24 20:09 avdv

I've been looking into this and when I run it locally I get:

Action failed: root//:tsc_generated (genrule)
Remote command returned non-zero exit code 2
stdout:
yarn run v1.22.22
$ /tmp/nativelink/work/8bbebf93-60fd-41c8-b369-768322fb2a7f/work/buck-out/v2/gen/root/904931f735703749/__tsc_generated__/srcs/node_modules/.bin/tsc --outDir ../out/dist --rootDir . --declaration true
error TS18003: No inputs were found in config file '/tmp/nativelink/work/8bbebf93-60fd-41c8-b369-768322fb2a7f/work/buck-out/v2/gen/root/904931f735703749/__tsc_generated__/srcs/tsconfig.json'. Specified 'include' paths were '["src/**/*"]' and 'exclude' paths were '[]'.
Files:                          0
Lines of Library:               0
Lines of Definitions:           0
Lines of TypeScript:            0
Lines of JavaScript:            0
Lines of JSON:                  0
Lines of Other:                 0
Nodes of Library:               0
Nodes of Definitions:           0
Nodes of TypeScript:            0
Nodes of JavaScript:            0
Nodes of JSON:                  0
Nodes of Other:                 0
Identifiers:                    0
Symbols:                        7
Types:                         57
Instantiations:                 0
Memory used:               21216K
Assignability cache size:       0
Identity cache size:            0
Subtype cache size:             0
Strict subtype cache size:      0
Program time:               0.00s
printTime time:             0.00s
Emit time:                  0.00s
Total time:                 0.00s
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
stderr:
error Command failed with exit code 2.
Build ID: b45ebc64-6f60-496c-a8fb-a8fac0c30988
Network: (GRPC-SESSION-ID)
Jobs completed: 5. Time elapsed: 12.7s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 1, local: 0)
BUILD FAILED
Failed to build 'root//:tsc_generated (prelude//platforms:default#904931f735703749)'

I want to make sure I'm debugging the right thing here, so just to clarify, you are not even able to get to this stage, correct?

If that is the case, one thing I do see just shy of 100k files, 16k directories and 677 symlinks for a total of ~750Mb

This leads me to think it is possible that the worker is being bound by IO and/or kernel calls. github runners give famously slow disk performance. Given this, what I might suggest is trying to mount a tmpfs or ramfs then configuring the nativelink worker to use that mount instead. This may not be the best long term solution, but at least it'll tell us if the github runners are being disk io bound.

allada avatar Sep 07 '24 05:09 allada

I've been looking into this and when I run it locally I get:

Action failed: root//:tsc_generated (genrule)
Remote command returned non-zero exit code 2
stdout:
yarn run v1.22.22
$ /tmp/nativelink/work/8bbebf93-60fd-41c8-b369-768322fb2a7f/work/buck-out/v2/gen/root/904931f735703749/__tsc_generated__/srcs/node_modules/.bin/tsc --outDir ../out/dist --rootDir . --declaration true
error TS18003: No inputs were found in config file '/tmp/nativelink/work/8bbebf93-60fd-41c8-b369-768322fb2a7f/work/buck-out/v2/gen/root/904931f735703749/__tsc_generated__/srcs/tsconfig.json'. Specified 'include' paths were '["src/**/*"]' and 'exclude' paths were '[]'.
Files:                          0
Lines of Library:               0
Lines of Definitions:           0
Lines of TypeScript:            0
Lines of JavaScript:            0
Lines of JSON:                  0
Lines of Other:                 0
Nodes of Library:               0
Nodes of Definitions:           0
Nodes of TypeScript:            0
Nodes of JavaScript:            0
Nodes of JSON:                  0
Nodes of Other:                 0
Identifiers:                    0
Symbols:                        7
Types:                         57
Instantiations:                 0
Memory used:               21216K
Assignability cache size:       0
Identity cache size:            0
Subtype cache size:             0
Strict subtype cache size:      0
Program time:               0.00s
printTime time:             0.00s
Emit time:                  0.00s
Total time:                 0.00s
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
stderr:
error Command failed with exit code 2.
Build ID: b45ebc64-6f60-496c-a8fb-a8fac0c30988
Network: (GRPC-SESSION-ID)
Jobs completed: 5. Time elapsed: 12.7s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 1, local: 0)
BUILD FAILED
Failed to build 'root//:tsc_generated (prelude//platforms:default#904931f735703749)'

This is what I would expect to see too.

I want to make sure I'm debugging the right thing here, so just to clarify, you are not even able to get to this stage, correct?

Yes, correct :+1: .

Running the yarn command remotely is not really necessary, so I replaced the yarn command for tsc_generated with:

ls -lh node_modules/ >&2 ; mkdir ${OUT}; echo done > ${OUT}/stamp

(see https://github.com/avdv/nl-repro/commit/6a9865165169093026780e0036f3040645f3fbfe)

I ran buck build -v2,stderr --prefer-remote :tsc_generated and forgot about it... it ran for over 23 hours :open_mouth: on my local machine before I cancelled it.

$ buck2 build -v2,stderr --prefer-remote :tsc_generated
File changed: root//.#BUCK
File changed: root//BUCK
Build ID: 03766e8d-435a-4427-b54e-6798674ffcc3
Network: (GRPC-SESSION-ID)
Command: build.                      Remaining: 1/8. Time elapsed: 23:08:26.1s
--------------------------------------------------------------------------------
root//:tsc_generated -- action (genrule) [re_upload]               23:08:26.1s

Running the command again, it succeeded in about 55 seconds:

$ buck2 build -v2,stderr --prefer-remote :tsc_generated
stderr for root//:tsc_generated (genrule):
total 0
drwxr-xr-x 1 claudio users   18 Sep  8 16:29 @aashutoshrathi
drwxr-xr-x 1 claudio users   18 Sep  8 16:29 @ampproject
drwxr-xr-x 1 claudio users 2.8K Sep  8 16:29 @babel
drwxr-xr-x 1 claudio users   92 Sep  8 16:29 @bazel
...
Build ID: 207239c9-5bea-45de-8edc-970b0cfae932
Network: (GRPC-SESSION-ID)
Jobs completed: 4. Time elapsed: 54.0s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 1, local: 0)
BUILD SUCCEEDED

If that is the case, one thing I do see just shy of 100k files, 16k directories and 677 symlinks for a total of ~750Mb

This leads me to think it is possible that the worker is being bound by IO and/or kernel calls. github runners give famously slow disk performance. Given this, what I might suggest is trying to mount a tmpfs or ramfs then configuring the nativelink worker to use that mount instead. This may not be the best long term solution, but at least it'll tell us if the github runners are being disk io bound.

Since it also happens on my machine, I don't think this is the issue here. Rather it looks like it got stuck somehow, although everything is already uploaded, since a subsequent run "immediately" succeeds...

Here's the latest run on Github: https://github.com/avdv/nl-repro/actions/runs/10760776248/job/29839142839

avdv avatar Sep 08 '24 14:09 avdv

Oh, I wounder if you are hitting max open files limits.

try running (in same shell):

ulimit -n 65000

Then also increase max_open_files in the nativelink json config to something very high as well, see if that helps.

You are probably running on ?mac, and I believe it has much lower default limits than linux.

allada avatar Sep 08 '24 15:09 allada

Oh, I wounder if you are hitting max open files limits.

try running (in same shell):

ulimit -n 65000

Then also increase max_open_files in the nativelink json config to something very high as well, see if that helps.

You are probably running on ?mac, and I believe it has much lower default limits than linux.

No, I am using Linux (NixOS actually). File open descriptor limits are already set quite high, at least the hard limit is:

$ ulimit -H -n
524288
$ ulimit -S -n
1024

I tried to set the soft limit to 65000 anyway, and also changed the global.max_open_files setting to 61512, rm -rf /tmp/nativelink, restarted nativelink, run buck2 clean and buck2 build :tsc_generated but the issue persists.

avdv avatar Sep 08 '24 16:09 avdv

BTW, I also tried to increase the limit on MacOS too (https://github.com/avdv/nl-repro/commits/main/) . On the first run (https://github.com/avdv/nl-repro/actions/runs/10768327053/job/29857259444) it helped and I got:

[2024-09-09T07:24:41.843+00:00] Waiting on root//:tsc_generated (prelude//platforms:default#200212f73efcd57d) -- action (genrule) [re_upload]
[2024-09-09T07:24:48.941+00:00] Waiting on root//:tsc_generated (prelude//platforms:default#200212f73efcd57d) -- action (genrule) [re_execute]
[2024-09-09T07:24:55.945+00:00] Waiting on root//:tsc_generated (prelude//platforms:default#200212f73efcd57d) -- action (genrule) [re_execute]
[2024-09-09T07:25:03.040+00:00] Waiting on root//:tsc_generated (prelude//platforms:default#200212f73efcd57d) -- action (genrule) [re_execute]
[2024-09-09T07:25:10.041+00:00] Waiting on root//:tsc_generated (prelude//platforms:default#200212f73efcd57d) -- action (genrule) [re_execute]
[2024-09-09T07:25:15.630+00:00] Action failed: root//:tsc_generated (prelude//platforms:default#200212f73efcd57d) (genrule)
[2024-09-09T07:25:15.630+00:00] Remote command returned non-zero exit code 127
[2024-09-09T07:25:15.630+00:00] Stdout: <empty>
[2024-09-09T07:25:15.630+00:00] Stderr:
buck-out/v2/gen/root/200212f73efcd57d/__tsc_generated__/sh/genrule.sh: line 5: yarn: command not found

[2024-09-09T07:25:15.643+00:00] Cache hits: 0%
[2024-09-09T07:25:15.643+00:00] Commands: 3 (cached: 0, remote: 2, local: 1)
[2024-09-09T07:25:15.643+00:00] Network: (GRPC-SESSION-ID)
[2024-09-09T07:25:15.644+00:00]
[2024-09-09T07:25:15.644+00:00] BUILD ERRORS (1)
[2024-09-09T07:25:15.644+00:00] The following actions failed during the execution of this command:
[2024-09-09T07:25:15.644+00:00] Action failed: root//:tsc_generated (prelude//platforms:default#200212f73efcd57d) (genrule)
[2024-09-09T07:25:15.644+00:00] Remote command returned non-zero exit code 127
...

So I replaced the yarn command with ls -lh node_modules/ >&2 ; mkdir ${OUT}; echo done > ${OUT}/stamp similar to the linux branch and it got stuck again (https://github.com/avdv/nl-repro/actions/runs/10768573202/job/29857960048) :sob:

avdv avatar Sep 09 '24 07:09 avdv

Remember you now need to install yarn.

~~If you run buck2 a second time does it start working? When I was testing this, I did notice that it looked like buck2 stopped talking to nativelink after the first handshake, but didn't think much of it, so restarted the buck2 build and it started working. Is this similar to what you are seeing?~~

Edit: It appears that eventually buck2 does upload the files and it does execute.

FYI: The fact that it even got to:

buck-out/v2/gen/root/200212f73efcd57d/__tsc_generated__/sh/genrule.sh: line 5: yarn: command not found

Means that nativelink did try to execute the command.

I suggest running nativelink with RUST_LOG=info. This will be very verbose, but will give a lot of info into what the worker/scheduler/cas is doing

allada avatar Sep 09 '24 16:09 allada

Remember you now need to install yarn.

~If you run buck2 a second time does it start working? When I was testing this, I did notice that it looked like buck2 stopped talking to nativelink after the first handshake, but didn't think much of it, so restarted the buck2 build and it started working. Is this similar to what you are seeing?~

Edit: It appears that eventually buck2 does upload the files and it does execute.

FYI: The fact that it even got to:

buck-out/v2/gen/root/200212f73efcd57d/__tsc_generated__/sh/genrule.sh: line 5: yarn: command not found

Means that nativelink did try to execute the command.

Yes, that was an expected error and not a failed upload -- but let's forget about yarn, I already replaced it with a command that should succeed.

I suggest running nativelink with RUST_LOG=info. This will be very verbose, but will give a lot of info into what the worker/scheduler/cas is doing

I just stumbled over https://github.com/facebook/buck2/pull/986 which seems related. I'll try to see if this helps and report back.

avdv avatar Jun 17 '25 07:06 avdv

This issue was resolved with removing atime a while back, ahead of 0.6.0 here.

MarcusSorealheis avatar Jul 19 '25 05:07 MarcusSorealheis