pants icon indicating copy to clipboard operation
pants copied to clipboard

`experimental_shell_command` sometimes spins for a long time on "Scheduling" step

Open danxmoran opened this issue 1 year ago • 5 comments

Describe the bug

I've been trying to plug our front-end build process into Pants in a minimal way so we can include static assets in our container builds. I have an experimental_shell_command that runs yarn install as part of this. The command ran once successfully, but e ever since that point when I try to re-run it Pants will spin for awhile on:

Scheduling: Running experimental_shell_command src/projects:yarn-install

When I run with -ldebug I see:

22:19:57.64 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=3
22:19:58.65 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=4
22:19:59.65 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=5
22:20:00.65 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=6
22:20:01.66 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=7
22:20:02.67 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=8
22:20:03.67 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=9
22:20:04.67 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=10
22:20:05.67 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=11
22:20:06.67 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=12
22:20:07.68 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=13
22:20:08.68 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=14
22:20:09.69 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=15
22:20:10.69 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=16
22:20:11.69 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=17
22:20:12.69 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=18
22:20:13.70 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=19
22:20:14.70 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=20
22:20:15.70 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=21
22:20:16.71 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=22
22:20:17.71 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=23
22:20:18.71 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=24
22:20:19.71 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=25
22:20:20.72 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=26
22:20:21.72 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=27
22:20:22.73 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=28
22:20:23.73 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=29
22:20:24.73 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=30
22:20:25.74 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=31
22:20:26.74 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=32
22:20:27.74 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=33
22:20:28.75 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=34
22:20:29.75 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=35
22:20:30.76 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=36
22:20:31.76 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=37
22:20:32.76 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=38
22:20:33.76 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=39
22:20:34.76 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=40
22:20:35.77 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=41
22:20:36.77 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=42
22:20:37.78 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=43
22:20:38.78 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=44
22:20:39.78 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=45
22:20:40.78 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=46
22:20:41.78 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=47
22:20:42.79 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=48
22:20:43.79 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=49
22:20:44.79 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=50
22:20:45.79 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=51
22:20:46.79 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=52
22:20:47.80 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=53
22:20:48.80 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=54
22:20:49.80 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=55
22:20:50.80 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=56
22:20:51.80 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=57
22:20:52.81 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=58
22:20:53.81 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=59
22:20:54.81 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=60
22:20:55.81 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=61
22:20:56.82 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=62
22:20:57.82 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=63
22:20:58.82 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=64
22:20:59.82 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=65
22:21:00.83 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=66
22:21:01.83 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=67
22:21:02.83 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=68
22:21:03.83 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=69
22:21:04.83 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=70
22:21:05.84 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=71
22:21:06.84 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=72
22:21:07.84 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=73
22:21:08.84 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=74
22:21:09.85 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=75
22:21:10.85 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=76
22:21:11.85 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=77
22:21:12.85 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=78
22:21:13.86 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=79
22:21:14.86 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=80
22:21:15.86 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=81
22:21:16.86 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=82
22:21:17.86 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=83
22:21:18.87 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=84
22:21:19.87 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=85
22:21:20.87 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=86

The step eventually (after ~200s) finishes and moves on.

Pants version

v2.13.0

OS

MacOS

Additional info

https://app.toolchain.com/organizations/color/repos/color/builds/pants_run_2022_09_12_08_55_13_980_5f4261a10dfe4c219fc7bde55d1031de/

danxmoran avatar Sep 12 '22 12:09 danxmoran

@danxmoran : Would you mind re-running with --streaming-workunits-level=trace?

Part of the issue appears to be that the relevant process isn't marked level=LogLevel.DEBUG: https://github.com/pantsbuild/pants/blob/4d9fe544663082c4c5816945bc3ac457c43a153c/src/python/pants/backend/shell/shell_command.py#L225-L234 ... I'll get a change out to fix that.

stuhood avatar Sep 13 '22 20:09 stuhood

Opened #16848 : it fixes the display aspect of this issue, but not the fact that capturing inputs/outputs took a long time (...which should become more evident with a trace-level trace).

stuhood avatar Sep 13 '22 20:09 stuhood

From the additional trace-level trace that you provided offline, it looks like at least two things are happening:

  1. eager_fetch_action_cache is taking a very long time to determine that the cache entry is loadable locally.
    • I assume that you are running with cache_content_behavior="fetch" (the default)? If so, you might try capturing another trace with cache_content_behavior="validate" to see how much of a difference that makes. If it's significant, we can probably adjust that codepath to unconditionally use validate for a local cache.
  2. There is a setup_sandbox workunit (which is disconnected from any relevant parent workunit: likely a bug), which could only really be running for so long because it is attempting to materialize a very large set of inputs for a process.
    • Even after potentially figuring out why the workunit is detached, I don't really see a good workaround for this one today... but see below.

Both of these point to the fact that having tons of small files as inputs/outputs to processes is currently relatively inefficient. Using zips/tars is a workaround, but in the medium term it's possible that we'll want to support using FUSE for these use-cases (see).

stuhood avatar Sep 13 '22 21:09 stuhood

  1. There is a setup_sandbox workunit (which is disconnected from any relevant parent workunit: likely a bug), which could only really be running for so long because it is attempting to materialize a very large set of inputs for a process.

I was wrong about #16848, so this is likely actually the root cause of the display issue: the stray setup_sandbox workunit runs for another ~100s after we've hit the cache for the process.

stuhood avatar Sep 13 '22 21:09 stuhood

  • I assume that you are running with cache_content_behavior="fetch" (the default)? If so, you might try capturing another trace with cache_content_behavior="validate" to see how much of a difference that makes. If it's significant, we can probably adjust that codepath to unconditionally use validate for a local cache.

@danxmoran confirmed that switching from "fetch" to "validate" dropped the time for the local cache hit by 95% (which is probably at least partially due to #16627: thanks!).

Without changing the semantics of CacheContentBehavior (because we do still want to guarantee that all content exists locally when "fetch" is set), this suggests that we should optimize the implementation of "fetch" to be more like the implementation of "validate" -- by using exists checks locally, and then a more explicit method (rather than abusing load_bytes_with) for content that doesn't exist locally.


2. There is a setup_sandbox workunit (which is disconnected from any relevant parent workunit: likely a bug), which could only really be running for so long because it is attempting to materialize a very large set of inputs for a process.

The next step for this one is fuzzier... but it probably involves trying to get a repro in pantsbuild/pants that I can iterate on with logging.

stuhood avatar Sep 14 '22 21:09 stuhood

Without changing the semantics of CacheContentBehavior (because we do still want to guarantee that all content exists locally when "fetch" is set), this suggests that we should optimize the implementation of "fetch" to be more like the implementation of "validate" -- by using exists checks locally, and then a more explicit method (rather than abusing load_bytes_with) for content that doesn't exist locally.

Filed #17448 for this one.

stuhood avatar Nov 02 '22 22:11 stuhood

I apparently forgot to link it here, but I did set up a public repository that repros the current behavior: https://github.com/danxmoran/pants-yarn-install-repro

danxmoran avatar Nov 02 '22 22:11 danxmoran

@danxmoran I get a bunch of errors of the following form when I try to run :run-yarn-install, both with ./pants as well as pants_with_sources. Is this something you've seen before?

➤ YN0001: │ Error [ERR_STREAM_PREMATURE_CLOSE]: Premature close
    at new NodeError (node:internal/errors:393:5)
    at PassThrough.onclose (node:internal/streams/end-of-stream:142:30)
    at PassThrough.emit (node:events:513:28)
    at emitCloseNT (node:internal/streams/destroy:132:10)
    at process.processTicksAndRejections (node:internal/process/task_queues:81:21)

chrisjrn avatar Nov 08 '22 18:11 chrisjrn

~(it looks like the above is an issue on Mac OS w/ arm64; I'm trying under x86 emulation, and behaviour seems to be more in line with the above)~ Nope, under x86 emulation, I get the same issues, it just takes 6 times longer to get the errors :D

chrisjrn avatar Nov 08 '22 18:11 chrisjrn

(or not; see edited comment)

chrisjrn avatar Nov 08 '22 18:11 chrisjrn

What version of node and yarn do you have installed? I have node v16.14.0 and yarn v1.22.10.

(Note that yarn --version from within the repo will print 3.0.2 because of the config + vendored release, so I had to check from outside the repo to see what version is being used to bootstrap running the vendored version)

danxmoran avatar Nov 08 '22 19:11 danxmoran

chrisjrn@chrisjrns-MacBook-Pro ~ % node --version && yarn --version
v19.0.1
1.22.19

chrisjrn avatar Nov 08 '22 19:11 chrisjrn

Ugh, my guess is the newer node - are you able to downgrade?

danxmoran avatar Nov 08 '22 19:11 danxmoran

I actually upgraded form a reasonably old node (though I didn't check the version) in order to see if node version was the issue.

I'll see if I can use the same version of node as you

chrisjrn avatar Nov 08 '22 19:11 chrisjrn

Yup, this appears to be working now!

chrisjrn avatar Nov 08 '22 19:11 chrisjrn

After some experimentation, I found that there's been a few regressions in experimental_shell_command interplaying here. There's at least one commit on main (be00ed54ca4d68539c5d840bf2381b33603d747a) where the reproduction case runs surprisingly quickly. I'm trying to work out what's regressed since then so that we can get to a point where we can make it really fast (I think we can).

chrisjrn avatar Nov 10 '22 21:11 chrisjrn

So the good news is that the bug specific bug reported here was fixed on main at one point -- you can see that in be00ed5. It's fast, and with work on symlinks, it should be even more so.

The bad news is that d8c0488503d6832fdea6a83beeba3723a84a7ee9 triggered some other bug that is causing consistent cache misses for the yarn install command.

chrisjrn avatar Nov 11 '22 00:11 chrisjrn

(To reproduce on main: ./pants_from_sources -ldebug run :run-yarn-install . Look for:

Error loading process execution result from local cache: Computed digest for Snapshot

)

chrisjrn avatar Nov 11 '22 00:11 chrisjrn

Update: We've got this to a point where on be00ed5, we can get the 3rd build and onwards to run in ~3 seconds by installing node_modules as an immutable_input_digests value, provided you set the pantsd memory limit to 4GiB (which is unusually high, and not entirely explained by the 800k+ files in the node_modules directory).

@stuhood is going to look at #17542 to see if the memory issue has gone away on main before investigating higher memory usage.

chrisjrn avatar Nov 14 '22 20:11 chrisjrn

Checking back in on the current main, the cache miss issues have gone away when immutable_input_digests are used. Memory usage is still quite high, but I can successfully get an "instant" build if the pantsd memory limit is set to 2GiB instead of 4, which is somewhat of an improvement.

chrisjrn avatar Nov 15 '22 16:11 chrisjrn

And just for fun, I'm running on an unadorned main, and I think I'm now prepared to close this issue out:

I'm running time ./pants_from_sources run :run-yarn-install:

First run: ./pants_from_sources run :run-yarn-install 87.75s user 125.52s system 239% cpu 1:29.24 total

Second run: ./pants_from_sources run :run-yarn-install 8.78s user 26.67s system 274% cpu 12.913 total

Third run: ./pants_from_sources run :run-yarn-install 8.73s user 27.10s system 282% cpu 12.681 total

@danxmoran to reproduce, try using PANTS_SHA=0a3d096c5e6c2915274826e96b4076bc3a6dfab7. Let me know if you get similar results!

Based on my tests when immutable_input_digests are used, the third+ run should result in ~3s execution times for :run-yarn-install when #17282 is adequately in place, but my feeling is that the current state of main is probably fine to close out this issue. @stuhood?

chrisjrn avatar Nov 15 '22 17:11 chrisjrn

I think that we should probably still open a ticket about the memory usage, as it is surprising. But yea, this is closer to reasonable.

stuhood avatar Nov 15 '22 17:11 stuhood

@stuhood I agree about the memory usage needing a separate ticket, but IMHO performance even without the immutable inputs is probably tolerable at this point.

chrisjrn avatar Nov 15 '22 18:11 chrisjrn

I think we can close this one. @danxmoran please reopen if you're still experiencing untolerable performance on main.

chrisjrn avatar Nov 22 '22 17:11 chrisjrn