pants
pants copied to clipboard
`experimental_shell_command` sometimes spins for a long time on "Scheduling" step
Describe the bug
I've been trying to plug our front-end build process into Pants in a minimal way so we can include static assets in our container builds. I have an experimental_shell_command
that runs yarn install
as part of this. The command ran once successfully, but e ever since that point when I try to re-run it Pants will spin for awhile on:
Scheduling: Running experimental_shell_command src/projects:yarn-install
When I run with -ldebug
I see:
22:19:57.64 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=3
22:19:58.65 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=4
22:19:59.65 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=5
22:20:00.65 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=6
22:20:01.66 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=7
22:20:02.67 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=8
22:20:03.67 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=9
22:20:04.67 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=10
22:20:05.67 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=11
22:20:06.67 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=12
22:20:07.68 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=13
22:20:08.68 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=14
22:20:09.69 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=15
22:20:10.69 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=16
22:20:11.69 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=17
22:20:12.69 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=18
22:20:13.70 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=19
22:20:14.70 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=20
22:20:15.70 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=21
22:20:16.71 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=22
22:20:17.71 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=23
22:20:18.71 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=24
22:20:19.71 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=25
22:20:20.72 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=26
22:20:21.72 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=27
22:20:22.73 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=28
22:20:23.73 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=29
22:20:24.73 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=30
22:20:25.74 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=31
22:20:26.74 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=32
22:20:27.74 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=33
22:20:28.75 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=34
22:20:29.75 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=35
22:20:30.76 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=36
22:20:31.76 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=37
22:20:32.76 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=38
22:20:33.76 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=39
22:20:34.76 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=40
22:20:35.77 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=41
22:20:36.77 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=42
22:20:37.78 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=43
22:20:38.78 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=44
22:20:39.78 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=45
22:20:40.78 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=46
22:20:41.78 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=47
22:20:42.79 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=48
22:20:43.79 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=49
22:20:44.79 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=50
22:20:45.79 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=51
22:20:46.79 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=52
22:20:47.80 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=53
22:20:48.80 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=54
22:20:49.80 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=55
22:20:50.80 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=56
22:20:51.80 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=57
22:20:52.81 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=58
22:20:53.81 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=59
22:20:54.81 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=60
22:20:55.81 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=61
22:20:56.82 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=62
22:20:57.82 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=63
22:20:58.82 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=64
22:20:59.82 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=65
22:21:00.83 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=66
22:21:01.83 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=67
22:21:02.83 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=68
22:21:03.83 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=69
22:21:04.83 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=70
22:21:05.84 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=71
22:21:06.84 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=72
22:21:07.84 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=73
22:21:08.84 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=74
22:21:09.85 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=75
22:21:10.85 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=76
22:21:11.85 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=77
22:21:12.85 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=78
22:21:13.86 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=79
22:21:14.86 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=80
22:21:15.86 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=81
22:21:16.86 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=82
22:21:17.86 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=83
22:21:18.87 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=84
22:21:19.87 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=85
22:21:20.87 [DEBUG] handle_workunits total=0 completed=0 started=0 finished=False calls=86
The step eventually (after ~200s) finishes and moves on.
Pants version
v2.13.0
OS
MacOS
Additional info
https://app.toolchain.com/organizations/color/repos/color/builds/pants_run_2022_09_12_08_55_13_980_5f4261a10dfe4c219fc7bde55d1031de/
@danxmoran : Would you mind re-running with --streaming-workunits-level=trace
?
Part of the issue appears to be that the relevant process isn't marked level=LogLevel.DEBUG
: https://github.com/pantsbuild/pants/blob/4d9fe544663082c4c5816945bc3ac457c43a153c/src/python/pants/backend/shell/shell_command.py#L225-L234 ... I'll get a change out to fix that.
Opened #16848 : it fixes the display aspect of this issue, but not the fact that capturing inputs/outputs took a long time (...which should become more evident with a trace
-level trace).
From the additional trace
-level trace that you provided offline, it looks like at least two things are happening:
-
eager_fetch_action_cache
is taking a very long time to determine that the cache entry is loadable locally.- I assume that you are running with
cache_content_behavior="fetch"
(the default)? If so, you might try capturing another trace withcache_content_behavior="validate"
to see how much of a difference that makes. If it's significant, we can probably adjust that codepath to unconditionally usevalidate
for a local cache.
- I assume that you are running with
- There is a
setup_sandbox
workunit (which is disconnected from any relevant parent workunit: likely a bug), which could only really be running for so long because it is attempting to materialize a very large set of inputs for a process.- Even after potentially figuring out why the workunit is detached, I don't really see a good workaround for this one today... but see below.
Both of these point to the fact that having tons of small files as inputs/outputs to processes is currently relatively inefficient. Using zips/tars is a workaround, but in the medium term it's possible that we'll want to support using FUSE for these use-cases (see).
- There is a
setup_sandbox
workunit (which is disconnected from any relevant parent workunit: likely a bug), which could only really be running for so long because it is attempting to materialize a very large set of inputs for a process.
I was wrong about #16848, so this is likely actually the root cause of the display issue: the stray setup_sandbox
workunit runs for another ~100s after we've hit the cache for the process.
- I assume that you are running with
cache_content_behavior="fetch"
(the default)? If so, you might try capturing another trace withcache_content_behavior="validate"
to see how much of a difference that makes. If it's significant, we can probably adjust that codepath to unconditionally usevalidate
for a local cache.
@danxmoran confirmed that switching from "fetch" to "validate" dropped the time for the local cache hit by 95% (which is probably at least partially due to #16627: thanks!).
Without changing the semantics of CacheContentBehavior
(because we do still want to guarantee that all content exists locally when "fetch" is set), this suggests that we should optimize the implementation of "fetch" to be more like the implementation of "validate" -- by using exists
checks locally, and then a more explicit method (rather than abusing load_bytes_with
) for content that doesn't exist locally.
2. There is a
setup_sandbox
workunit (which is disconnected from any relevant parent workunit: likely a bug), which could only really be running for so long because it is attempting to materialize a very large set of inputs for a process.
The next step for this one is fuzzier... but it probably involves trying to get a repro in pantsbuild/pants
that I can iterate on with logging.
Without changing the semantics of
CacheContentBehavior
(because we do still want to guarantee that all content exists locally when "fetch" is set), this suggests that we should optimize the implementation of "fetch" to be more like the implementation of "validate" -- by usingexists
checks locally, and then a more explicit method (rather than abusingload_bytes_with
) for content that doesn't exist locally.
Filed #17448 for this one.
I apparently forgot to link it here, but I did set up a public repository that repros the current behavior: https://github.com/danxmoran/pants-yarn-install-repro
@danxmoran I get a bunch of errors of the following form when I try to run :run-yarn-install
, both with ./pants
as well as pants_with_sources
. Is this something you've seen before?
➤ YN0001: │ Error [ERR_STREAM_PREMATURE_CLOSE]: Premature close
at new NodeError (node:internal/errors:393:5)
at PassThrough.onclose (node:internal/streams/end-of-stream:142:30)
at PassThrough.emit (node:events:513:28)
at emitCloseNT (node:internal/streams/destroy:132:10)
at process.processTicksAndRejections (node:internal/process/task_queues:81:21)
~(it looks like the above is an issue on Mac OS w/ arm64; I'm trying under x86 emulation, and behaviour seems to be more in line with the above)~ Nope, under x86 emulation, I get the same issues, it just takes 6 times longer to get the errors :D
(or not; see edited comment)
What version of node
and yarn
do you have installed? I have node
v16.14.0 and yarn
v1.22.10.
(Note that yarn --version
from within the repo will print 3.0.2 because of the config + vendored release, so I had to check from outside the repo to see what version is being used to bootstrap running the vendored version)
chrisjrn@chrisjrns-MacBook-Pro ~ % node --version && yarn --version
v19.0.1
1.22.19
Ugh, my guess is the newer node
- are you able to downgrade?
I actually upgraded form a reasonably old node (though I didn't check the version) in order to see if node version was the issue.
I'll see if I can use the same version of node as you
Yup, this appears to be working now!
After some experimentation, I found that there's been a few regressions in experimental_shell_command
interplaying here. There's at least one commit on main (be00ed54ca4d68539c5d840bf2381b33603d747a) where the reproduction case runs surprisingly quickly. I'm trying to work out what's regressed since then so that we can get to a point where we can make it really fast (I think we can).
So the good news is that the bug specific bug reported here was fixed on main
at one point -- you can see that in be00ed5. It's fast, and with work on symlinks, it should be even more so.
The bad news is that d8c0488503d6832fdea6a83beeba3723a84a7ee9 triggered some other bug that is causing consistent cache misses for the yarn install
command.
(To reproduce on main: ./pants_from_sources -ldebug run :run-yarn-install
. Look for:
Error loading process execution result from local cache: Computed digest for Snapshot
)
Update: We've got this to a point where on be00ed5, we can get the 3rd build and onwards to run in ~3 seconds by installing node_modules
as an immutable_input_digests
value, provided you set the pantsd
memory limit to 4GiB (which is unusually high, and not entirely explained by the 800k+ files in the node_modules
directory).
@stuhood is going to look at #17542 to see if the memory issue has gone away on main
before investigating higher memory usage.
Checking back in on the current main
, the cache miss issues have gone away when immutable_input_digests
are used. Memory usage is still quite high, but I can successfully get an "instant" build if the pantsd
memory limit is set to 2GiB instead of 4, which is somewhat of an improvement.
And just for fun, I'm running on an unadorned main
, and I think I'm now prepared to close this issue out:
I'm running time ./pants_from_sources run :run-yarn-install
:
First run:
./pants_from_sources run :run-yarn-install 87.75s user 125.52s system 239% cpu 1:29.24 total
Second run:
./pants_from_sources run :run-yarn-install 8.78s user 26.67s system 274% cpu 12.913 total
Third run:
./pants_from_sources run :run-yarn-install 8.73s user 27.10s system 282% cpu 12.681 total
@danxmoran to reproduce, try using PANTS_SHA=0a3d096c5e6c2915274826e96b4076bc3a6dfab7
. Let me know if you get similar results!
Based on my tests when immutable_input_digests
are used, the third+ run should result in ~3s execution times for :run-yarn-install
when #17282 is adequately in place, but my feeling is that the current state of main
is probably fine to close out this issue. @stuhood?
I think that we should probably still open a ticket about the memory usage, as it is surprising. But yea, this is closer to reasonable.
@stuhood I agree about the memory usage needing a separate ticket, but IMHO performance even without the immutable inputs is probably tolerable at this point.
I think we can close this one. @danxmoran please reopen if you're still experiencing untolerable performance on main
.