rushstack icon indicating copy to clipboard operation
rushstack copied to clipboard

[rush] non-deterministic builds

Open benkeen opened this issue 10 months ago • 16 comments

Summary

We're getting non-deterministic builds on our CI.

We have a multi-stage build where Azure ensures the exact commit hash so separate stages always get exactly the same source. We have a preliminary stage doing a rush build of our library packages which populates the remote cache, then ~10 secondary stages that repeat that step, followed by whatever they're building (tests, linting etc).

Originally we could rely on the second runs of the rush builds would pull 100% from cache, but now cache misses are extremely common. I haven't been able to detect a pattern.

Repro steps

I don't have a reproduction besides our own setup. But for MS employees, please reach out to me at [email protected] and I'd be happy to get you access to see the problem.

Details

We recently updated pnpm to v9 and Rush to 5.149.1. However, this was actually issue we're seen from the very introduction of using Rush some 2 years ago. It used to be extremely rare, now it's very common. The problem may well be with pnpm, not Rush, but I'm not sure how to debug.

Standard questions

Question Answer
@microsoft/rush globally installed version? 5.149.1
rushVersion from rush.json? 5.149.1
useWorkspaces from rush.json? false
Operating system? Linux
Would you consider contributing a PR? Yes
Node.js version (node -v)? v18.20.4

benkeen avatar Feb 21 '25 19:02 benkeen

We're seeing a huge increase in build times because of this, btw. About ~~20~~ 15 minutes longer on average.

benkeen avatar Feb 26 '25 22:02 benkeen

I'm going to try downgrading to [email protected] (our prev version) and I'll report back here. Timeline-wise that looks like the likeliest cause.

benkeen avatar Feb 26 '25 23:02 benkeen

Yup, looks like the Rush update was the cause. It's only been a day and a half, but our pipeline times have been reduced by around 15 minutes (P80 values). I'll post again if that changes.

This'll re-introduce a jsonpath-plus CVE but we can live with that for a bit.

benkeen avatar Feb 28 '25 19:02 benkeen

Thanks for reporting this. We'll take a look at the recent changes and see if we can determine what might have caused a regression. If you have any ideas, let us know.

octogonz avatar Feb 28 '25 20:02 octogonz

Will do, thanks @octogonz! And if there's any info I can provide to help debug, let me know. For my part, I'd love Rush to output a file for all the "math" that goes into computing the cache hash for each task (perhaps it already exists). I could generate that on every agent, which I could diff to pinpoint the discrepancy.

benkeen avatar Feb 28 '25 20:02 benkeen

I found that the .rush/shrinkwrap-deps.json file had different content between runs due to peer deps being slightly different on a couple of dependencies. I'd screenshot it, but they're internal packages. Neither the package nor the peer deps are anything we've overridden or specified in our .pnpmfile.js, common-version.jsons, pnpm-config.json files - nothing like that.

Seems to me that's a pnpm issue, not Rush. I'll try an update to pnpm 10.

benkeen avatar Mar 19 '25 05:03 benkeen

Updating to [email protected] didn't make a difference.

cc @dmichon-msft

benkeen avatar Mar 19 '25 06:03 benkeen

In the end I simply patched the various oddities that were causing the non-determinism, and I'll set up a pipeline to run weekly to run back-to-back builds to watch for cache misses. Not ideal, but it's workable.

I'm also curious why the Rush update made things significantly worse, but I'll attempt another update soon. Probably the same cause though.

Think we can probably close this now - doesn't seem to be a Rush problem.

benkeen avatar Mar 19 '25 18:03 benkeen

I thought I'd just re-open this. I keep patching this issue here & there, but it's getting more and more hacky and more problems keep sneaking back in.

I created a separate pipeline to do back-to-back builds to examine the problem, and it looks like around 1 in 3 runs generate different shrinkwrap files.

Here's an example:

On one run it generates an shrinkwrap entry for @some/[email protected] like so:

"@some/[email protected](@types/[email protected](@types/[email protected]))(@types/[email protected])([email protected]([email protected]))([email protected])([email protected])": "sha1-P8d1s1CpURzMQorZKPHUS1n4Wmc=",

Then another run generates:

"@some/[email protected](@fluentui/[email protected](@types/[email protected](@types/[email protected]))(@types/[email protected])([email protected]([email protected]))([email protected])([email protected]))(@lexical/[email protected]([email protected]))(@types/[email protected](@types/[email protected]))(@types/[email protected])([email protected])([email protected]([email protected]))([email protected])([email protected])": "sha1-P8d1s1CpURzMQorZKPHUS1n4Wmc=",

Things to note:

  • note the hash is the same in both cases.
  • only the first one actually exists in our pnpm-lock.yaml file.
  • @fluentui/react-combobox is NOT a direct peer dep of @some/package, interestingly enough.

From the above, do you think this is a problem with Rush or pnpm? Does Rush calculate the shrinkwrap file contents or is the work for that offloaded to pnpm?

[[email protected], [email protected]]

benkeen avatar May 05 '25 19:05 benkeen

Hey @dmichon-msft, @octogonz I found the immediate cause of this - just not the root cause.

On our CI we do a rush install to bootstrap. Internally, I see it uses the postInstallAsync hook to generate the project shrinkwrap files. That uses the temporary lockfile generated by the bootstrap phase, not our own checked-in lock file. Those two files can sometimes be different, which is what's causing the differences in the shrinkwrap files.

As a temporary workaround I'm experimenting with running a rush update immediately following the install on our CI. I think that'll regenerate the shrinkwrap files based on our lockfile content and smooth out any differences, but let's see.

I haven't dug in any further than this, but our repo does reliably show this issue (1 in every 3 runs) so if I can provide any more info that could help you pinpoint the problem, do let me know.

benkeen avatar May 06 '25 14:05 benkeen

Running a second rush update following the rush install seems to work great. Haven't had a single inconsistent run yet.

benkeen avatar May 06 '25 20:05 benkeen

@benkeen Does "usePnpmFrozenLockfileForRushInstall": true, in experiments.json help here at all? We're seeing very similar behavior on our end - 1-2 out of every 10 agents goes off on a weird build plan and haven't been able to debug anything effectively.

aramissennyeydd avatar May 08 '25 16:05 aramissennyeydd

Hey @aramissennyeydd - sorry I missed your post. I haven't tried it, but it would require a bit of setup to give it a go and test. Did it help on on your end?

I wasn't clear from the doc exactly what purpose that option serves for Rush - whether it impacts of our committed lockfile, or the temporary one generated by Rush's bootstrap. Also, looks like pnpm doesn't have a no-prefer-frozen-lockfile arg, presumably replaced in favour of --frozen-lockfile in more recent versions?

benkeen avatar May 12 '25 15:05 benkeen

@benkeen I tried it out but didn't see any impact, I'd hope it would impact the temporary lockfile but it seems like it doesn't help.

aramissennyeydd avatar May 22 '25 15:05 aramissennyeydd

After a bit more looking last week, I think the issue is https://github.com/aramissennyeydd/rushstack/blob/dc458cd58dc30b0009b47467b39f0e974a9ef5ae/libraries/rush-lib/src/logic/installManager/WorkspaceInstallManager.ts#L654-L661 which updates the shrinkwrap files after install. It relies on the temp shrinwrap instead of the committed one -- with a comment about filtered installs which makes sense for that case -- but I believe it should generally use the committed shrinkwrap file and fail if they don't match in CI.

We're exploring using phased installs to speed up our CI runs rn which I think has the added benefit of not having a temp shrinkwrap file for installing. I'll report back if that helps.

aramissennyeydd avatar May 27 '25 16:05 aramissennyeydd

@benkeen Quick update, we enabled the frozen lockfile experiment and are actually seeing next to no divergent build graphs / non-deterministic builds. It operates on the temp lockfile which should help when it's written back to the shrinkwrap file. In fact, it's caught a few cases that Rush itself doesn't especially around moving a dependency from a prod -> dev dep.

aramissennyeydd avatar Jun 11 '25 22:06 aramissennyeydd