node_modules cache in pipeline misses frequently
Our pipelines have steps to save/restore zips of the node_modules folders so we don't have to re-download packages on every run, but this doesn't seem to be working correctly: PRs that don't have any changes to package.json frequently get a cache miss. We should only need to re-download the packages when the dependencies themselves change.
e.g. https://mssqltools.visualstudio.com/CrossPlatBuildScripts/_build/results?buildId=170113&view=logs&j=c7493abb-a1f4-533f-2d24-71780a69f247&t=5fc456a7-9c5f-500f-2da3-8aee44c87ba8
Investigation of this should start with looking at how the cache key is calculated and how the cache job uses that key (is it per branch?).
Per https://docs.microsoft.com/en-us/azure/devops/pipelines/release/caching?view=azure-devops#cache-isolation-and-security there is some scoping that is per branch, which may be the source of the issues.
But it also mentions that a job building a PR has read access to the caches for the PR's target branch (for the same pipeline) - so as long as the target branch (main) is the same, I would expect it to be able to restore from that. So that's worth investigating - maybe I'm misunderstanding what that means.
@aasimkhan30 was this something you looked into when you made the cache updates a long time back?
Yeah, I had come across this information when I was working with the pipeline cache and hitting a cache miss issue. My issue had nothing to do with this information; it just came to mind when I saw this issue.
However, it feels like we are hitting some kind of scoping issue because there are cache misses for the same key across different PR runs.
Here are 2 different runs with the same fingerprint, both having cache misses:
Fingerprint 1: `nodeModules|Linux|zWU6I00RwbHdNUVQywXM0A9t7ARLDaHzFAY3LAQtUyQ=`
https://mssqltools.visualstudio.com/CrossPlatBuildScripts/_build/results?buildId=169966&view=logs&j=c7493abb-a1f4-533f-2d24-71780a69f247&t=5fc456a7-9c5f-500f-2da3-8aee44c87ba8&l=18
Fingerprint 2: `nodeModules|Linux|zWU6I00RwbHdNUVQywXM0A9t7ARLDaHzFAY3LAQtUyQ=`
https://mssqltools.visualstudio.com/CrossPlatBuildScripts/_build/results?buildId=169966&view=logs&j=c7493abb-a1f4-533f-2d24-71780a69f247&t=5fc456a7-9c5f-500f-2da3-8aee44c87ba8&l=18
From my reading of the documentation, the nodeModules part of the key is treated like a file path; I wonder whether that is intended or not. The cache key is seemingly generated in computeNodeModulesCacheKey, which adds the package.json and yarn.lock files to a SHA hash (via the update function).
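For reference, a minimal sketch of what that key computation presumably looks like - the exact file list here is an assumption, since the real sql-computeNodeModulesCacheKey.ts may also hash manifests from the extension folders, and the output encoding may differ:

```ts
// Hedged sketch of a computeNodeModulesCacheKey-style script: hash the
// dependency manifests so the key only changes when the dependencies change.
import * as crypto from 'crypto';
import * as fs from 'fs';
import * as path from 'path';

function computeNodeModulesCacheKey(repoRoot: string): string {
	const hash = crypto.createHash('sha256');

	// Assumed file list - the real script may include additional
	// package.json/yarn.lock files from extension directories.
	for (const file of ['package.json', 'yarn.lock']) {
		hash.update(fs.readFileSync(path.join(repoRoot, file)));
	}

	return hash.digest('hex');
}

console.log(computeNodeModulesCacheKey(process.cwd()));
```

If the hash input really is just those manifests, then two runs with identical manifests should produce identical fingerprints (as in the examples above), so the misses would have to come from cache scoping rather than from the key itself.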
Also, my interpretation of the document is that caches are entirely isolated from each other and that a job can only access the cache intended for that particular pipeline (they are not supposed to be shared, even for the same pipeline). Implementing a fix for this will be tricky, but that's for later.
https://github.com/microsoft/azure-pipelines-tasks/issues/12901 - this is a report of a similar, if not the same, issue from another Azure Pipelines user.
This comment is relevant: https://github.com/microsoft/azure-pipelines-tasks/issues/12901#issuecomment-628739689
From my investigation of the pipeline, it appears that any time you update an extension package or any file listed under build/npm/dirs.js, the build cache key is reset; this is by design, according to the code. If changes are made that do NOT affect the extension packages or node modules, the cache is preserved and will be reused.
Make sure you're looking at the correct scripts - our pipeline uses sql-product-compile, which uses sql-computeNodeModulesCacheKey.ts. That script does not use npm/dirs.js. The one VS Code uses (which we inherit but don't actually use) does, but that's not relevant for this discussion.
As the above mentions, the main problem seems to be that we have separate pipelines for the Product Build, ad-hoc builds, and PR runs. That's just not something the cache task supports - you can only share cache artifacts with other runs of the same pipeline (subject to the additional scoping rules detailed above).
So to "fix" this so they can more consistently reuse the same cache artifacts (and still using this task) we need to consider combining these all into the same pipeline definition on ADO.
What @aasimkhan30 said above is interesting, but those builds expired (next time make sure you retain them), so we can't really investigate. If we can find another example of that happening, it might indicate an actual bug in the cache task, since multiple runs of the same pipeline should be able to re-use the same cache artifacts as long as the scoping rules are followed.
Yeah, I'm just trying to confirm nothing is malfunctioning under the current design as it is - i.e. the key changes with any changes to the yarn.lock files plus a couple of other files. If nothing is out of the ordinary for the current design, I'll close this issue as Karl suggested I do (specifically, once the nightly builds are getting a cache hit).
We shouldn't close this issue - we haven't fixed anything, and we can make the improvements I suggested above by combining our pipelines into a single ADO pipeline definition so they can actually reuse the cache artifacts.
Alright then, so that'll take some time to figure out how to merge them all.
Alternatively, a simpler approach might be to just add a scheduled run of the PR pipeline that runs daily (or more often) against the main branch. That way all the PRs against main should be able to use the cache artifacts from that build.
We could even consider merging the canary build into the PR build (and rename it to just be Azure Data Studio - Validation or something more generic) since that runs every hour.
This won't change anything for the product build, but we'll have to pull down node_modules at some point whenever there's a change regardless, and from looking at the history of the product build pipeline it does seem to be re-using the cache artifacts as expected.
(Lately we've had a lot of yarn.lock changes so it hasn't been able to restore for a while, but when no changes are made I see it re-using the cache, such as in https://mssqltools.visualstudio.com/CrossPlatBuildScripts/_build/results?buildId=172796&view=logs&j=c7493abb-a1f4-533f-2d24-71780a69f247&t=f5d49255-f229-5ab6-a621-e5b2039a4806)
As a test I ran a PR build against main last night, and it looks like that did work as expected:
https://mssqltools.visualstudio.com/CrossPlatBuildScripts/_build/results?buildId=173685&view=logs&j=c7493abb-a1f4-533f-2d24-71780a69f247&t=5fc456a7-9c5f-500f-2da3-8aee44c87ba8
Used scope: 304;ae14e11c-7eb2-46af-b588-471e6116d635;refs/heads/main;Microsoft/azuredatastudio
Missed on the following scopes: 304;ae14e11c-7eb2-46af-b588-471e6116d635;refs/heads/lewissanchez/collapse-expand-all-props;Microsoft/azuredatastudio
Entry found at fingerprint: `nodeModules|Linux|gMeSZHlbjPkplm0G7b4j/jtJ/FJlKO4dYrz4Qj9lzag=`
Updated the PR pipeline to run against main daily, which should let PR runs re-use that cache (as long as they aren't changing anything in the package.json files).
Further improvements/consolidation can be considered later if failed yarn installs continue to be an issue.