node_modules cache in pipeline misses frequently
Our pipelines have steps to save/restore zips of the node_modules folders so we don't have to re-download packages on every run, but this doesn't seem to be working correctly: PRs that don't have any changes to package.json frequently get a cache miss. We should only need to re-download the packages when the dependencies themselves change.
e.g. https://mssqltools.visualstudio.com/CrossPlatBuildScripts/_build/results?buildId=170113&view=logs&j=c7493abb-a1f4-533f-2d24-71780a69f247&t=5fc456a7-9c5f-500f-2da3-8aee44c87ba8
Investigation of this should start with looking at how the cache key is calculated and how the cache job uses that key (is it per branch?).
Per https://docs.microsoft.com/en-us/azure/devops/pipelines/release/caching?view=azure-devops#cache-isolation-and-security there is some scoping that is per branch, which may be the source of the issues.
But it also mentions that a job building a PR has read access to the caches for the PR's target branch (for the same pipeline) - so as long as the target branch (main) is the same, I would expect it to be able to restore from that. So that's worth investigating - maybe I'm misunderstanding what that means.
@aasimkhan30 was this something you looked into when you made the cache updates a long time back?
Yeah, I had come across this information when I was working with the pipeline cache and hitting a cache miss issue. My issue had nothing to do with this information; it just came to mind when I saw this issue.
However, it feels like we are hitting some kind of scoping issue because there are cache misses for the same key across different PR runs.
Here are 2 different runs with the same fingerprint, both having cache misses:
Fingerprint 1: `nodeModules|Linux|zWU6I00RwbHdNUVQywXM0A9t7ARLDaHzFAY3LAQtUyQ=`
https://mssqltools.visualstudio.com/CrossPlatBuildScripts/_build/results?buildId=169966&view=logs&j=c7493abb-a1f4-533f-2d24-71780a69f247&t=5fc456a7-9c5f-500f-2da3-8aee44c87ba8&l=18
Fingerprint 2: `nodeModules|Linux|zWU6I00RwbHdNUVQywXM0A9t7ARLDaHzFAY3LAQtUyQ=`
https://mssqltools.visualstudio.com/CrossPlatBuildScripts/_build/results?buildId=169966&view=logs&j=c7493abb-a1f4-533f-2d24-71780a69f247&t=5fc456a7-9c5f-500f-2da3-8aee44c87ba8&l=18
From my reading of the documentation, the nodeModules part of the key is treated like a file path; I wonder whether that is intended or not. The cache key is seemingly generated in computeNodeModulesCacheKey, which adds the package.json and yarn.lock files to a SHA hash (via the update function).
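For reference, a minimal sketch of what that key computation presumably looks like - the exact file list here is an assumption, since the real sql-computeNodeModulesCacheKey.ts may also hash manifests from the extension folders, and the output encoding may differ:

```ts
// Hedged sketch of a computeNodeModulesCacheKey-style script: hash the
// dependency manifests so the key only changes when the dependencies change.
import * as crypto from 'crypto';
import * as fs from 'fs';
import * as path from 'path';

function computeNodeModulesCacheKey(repoRoot: string): string {
	const hash = crypto.createHash('sha256');

	// Assumed file list - the real script may include additional
	// package.json/yarn.lock files from extension directories.
	for (const file of ['package.json', 'yarn.lock']) {
		hash.update(fs.readFileSync(path.join(repoRoot, file)));
	}

	return hash.digest('hex');
}

console.log(computeNodeModulesCacheKey(process.cwd()));
```

If the hash input really is just those manifests, then two runs with identical manifests should produce identical fingerprints (as in the examples above), so the misses would have to come from cache scoping rather than from the key itself.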
Also, my interpretation of the document is that caches are entirely isolated from each other and that a job can only access the cache intended for that particular pipeline (they are not supposed to be shared, even for the same pipeline). Implementing a fix for this will be tricky, but that's for later.
https://github.com/microsoft/azure-pipelines-tasks/issues/12901 - this is a report of a similar, if not the same, issue from another Azure Pipelines user.
This comment is relevant: https://github.com/microsoft/azure-pipelines-tasks/issues/12901#issuecomment-628739689
From my investigation of the pipeline, it appears that any time you update an extension package or any file listed under build/npm/dirs.js, the build cache key is reset; this is by design, according to the code. If changes are made that do NOT affect the extension packages or node modules, the cache is preserved and will be reused.
Make sure you're looking at the correct scripts - our pipeline uses sql-product-compile, which uses sql-computeNodeModulesCacheKey.ts. That script does not use npm/dirs.js. The one VS Code uses (which we inherit but don't actually use) does, but that's not relevant for this discussion.
As the above mentions, the main problem seems to be that we have separate pipelines for the Product Build, ad-hoc builds, and PR runs. That's just not something the cache task supports - you can only share cache artifacts with other runs of the same pipeline (subject to the additional scoping rules detailed above).
So to "fix" this so they can more consistently reuse the same cache artifacts (and still using this task) we need to consider combining these all into the same pipeline definition on ADO.
What @aasimkhan30 said above is interesting, but those builds expired (next time make sure you retain them), so we can't really investigate. If we can find another example of that happening, it might indicate an actual bug in the cache task, since multiple runs of the same pipeline should be able to re-use the same cache artifacts as long as the scoping rules are followed.
Yeah, I'm just trying to confirm nothing is malfunctioning under the current design as it is - i.e. the key changes with any changes to the yarn.lock files plus a couple of other files. If nothing is out of the ordinary for the current design, I'll close this issue as Karl suggested I do (specifically, once the nightly builds are getting a cache hit).
We shouldn't close this issue - we haven't fixed anything, and we can make the improvements I suggested above by combining our pipelines into a single ADO pipeline definition so they can actually reuse the cache artifacts.
Alright then, so that'll take some time to figure out how to merge them all.
Alternatively, a simpler approach might be to just add a scheduled run of the PR pipeline that runs daily (or more often) against the main branch. That way all the PRs against main should be able to use the cache artifacts from that build.
We could even consider merging the canary build into the PR build (and rename it to just be Azure Data Studio - Validation or something more generic) since that runs every hour.
This won't change anything for the product build, but we'll have to pull down node_modules at some point whenever there's a change regardless, and from looking at the history of the product build pipeline it does seem to be re-using the cache artifacts as expected.
(Lately we've had a lot of yarn.lock changes so it hasn't been able to restore for a while, but when no changes are made I see it re-using the cache, such as in https://mssqltools.visualstudio.com/CrossPlatBuildScripts/_build/results?buildId=172796&view=logs&j=c7493abb-a1f4-533f-2d24-71780a69f247&t=f5d49255-f229-5ab6-a621-e5b2039a4806)
As a test I ran a PR build against main last night, and it looks like that did work as expected:
https://mssqltools.visualstudio.com/CrossPlatBuildScripts/_build/results?buildId=173685&view=logs&j=c7493abb-a1f4-533f-2d24-71780a69f247&t=5fc456a7-9c5f-500f-2da3-8aee44c87ba8
Used scope: 304;ae14e11c-7eb2-46af-b588-471e6116d635;refs/heads/main;Microsoft/azuredatastudio
Missed on the following scopes: 304;ae14e11c-7eb2-46af-b588-471e6116d635;refs/heads/lewissanchez/collapse-expand-all-props;Microsoft/azuredatastudio
Entry found at fingerprint: `nodeModules|Linux|gMeSZHlbjPkplm0G7b4j/jtJ/FJlKO4dYrz4Qj9lzag=`
Updated the PR pipeline to run against main daily, which should let PR runs re-use that cache (as long as they aren't changing anything in the package.json files).
Further improvements/consolidation can be considered later if failed yarn installs continue to be an issue.