jest icon indicating copy to clipboard operation
jest copied to clipboard

cache write race condition across processes

Open bookman25 opened this issue 7 years ago • 122 comments

Do you want to request a feature or report a bug? bug

What is the current behavior? When running tests in parallel with the new atomic cache writing, we're getting rename errors as multiple processes try to write to the same files. Even with --no-cache option set it's still hitting rename errors because it's still trying to write to the files.

What is the expected behavior?

  1. I think that --no-cache should not write cache files
  2. Caching across multiple processes should not collide, or should be able to restart the test.

Please provide your exact Jest configuration and mention your Jest, node, yarn/npm version and operating system.

{
	"clearMocks": true,
	"collectCoverageFrom": [
		"packages/**/src/**/*.{ts,tsx}",
		"!packages/sf-lint/**",
		"!**/*.d.ts"
	],
	"coverageReporters": [
		"text-summary"
	],
	"moduleFileExtensions": [
		"ts",
		"tsx",
		"js",
		"json"
	],
	"setupTestFrameworkScriptFile": "<rootDir>/jestEnvironment.js",
	"transform": {
		"\\.(ts|tsx)$": "<rootDir>/scripts/preprocessTypescript.js",
		"\\.(less|css|svg|gif|png|jpg|jpeg)$": "<rootDir>/scripts/preprocessText.js"
	},
	"testRegex": "(Spec|.spec).tsx?$"
}

jest 21.0.1 node 6.9.2 yarn 0.27.x/1.0.0 OS Windows

bookman25 avatar Sep 07 '17 14:09 bookman25

Isn't this related? https://github.com/facebook/jest/pull/4432

thymikee avatar Sep 11 '17 21:09 thymikee

I don't believe so. I believe the case we see in our repo is the exact same file getting mocked for 2 different processes (while running in parallel) which causes the cache write operation to fail because one process has the file locked. That ticket looks like is more about different files with the same contents. We don't have any of those cases in the repositories we host that we ran into this issue.

bookman25 avatar Sep 11 '17 23:09 bookman25

We basically run into the same issue with our tests. One easy way to reproduce was to remove jest cacheDirectory to force cache generation on the next run.


    jest: failed to cache transform results in: 
C:/myniceproject/src/jest-cache/jest-transform-cache-b2e8f1f700b9bd266a0d27bb01b47a2b-34a7e3d71d38ff01f65fdb5abdf5126b/3f/settingsProvider_3f1439e55275a95ecfdb7dcb432f7958
   Failure message: EPERM: operation not permitted, rename 
'C:\myniceproject\src\jest-cache\jest-transform-cache-b2e8f1f700b9bd266a0d27bb01b47a2b-34a7e3d71d38ff01f65fdb5abdf5126b\3f\settingsProvider_3f1439e55275a95ecfdb7dcb432f7958.1630729137' 
-> 
'C:\myniceproject\src\jest-cache\jest-transform-cache-b2e8f1f700b9bd266a0d27bb01b47a2b-34a7e3d71d38ff01f65fdb5abdf5126b\3f\settingsProvider_3f1439e55275a95ecfdb7dcb432f7958'`

cristiango avatar Sep 21 '17 15:09 cristiango

Having the same issue and can't find a way around it. Jest is basically unusable for us like this.

DavidKuennen avatar Sep 25 '17 07:09 DavidKuennen

We are trying to update to 21.2.0 from 20.0.4 and now we have the following error on our build servers:

Test suite failed to run
[13:46:50]
[13:46:50]    jest: failed to cache transform results in: C:/TeamCity/buildAgent/temp/buildTmp/jest/jest-transform-cache-c60bb8ad55f1dbc29115038800657f2f-4895fc34da3cd9ad1e120af622bca745/3b/fe-selectors_3b56db772e798e2e4d0c9fc7c9e4a770
[13:46:50]    Failure message: EPERM: operation not permitted, rename '...\jest\jest-transform-cache-c60bb8ad55f1dbc29115038800657f2f-4895fc34da3cd9ad1e120af622bca745\3b\fe-selectors_3b56db772e798e2e4d0c9fc7c9e4a770.1701848979' -> '...\jest\jest-transform-cache-c60bb8ad55f1dbc29115038800657f2f-4895fc34da3cd9ad1e120af622bca745\3b\fe-selectors_3b56db772e798e2e4d0c9fc7c9e4a770'
[13:46:50]      
[13:46:50]      at Object.fs.renameSync (fs.js:772:18)
[13:46:50]      at Function.writeFileSync [as sync] (node_modules/write-file-atomic/index.js:192:8)

vkrol avatar Sep 27 '17 10:09 vkrol

I'm now having the same issue tests are broken randomly.

apphelpyulongs avatar Sep 27 '17 18:09 apphelpyulongs

If I run the tests with the --runInBand flag then as expected everything is OK.

vkrol avatar Sep 29 '17 16:09 vkrol

I can see the same issue fairly consistently:

  ● Test suite failed to run

    jest: failed to cache transform results in: .../jest-transform-cache-...
    Failure message: EPERM: operation not permitted, rename '...' -> '...'
        at Error (native)

      at Object.fs.renameSync (fs.js:810:18)
      at Function.writeFileSync [as sync] (node_modules/write-file-atomic/index.js:192:8)

jest 21.2.1 node 6.11.1 OS Windows

--no-cache does not help and jest-transform-cache is still being written. The only thing that helps is --runInBand, which is hardly acceptable for large projects.

Anything we can do to help diagnose the issue? Should I create a repro case?

Is this error critical? Can it be treated as a warning rather than taking down the whole test suite? Is there a way to back off and retry?

asapach avatar Oct 07 '17 19:10 asapach

Having a small repro would be great

SimenB avatar Oct 07 '17 19:10 SimenB

Here's the repro: https://github.com/asapach/jest-cache-issue/ It effectively runs lodash-es through babel-jest to populate the transform cache. This fails for me 80% of the time on two different machines (Win8.1 and Win10). If you remove --no-cache it fails 100% of the time. Adding --runInBand brings it down to 0%.

(Out of curiosity tried running it in WSL on Win10 and the issue is not reproducible using Posix API)

asapach avatar Oct 07 '17 22:10 asapach

Is this just happening on Windows? I don't have access to windows machines beyond virtual machines, so not the easiest to debug for me...

@jeanlauliac you added write-file-atomic in #4088, would you be able to help out?

SimenB avatar Oct 08 '17 10:10 SimenB

This problem is very similar to https://github.com/npm/write-file-atomic/issues/10 and https://github.com/npm/write-file-atomic/pull/22.

vkrol avatar Oct 08 '17 10:10 vkrol

Just ran a procmon trace, here's an example of the issue:

Time of Day Process Name PID Operation Path Result Detail
16:54:43.2304011 node.exe 7624 SetRenameInformationFile ...\constant_ee286bbcf367680ce61a90e506522f92.82986661 SUCCESS ReplaceIfExists: True, FileName: ...\constant_ee286bbcf367680ce61a90e506522f92
16:54:43.2305499 node.exe 8208 SetRenameInformationFile ...\constant_ee286bbcf367680ce61a90e506522f92.103872574 ACCESS DENIED ReplaceIfExists: True, FileName: ...\constant_ee286bbcf367680ce61a90e506522f92

As you can see two processes are trying rename the same file within 1ms of each other and the second one fails.

I think npm/write-file-atomic#22 addresses the async version of writeFile(), but writeFileSync() is still affected.

asapach avatar Oct 08 '17 14:10 asapach

Would it be possible to create a repro showing that just using write-file-atomic in worker-farm against the same file fails somehow? Would be great to open an issue against that project, as I think that's where the fix should be.

Or if you could write a test within jest that shows the same error (we have appveyor CI) that could be a start as well?

I'm not even sure what behavior we want in case of this error. Retry the write? Rerun the test? The whole test file?

SimenB avatar Oct 08 '17 14:10 SimenB

OK, I'll try to create another repro. Not sure it's possible to create a jest test, because it would require spawning multiple processes, disabling/cleaning the cache and keep running until it fails.

I'm not even sure what behavior we want in case of this error.

Well, firstly the issue should not even happen when --no-cache is on, since the cache should not be populated. Secondly, I'm not sure it's possible to retry the sync operation properly - is it possible to use writeFile() instead of writeFileSync()? That way write-file-atomic should retry automatically (I'll create a test to confirm).

asapach avatar Oct 08 '17 14:10 asapach

Well, firstly the issue should not even happen when --no-cache is on, since the cache should not be populated.

That's a good point, and should be fixed separately. That way --no-cache can at least be a workaround.

Secondly, I'm not sure it's possible to retry the sync operation properly - is it possible to use writeFile() instead of writeFileSync()?

@cpojer thoughts on making it not be sync? Not sure how that scales. Or if you have another idea on how to fix this

SimenB avatar Oct 08 '17 15:10 SimenB

  • --no-cache is more like --reset-cache actually. It means it won't use the existing cache, but it will still write cache. I'd like to retain that.
  • These operations have to be sync, because they happen during require calls in user code, so we can't change that.

cpojer avatar Oct 08 '17 17:10 cpojer

Here's the other repro with worker-farm and write-file-atomic: https://github.com/asapach/write-atomic-issue

Findings so far: the sync version fails as expected, but surprisingly the async version fails as well. This means that they probably implement a retry queue only when it runs in the same process, which doesn't help in our case.

asapach avatar Oct 08 '17 17:10 asapach

I'd like to retain that.

New flag? It's a highly misleading name. And on e.g. CI you rarely want the cache anyways, so it's just wasted resources. Or is a cache generated within a single test run used during --no-cache, and it only ignores existent caches?

Here's the other repro with worker-farm and write-file-atomic

Awesome! Could you open up an issue against write-file-atomic? It feels like a fix should go there, and if not (they don't want to support multiple processes writing at once) we can revisit on our side. WDYT?

SimenB avatar Oct 08 '17 18:10 SimenB

A patch I tried locally that seemed to work is ignoring the error if it comes from trying to rename to a file with the same content. Since it just means another process 'won' the race, we can ignore it and move on.

const cacheWriteErrorSafeToIgnore = (
  e: Error,
  cachePath: Path,
  fileData: string,
) => {
  if (
    e.message &&
    e.message.indexOf('EPERM: operation not permitted, rename') > -1
  ) {
    try {
      const currentContent = fs.readFileSync(cachePath, 'utf8');
      return fileData === currentContent;
    } catch (e) {
    }
  }
  return false;
};

jwbay avatar Oct 08 '17 18:10 jwbay

@SimenB, sure, I'll file an issue.

@cpojer, can this error be swallowed/ignored and treated as a warning? It implies that the file has already been written and no data should be lost.

asapach avatar Oct 08 '17 18:10 asapach

Upstream issue: npm/write-file-atomic#28

asapach avatar Oct 08 '17 18:10 asapach

I think this means "rename" is not an atomic operation on Windows, so it breaks the assumption made by write-file-atomic. Unless there's a flag that could be enabled at the Windows API level, this could mean it's impossible to have atomic writes/renames on Windows altogether.

@jwbay your solution looks reasonable to me! 👍 Instead of using indexOf however, I'd use e.code === 'EPERM' (more robust, doesn't depend on specific message). I don't think we should read the file again to check the value, because this could introduce additional concurrency issues (ex. if the file is being written by yet another process at the same time). Would you mind sending a PR, please?

jeanlauliac avatar Oct 09 '17 09:10 jeanlauliac

I was about to start work on a PR for write-file-atomic along the lines of "if we're asked to write a file sync but it's already in the queue to be written async, bail out" (maybe with an option to switch the behaviour on). But if we're happy to handle this at the Jest level, I won't hurry. cc @jeanlauliac

davidjgoss avatar Oct 09 '17 10:10 davidjgoss

I was about to start work on a PR for write-file-atomic along the lines of "if we're asked to write a file sync but it's already in the queue to be written async, bail out" (maybe with an option to switch the behaviour on).

I think adding this logic (local queue) wouldn't fix the issue, because it happens mostly when different processes try to write to (rename to) the same file.

To fix concurrency issues once and for all, we may have to reconsider how we do caching, for example have a single process that access the cache, with which we communicate over some kind of IPC. Existing key/value store systems may be handy, such as memcached.

jeanlauliac avatar Oct 09 '17 10:10 jeanlauliac

I think adding this logic (local queue) wouldn't fix the issue, because it happens mostly when different processes try to write to (rename to) the same file.

Ah, I perhaps misunderstood the issue then. The way I read it, the library already has a queuing mechanism that works nicely for the async requests, but if you mix in sync requests as well you can get collisions.

davidjgoss avatar Oct 09 '17 19:10 davidjgoss

My above referenced pull request should solve this issue. At least it did it for me!

mekwall avatar Oct 10 '17 14:10 mekwall

@mekwall, I think they are using rename() in the async version of writeFile(), and it still fails in my test: https://github.com/asapach/write-atomic-issue. Could you please try running my repro? I think your change might minimize likelihood of this problem happening, but does not eliminate it completely.

asapach avatar Oct 10 '17 14:10 asapach

@asapach Did you try with my changes? Because I tried several times, and I never got EPERM: operation not permitted, rename with my changes while getting it every time without.

mekwall avatar Oct 10 '17 15:10 mekwall

@mekwall, yep, still failing with your changes (although async-ly). (Corrected below.)

asapach avatar Oct 10 '17 15:10 asapach