node icon indicating copy to clipboard operation
node copied to clipboard

Node 20 recursive file watching is misbehaving on Linux

Open cjihrig opened this issue 2 years ago • 8 comments

Version

20.3.0

Platform

linux

Subsystem

fs

What steps will reproduce the bug?

I put together a repro at https://github.com/cjihrig/recursive-watcher-bug that shows the following passing on Windows and macOS, but failing on Ubuntu. I'm not sure if this is specific to GitHub Actions.

'use strict';
const { mkdtempSync, watch, writeFileSync } = require('node:fs');
const { tmpdir } = require('node:os');
const { join } = require('node:path');
const tmpDir = mkdtempSync(join(tmpdir(), 'repro-test-'));
const filename = join(tmpDir, 'test.file');
const keepalive = setTimeout(() => {
  throw new Error('timed out');
}, 60_000);

const watcher = watch(tmpDir, { recursive: true }, (eventType, filename) => {
  clearTimeout(keepalive);
  watcher.close();
  console.log(eventType, filename);
});

writeFileSync(filename, 'foobar');

How often does it reproduce? Is there a required condition?

Always reproduces for me.

What is the expected behavior? Why is that the expected behavior?

I expect the test to pass.

What do you see instead?

The test times out.

Additional information

I noticed this while trying to update Platformatic to support Node 20 and created the minimal reproduction linked above.

I also noticed that Platformatic was passing recursive: true to the promisified version of watch() on Ubuntu on earlier versions of Node. It should have thrown ERR_FEATURE_UNAVAILABLE_ON_PLATFORM, but did not. I did see that error with the callback based watch() though, which makes me think there is some missing validation on older versions of Node in addition to this bug.

cc: @anonrig who implemented recursive file watching on Linux.

cjihrig avatar Jun 12 '23 17:06 cjihrig

Does the issue resolve if you add a small timeout before writing file and after creating the watcher? I suspect the async nature of fs which does polling causes this issue.

anonrig avatar Jun 12 '23 18:06 anonrig

Looks like that did help - https://github.com/cjihrig/recursive-watcher-bug/actions/runs/5247088842/jobs/9476792611.

I don't think we can realistically ask every user of recursive file watching to add that to their code though.

cjihrig avatar Jun 12 '23 18:06 cjihrig

Looks like that did help - https://github.com/cjihrig/recursive-watcher-bug/actions/runs/5247088842/jobs/9476792611.

I don't think we can realistically ask every user of recursive file watching to add that to their code though.

@cjihrig I agree. It seems the issue is within this lines: https://github.com/nodejs/node/blob/main/lib/internal/fs/recursive_watch.js#L222

We can avoid using promisify, but I'm suspicious towards blocking the main thread while this is resolving. Unfortunately, I'm on parental leave and have limited access. If this task is not urgent, I can take a look at it in a week and a half.

anonrig avatar Jun 12 '23 18:06 anonrig

Can I take on this issue?

winedarkmoon avatar Jun 17 '23 17:06 winedarkmoon

Can I take on this issue?

Please do. Happy to help if you get stuck on something.

anonrig avatar Jun 17 '23 19:06 anonrig

Can I take on this issue?

Please do. Happy to help if you get stuck on something.

So the traverse function doesn't complete its execution before the files start to be watched, thus causing a race condition? Would introducing a mechanism to 'await' the initialization of the watcher could resolve this issue?

winedarkmoon avatar Jun 19 '23 18:06 winedarkmoon

So the traverse function doesn't complete its execution before the files start to be watched, thus causing a race condition? Would introducing a mechanism to 'await' the initialization of the watcher could resolve this issue?

That would be a breaking change since it will cause the function to behave differently in different operating systems.

anonrig avatar Jun 19 '23 19:06 anonrig

So the traverse function doesn't complete its execution before the files start to be watched, thus causing a race condition? Would introducing a mechanism to 'await' the initialization of the watcher could resolve this issue?

That would be a breaking change since it will cause the function to behave differently in different operating systems.

Understood. Would it be feasible to implement a flag/event that is set once the traverse function has completed? This way, we could delay file watching until after the initial traversal?

winedarkmoon avatar Jun 19 '23 20:06 winedarkmoon

Looks like that did help - https://github.com/cjihrig/recursive-watcher-bug/actions/runs/5247088842/jobs/9476792611. I don't think we can realistically ask every user of recursive file watching to add that to their code though.

@cjihrig I agree. It seems the issue is within this lines: https://github.com/nodejs/node/blob/main/lib/internal/fs/recursive_watch.js#L222

We can avoid using promisify, but I'm suspicious towards blocking the main thread while this is resolving. Unfortunately, I'm on parental leave and have limited access. If this task is not urgent, I can take a look at it in a week and a half.

Looks like that did help - https://github.com/cjihrig/recursive-watcher-bug/actions/runs/5247088842/jobs/9476792611. I don't think we can realistically ask every user of recursive file watching to add that to their code though.

@cjihrig I agree. It seems the issue is within this lines: https://github.com/nodejs/node/blob/main/lib/internal/fs/recursive_watch.js#L222

We can avoid using promisify, but I'm suspicious towards blocking the main thread while this is resolving. Unfortunately, I'm on parental leave and have limited access. If this task is not urgent, I can take a look at it in a week and a half.

So the traverse function doesn't complete its execution before the files start to be watched, thus causing a race condition? Would introducing a mechanism to 'await' the initialization of the watcher could resolve this issue?

That would be a breaking change since it will cause the function to behave differently in different operating systems.

Regarding the race condition issue in the code you shared, I've analyzed it and come up with a solution. The problem arises from the traverse function not completing its execution before the files start being watched, causing a race condition.

I propose the following solution to address this issue:

`async function traverse(dir, files = new SafeMap(), symbolicLinks = new SafeSet()) { const { opendir } = lazyLoadFsPromises();

const filenames = await opendir(dir); const subdirectories = [];

for await (const file of filenames) { const f = pathJoin(dir, file.name);

files.set(f, file);

// Do not follow symbolic links
if (file.isSymbolicLink()) {
  symbolicLinks.add(f);
} else if (file.isDirectory()) {
  subdirectories.push(traverse(f, files)); // Store the promise for each subdirectory
}

}

await Promise.all(subdirectories); // Wait for all subdirectory promises to resolve

return files; }

// ...

async kFSWatchStart { filename = pathResolve(getValidatedPath(filename));

try { const file = lazyLoadFsSync().statSync(filename);

this.#rootPath = filename;
this.#closed = false;
this.#watchingFile = file.isFile();

if (file.isDirectory()) {
  this.#files.set(filename, file);

  await traverse(filename, this.#files, this.#symbolicFiles); // Await the completion of the traverse function

  for (const f of this.#files.keys()) {
    this.#watchFile(f);
  }
} else {
  this.#watchFile(filename);
}

} catch (error) { if (error.code === 'ENOENT') { error.filename = filename; throw error; } } }`

Here are the changes made and their explanations:

1.Modified the traverse function to be an async function and added the await keyword before the recursive call to traverse for subdirectories. This ensures that the function waits for the subdirectory traversal to complete before moving on, resolving the race condition.

2.In the [kFSWatchStart] method, added the await keyword before calling traverse for the initial directory. This ensures that the function waits for the traversal to complete before proceeding with watching the files.

spacesugam avatar Jun 23 '23 19:06 spacesugam

@spacesugam are you still interested in sending a PR?

I would go for the first option.

mcollina avatar Jul 26 '23 16:07 mcollina

@anonrig @cjihrig I took another look at this issue, and there is a race condition between traversing the file system asynchronously and watch. Ultimately we are missing an event, i.e. "the watch has been successfully set up". The alternative is to make traversing the tree synchronous on Linux, so that it can all be set up correctly.

This is not a problem on Windows or Mac because they have different low-level primitives for watching

mcollina avatar Oct 18 '23 12:10 mcollina

The only solution to this bug is to move the exploration of the directory tree to synchronous on Linux. @anonrig do you see any specific issue with that?

mcollina avatar Dec 26 '23 09:12 mcollina

Quite possibly related, calling fs.watch() in recursive mode with an already aborted signal does not prevent StatWatchers from being created and referenced. Same with synchronously closing the returned watcher.

novemberborn avatar Jan 02 '24 21:01 novemberborn

The only solution to this bug is to move the exploration of the directory tree to synchronous on Linux. @anonrig do you see any specific issue with that?

Looks good to me!

Quite possibly related, calling fs.watch() in recursive mode with an already aborted signal does not prevent StatWatchers from being created and referenced. Same with synchronously closing the returned watcher.

@novemberborn Can you create a different issue on this, with a possible reproduction please?

anonrig avatar Jan 02 '24 21:01 anonrig