nextflow icon indicating copy to clipboard operation
nextflow copied to clipboard

Dirwatch deeper

Open robsyme opened this issue 2 years ago • 8 comments

A PR to address https://github.com/nextflow-io/nextflow/issues/2721.

Three main changes to DirWatcher class:

Allow more complex paths The first (and motivating) change was to increase the flexibility of the Channel.watchPath pattern to allow users to specify known directory names after the glob. That is, paths like projects/*/outputs/*.done

Don't close the channel early The second change is to change the behaviour of the watchPath channel closing. By default if any of the watched directories are deleted, the Channel closes. In the example above, if the folder projects/human is deleted, then the channel closes, and any files in projects/mouse/outputs/*.done will be missed. The changed includes a check to only close the channel if the base folder is removed (projects/).

Don't accidentally miss files The last change is to ensure that if a file is quickly created in a newly created directory, these files are added to the Channel, even if the file is created before Nextflow has had a change to attach a listener object to that directory. In essence, if Nextflow is scanning a newly created directory (looking for child dirs to register watchers) and comes across a file that matches the pattern, it is assumed to be a newly created file and emitted in the channel.

robsyme avatar Mar 15 '22 17:03 robsyme

I see it now, I think it's going be the expected functionality. Thinking also it would not be better to handle multiple pattern at script level and then mixing resulting channel into a single one

pditommaso avatar Mar 15 '22 17:03 pditommaso

I thought about the mixing for my particular use case, but of course not everybody will know ahead of time what those directories might be. The glob should now handle them all neatly.

robsyme avatar Mar 15 '22 18:03 robsyme

I've noticed that the ordering and timing of events on the tests is important and I might need to make some changes. I'd recommend holding off review for the moment while I push some changes.

robsyme avatar Mar 15 '22 21:03 robsyme

This should be ok now.

In fixing the PR, I noticed another small edge case worth considering, and this PR also includes that fix.

The problem:

Registering watchers is not instantaneous, which can lead to the DirWatcher missing newly created files. If a series of directories and a file is created in quick succession:

mkdir -p test/a/b
touch test/a/b/one.txt

... and we have a watcher looking at test/*/*/*.txt, then it is possible that one.txt is created before the DirWatcher has time to register a listener on the b/ directory.

When a new directory is created, DirWatcher traverses down the tree looking for directories to register using a SimpleFileVisitor. Previously, the visitor only considered for new directories (matching the pattern) to register via the preVisitDirectory method. Even though the directories being traversed are brand new, the SimpleFileVisitor did not consider any files there. If the directory is newly created, any files in that directory must also be newly created and should be emitted by the channel.

The fix:

I added a visitFile method to the SimpleFileVisitor that checks for new files that match the pattern and emits them iff watchEvents.contains(ENTRY_CREATE).

The PR also includes a unit test for this behvior.

robsyme avatar Mar 20 '22 14:03 robsyme

Hi Rob, I could not look into this before. However, one important thing is that the DirWatcher logic has been re-implemented using a different strategy to support file system for which notify events were not available.

https://github.com/nextflow-io/nextflow/blob/0af1bcb32d3dc8e01b14b7999b1d9f480a05b62a/modules/nextflow/src/main/groovy/nextflow/file/DirWatcherV2.groovy#L43-L43

Therefore this PR should be updated

pditommaso avatar Jun 19 '22 20:06 pditommaso

No problem. I'm away from my desk this week (in Australia) but will make the updates next week.

On Mon., Jun. 20, 2022, 04:16 Paolo Di Tommaso, @.***> wrote:

Hi Rob, I could not look into this before. However, one important thing is that the DirWatcher logic has been re-implemented using a different strategy to support file system for which notify events were not available.

https://github.com/nextflow-io/nextflow/blob/0af1bcb32d3dc8e01b14b7999b1d9f480a05b62a/modules/nextflow/src/main/groovy/nextflow/file/DirWatcherV2.groovy#L43-L43

Therefore this PR should be updated

— Reply to this email directly, view it on GitHub https://github.com/nextflow-io/nextflow/pull/2729#issuecomment-1159804193, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAHWWHTDOYID7663KJCHPLVP553TANCNFSM5QZMLCIA . You are receiving this because you authored the thread.Message ID: @.***>

robsyme avatar Jun 19 '22 23:06 robsyme

The new polling-based system used by DirWatcherV2 address the previous problem without any changes.

This PR includes the changes to the original DirWatcher class, and also extra methods for DirWatcherV2Test that check the behaviour that was missing from DirWatcher, but of course these are now totally optional.

If you'd prefer to avoid changing DirWatcher and/or if you think that the extra tests are unnecessary, that's ok - let me know and I can delete the PR and branch.

robsyme avatar Jul 12 '22 17:07 robsyme

The tradeoff for DirWatcherV2 is that the apache FileAlterationMonitor has to scan the directories before reporting events. In directories with a large number of files, this can mean long delays before events are captured. For those users who know they are operating on a regular filesystem (no need for the polling) with a large number of files (and don't want the slow startup), it might be nice to include the changes to DirWatcher which is available under the NXF_DIRWATCHER_LEGACY env flag.

robsyme avatar Jul 12 '22 19:07 robsyme

Rob, thanks for this, but I believe it's preferable to keep the current basic syntax. It may be simple to support other storage backends in the future.

pditommaso avatar Mar 10 '23 22:03 pditommaso