nextflow
nextflow copied to clipboard
Dirwatch deeper
A PR to address https://github.com/nextflow-io/nextflow/issues/2721.
Three main changes to DirWatcher
class:
Allow more complex paths
The first (and motivating) change was to increase the flexibility of the Channel.watchPath
pattern to allow users to specify known directory names after the glob. That is, paths like projects/*/outputs/*.done
Don't close the channel early
The second change is to change the behaviour of the watchPath channel closing. By default if any of the watched directories are deleted, the Channel closes. In the example above, if the folder projects/human
is deleted, then the channel closes, and any files in projects/mouse/outputs/*.done
will be missed. The changed includes a check to only close the channel if the base folder is removed (projects/
).
Don't accidentally miss files The last change is to ensure that if a file is quickly created in a newly created directory, these files are added to the Channel, even if the file is created before Nextflow has had a change to attach a listener object to that directory. In essence, if Nextflow is scanning a newly created directory (looking for child dirs to register watchers) and comes across a file that matches the pattern, it is assumed to be a newly created file and emitted in the channel.
I see it now, I think it's going be the expected functionality. Thinking also it would not be better to handle multiple pattern at script level and then mix
ing resulting channel into a single one
I thought about the mixing for my particular use case, but of course not everybody will know ahead of time what those directories might be. The glob should now handle them all neatly.
I've noticed that the ordering and timing of events on the tests is important and I might need to make some changes. I'd recommend holding off review for the moment while I push some changes.
This should be ok now.
In fixing the PR, I noticed another small edge case worth considering, and this PR also includes that fix.
The problem:
Registering watchers is not instantaneous, which can lead to the DirWatcher
missing newly created files. If a series of directories and a file is created in quick succession:
mkdir -p test/a/b
touch test/a/b/one.txt
... and we have a watcher looking at test/*/*/*.txt
, then it is possible that one.txt
is created before the DirWatcher has time to register a listener on the b/
directory.
When a new directory is created, DirWatcher traverses down the tree looking for directories to register using a SimpleFileVisitor
. Previously, the visitor only considered for new directories (matching the pattern) to register via the preVisitDirectory
method. Even though the directories being traversed are brand new, the SimpleFileVisitor
did not consider any files there. If the directory is newly created, any files in that directory must also be newly created and should be emitted by the channel.
The fix:
I added a visitFile
method to the SimpleFileVisitor
that checks for new files that match the pattern and emits them iff watchEvents.contains(ENTRY_CREATE)
.
The PR also includes a unit test for this behvior.
Hi Rob, I could not look into this before. However, one important thing is that the DirWatcher logic has been re-implemented using a different strategy to support file system for which notify events were not available.
https://github.com/nextflow-io/nextflow/blob/0af1bcb32d3dc8e01b14b7999b1d9f480a05b62a/modules/nextflow/src/main/groovy/nextflow/file/DirWatcherV2.groovy#L43-L43
Therefore this PR should be updated
No problem. I'm away from my desk this week (in Australia) but will make the updates next week.
On Mon., Jun. 20, 2022, 04:16 Paolo Di Tommaso, @.***> wrote:
Hi Rob, I could not look into this before. However, one important thing is that the DirWatcher logic has been re-implemented using a different strategy to support file system for which notify events were not available.
https://github.com/nextflow-io/nextflow/blob/0af1bcb32d3dc8e01b14b7999b1d9f480a05b62a/modules/nextflow/src/main/groovy/nextflow/file/DirWatcherV2.groovy#L43-L43
Therefore this PR should be updated
— Reply to this email directly, view it on GitHub https://github.com/nextflow-io/nextflow/pull/2729#issuecomment-1159804193, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAHWWHTDOYID7663KJCHPLVP553TANCNFSM5QZMLCIA . You are receiving this because you authored the thread.Message ID: @.***>
The new polling-based system used by DirWatcherV2 address the previous problem without any changes.
This PR includes the changes to the original DirWatcher class, and also extra methods for DirWatcherV2Test that check the behaviour that was missing from DirWatcher, but of course these are now totally optional.
If you'd prefer to avoid changing DirWatcher and/or if you think that the extra tests are unnecessary, that's ok - let me know and I can delete the PR and branch.
The tradeoff for DirWatcherV2
is that the apache FileAlterationMonitor
has to scan the directories before reporting events. In directories with a large number of files, this can mean long delays before events are captured. For those users who know they are operating on a regular filesystem (no need for the polling) with a large number of files (and don't want the slow startup), it might be nice to include the changes to DirWatcher
which is available under the NXF_DIRWATCHER_LEGACY
env flag.
Rob, thanks for this, but I believe it's preferable to keep the current basic syntax. It may be simple to support other storage backends in the future.