`FileCache` crashes often when files to be cached are small
I have a scenario where the files to be cached are quite small because they only contain the transcription text. The FileCache often crashes here during startup. For now, I mainly want to report this error.
I think this is because we asynchronously update the lock files' mtimes in a separate thread. There is a race condition when the file cache removes a lock file during a run of the mtime-update thread: the update thread then tries to do its work, but fails because the lock file has already been removed in the meantime (because the copy operation has finished).
This copy, taken under the GIL, is part of the problem; it is where the thread obtains its local snapshot of the list of files to work on: https://github.com/rwth-i6/returnn/blob/6a2bf2b13778ac9d4755da34fc8437e25d97e380/returnn/util/file_cache.py#L534
When the list of files is then modified while the thread is updating the mtimes, the modification is not reflected in the sub-thread's snapshot (which was the original intention of the copy).
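A minimal illustration of the race (standalone sketch, not RETURNN code): the touch thread works on a snapshot of the file list, so a removal done by the main thread afterwards is not visible in that snapshot, and touching the stale entry then fails.

```python
import os
import tempfile

files = ["a.lock", "b.lock"]

snapshot = files.copy()  # touch thread: take a thread-local snapshot

files.remove("a.lock")  # main thread: copy finished, entry removed

assert "a.lock" in snapshot  # the stale entry survives in the snapshot
assert "a.lock" not in files

# If the lock file itself was also deleted, os.utime() on the stale
# entry raises FileNotFoundError -- the crash in the log below.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "a.lock")
    open(path, "w").close()
    os.remove(path)  # main thread deletes the lock file on disk
    try:
        os.utime(path, None)  # touch thread still tries to touch it
        raised = False
    except FileNotFoundError:
        raised = True
assert raised
```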
Maybe we should count the number of failures in the mtime-update thread and only raise an exception when the update fails for the same file multiple times. In the benign case, the file would be removed from the update thread's list before the next update, so the thread would not retry (and would swallow the first error).
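A sketch of that failure-counting idea (all names here are hypothetical, not the actual RETURNN API): tolerate the first failure for a file, and only raise if the same file fails again in a later round.

```python
import os

def touch_files_once(files, fail_counts, max_failures=2):
    """One round of a hypothetical mtime-update loop: transient
    FileNotFoundErrors are tolerated; only a repeated failure for the
    same file (i.e. it was not removed from the list in between) is
    re-raised."""
    for filename in list(files):  # snapshot, as in the real thread
        try:
            os.utime(filename, None)
            fail_counts.pop(filename, None)  # success resets the count
        except FileNotFoundError:
            fail_counts[filename] = fail_counts.get(filename, 0) + 1
            if fail_counts[filename] >= max_failures:
                raise  # persistent failure: the file really is gone
```

In the benign race, the main thread removes the file from the list before the next round, so the swallowed first error is never escalated.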
Log:

```
FileCache: Copy file /home/mgunz/setups/2025-03-07--ws-asr/work/apptek_asr/hdf/create/BlissToTextHdfJob.DkiTGhrM4EpU/output/corpus.hdf.030 to cache
FileCache: Copy file /home/mgunz/setups/2025-03-07--ws-asr/work/apptek_asr/hdf/create/BlissToTextHdfJob.DkiTGhrM4EpU/output/corpus.hdf.031 to cache
FileCache: failed updating mtime of /ssd/mgunz/returnn/file_cache/home/mgunz/setups/2025-03-07--ws-asr/work/apptek_asr/hdf/create/BlissToTextHdfJob.DkiTGhrM4EpU/output/corpus.hdf.030.returnn-lock: [Errno 2] No such file or directory
Unhandled exception in thread <_TouchFilesThread(Thread-1, started daemon 140181047551552)>, going to interrupt main thread:
EXCEPTION
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in Thread._bootstrap_inner
    line: self.run()
    locals:
      self = <local> <_TouchFilesThread(Thread-1, started daemon 140181047551552)>
      self.run = <local> <bound method _TouchFilesThread.run of <_TouchFilesThread(Thread-1, started daemon 140181047551552)>>
  File "/home/mgunz/setups/2025-03-07--ws-asr/recipe/returnn/returnn/util/file_cache.py", line 539, in _TouchFilesThread.run
    line: os.utime(filename, None)
    locals:
      os = <global> <module 'os' from '/usr/lib/python3.10/os.py'>
      os.utime = <global> <built-in function utime>
      filename = <local> '/ssd/mgunz/returnn/file_cache/home/mgunz/setups/2025-03-07--ws-asr/work/apptek_asr/hdf/create/BlissToTextHdfJob.DkiTGhrM4EpU/output/corpus.hdf.030.returnn-lock', len = 159
FileNotFoundError: [Errno 2] No such file or directory
KeyboardInterrupt
Unhandled exception <class 'KeyboardInterrupt'> in thread <_MainThread(MainThread, started 140190866928768)>, proc 4165609.
```
> There is a race condition when the file cache removes the lock file during a run of the mtime update thread.
I'm not sure I understand. So you mean:
- Main thread, in `FileCopy._copy_file_if_needed`: within `with LockFile(directory=lock_dir, name=lock_file, lock_timeout=self._lock_timeout) as lock, self._touch_files_thread.files_added_context(lock.lockfile):`. So this is about this `lock.lockfile` here, right? And then within `FileCopy.cleanup`? But within that function, it would not be possible to delete that `lock.lockfile`:
  - It does not delete any lockfiles while iterating through all files.
  - When deleting other files, it tries to create a new lockfile. If that is the same as `lock.lockfile` from the outer scope: that should fail because this file already exists. Or do you say that it removes it via `LockFile.maybe_remove_old_lockfile`? But if so, you should see it in the log ("Removing old lockfile"). If you don't see this, you should see "FileCache: lock for {fn} is currently held, skipping.". If you also don't see this, it means `FileCopy.cleanup` is not deleting any related files (as we actually also would expect).

  Or do you mean when exiting this `with` scope, i.e. it calls `_TouchFilesThread.files_remove` on the `lock.lockfile`?
  - Then `LockFile.__exit__` calls `LockFile.unlock`, and that will delete the lock file.
- `_TouchFilesThread`: tries to update the mtime (it did the `self.files.copy()` before the main thread called `_TouchFilesThread.files_remove`, but tries to update the mtime after the main thread deleted the file).
If that is the case, the problem is very simple to fix, or not? There are multiple ways to fix this. We could add some Lock to _TouchFilesThread. Some details would need to be sorted out, e.g. should we use a lock per file, or a global lock for the whole _TouchFilesThread? Per file is maybe more efficient. (Note, I think a threading.Lock in Python is very cheap: I think this is a spinlock, just using the underlying PyMutex here.) In _TouchFilesThread.run, it should be held while handling one of the files (and also all parent dirs), and at the same time, in _TouchFilesThread.files_remove.
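A sketch of the global-lock variant (simplified, with hypothetical names; the real `_TouchFilesThread` differs): the lock is held per file in `run()` and also in `files_remove()`, so a caller returning from `files_remove()` knows the thread is not mid-touch on that file and may then safely delete it.

```python
import os
import threading

class TouchFilesThreadSketch(threading.Thread):
    """Simplified sketch, not the actual RETURNN implementation."""

    def __init__(self, interval: float = 1.0):
        super().__init__(daemon=True)
        self._files = set()
        self._lock = threading.Lock()  # global; could also be per-file
        self._stop = threading.Event()
        self._interval = interval

    def files_add(self, filename: str):
        with self._lock:
            self._files.add(filename)

    def files_remove(self, filename: str):
        # After this returns, run() cannot be mid-touch on this file,
        # so the caller may safely delete the file on disk.
        with self._lock:
            self._files.discard(filename)

    def stop(self):
        self._stop.set()

    def run(self):
        while not self._stop.wait(self._interval):
            for filename in list(self._files):
                # Hold the lock only per file, not across the whole
                # loop, so files_add/files_remove never wait long.
                with self._lock:
                    if filename in self._files:  # re-check under lock
                        os.utime(filename, None)
```

A per-file lock would be finer-grained, but then the mapping from file to lock itself becomes shared mutable state; the single lock avoids that, at the cost of serializing all touches.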
> If that is the case
Yes, this is the case I'm thinking about. I think for simplicity I'll use a single, global lock first (otherwise we have a list of locks, which again causes the same locking issues one level up: we would need a lock to protect the list of locks?).
Maybe, instead of using a lock, it is also possible to simply re-check the list of files for whether the file is still supposed to be updated when encountering a `FileNotFoundError`, but I doubt this is really correct in all cases. I think there can still be race conditions if e.g. a file is added to the touch-files thread and removed from it multiple times in quick succession.
If we notice the single, global lock is too slow, I am going to follow up with another PR/commit.
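For comparison, the re-check-on-`FileNotFoundError` alternative mentioned above could look like this (hypothetical names, not the actual RETURNN code). As noted, it is not fully race-free: if a file is removed and re-added in quick succession, the re-check can see it as registered even though the error came from the old registration.

```python
import os

def touch_snapshot_with_recheck(snapshot, live_files):
    """Touch every file in the snapshot; on FileNotFoundError, only
    re-raise if the file is still in the live list."""
    for filename in snapshot:
        try:
            os.utime(filename, None)
        except FileNotFoundError:
            if filename in live_files:
                raise  # still registered: unexpected, escalate
            # else: removed concurrently -- the benign race, ignore
```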
> I think for simplicity I'll use a single, global lock first
But this could cause some hangs. The list of files could be large, and the main thread should not block when adding something there. I would avoid that.
> (otherwise we have a list of locks, which again causes the same locking issues one level up -> would need a lock to protect the list of locks?)
No, I don't think this is needed.
> If we notice the single, global lock is too slow I am going to follow-up with another PR/commit.
I think profiling/debugging this is way more complicated than just directly doing it the right way. It's not really so complicated.
Btw, how often does this problem occur? How many people are affected by it? I thought many people have already been using DistributeFilesDataset/FileCache for a while, and since the last fixes on this, everything was working well?
I also don't exactly understand the conditions which trigger this bug. You said it's about having quite small files. But why would that trigger it? Maybe the actual trigger condition is that you have a lot of files?
> Btw, how often does this problem occur? How many people are affected by this? I thought many people were already using DistributeFilesDataset/FileCache since a while, and since the last fixes on this, everything was working well?
I don't think anybody has used DFD/FileCache with transcription-only files that are very small yet. AppTek uses DFD on audio-feature HDFs (large) and alignment HDFs (not large, but larger than transcription-only HDFs).
> I also don't exactly understand the conditions which trigger this bug here. You said it's about having quite small files. But why would this trigger it?
I don't understand why you don't understand it. I think your comment/question
> `_TouchFilesThread`: Tries to update the mtime (it did the `self.files.copy()` before the main thread called `_TouchFilesThread.files_remove`, but tries to update the mtime after the main thread deleted the file).
explains it quite well. Of course the number of files also plays a role here: the duration of the loop depends on the number of files and on the time it takes to update each one, and the longer the loop takes, the more likely the error becomes. The files, however, also need to be small-ish, I think, because that causes the lockfiles to be created and deleted frequently: they are only alive during the copy operation, which does not take long for small files.