ctcache
ctcache copied to clipboard
Frequent stats.lock contention
Hi,
Since trying to move to the latest version of ctcache (with the stats added to local-mode) we're getting frequent errors below. Looks to be heavy contention on the stats file when running parallel builds. Interestingly these errors don't cause any builds failures... are the exceptions being ignored and the build has actually not completed or are these errors benign and should therefore be turned into warnings instead of errors?
Timeout (3 seconds) exceeded while acquiring lock.
Traceback (most recent call last):
File "/ctcache/dac994e/clang-tidy-cache", line 436, in acquire
self.lock_handle = os.open(self.lockfile, os.O_CREAT | os.O_EXCL)
FileExistsError: [Errno 17] File exists: '/external/ctcache/stats.lock'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/ctcache/dac994e/clang-tidy-cache", line 486, in update_stats
hits, misses = self.read_stats()
File "/ctcache/dac994e/clang-tidy-cache", line 473, in read_stats
with FileLock(self.stats_file() + ".lock") as _:
File "/ctcache/dac994e/clang-tidy-cache", line 460, in __enter__
return self.acquire()
File "/ctcache/dac994e/clang-tidy-cache", line 443, in acquire
raise RuntimeError(msg)
I also found this issue in #35 where lock file is not closed before unlink, so the file still exists and is locked.
I've added code that closes the file lock handle before unlinking it from the filesystem, can you please test if that solves your issue?
Issue still persists.
Timeout (3 seconds) exceeded while acquiring lock.
Traceback (most recent call last):
File "/ctcache/ef992ef/clang-tidy-cache", line 436, in acquire
self._lock_handle = os.open(self._lock_path, os.O_CREAT | os.O_EXCL)
FileExistsError: [Errno 17] File exists: '/external/ctcache/stats.lock'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/ctcache/ef992ef/clang-tidy-cache", line 487, in update_stats
hits, misses = self.read_stats()
File "/ctcache/ef992ef/clang-tidy-cache", line 474, in read_stats
with FileLock(self.stats_file() + ".lock") as _:
File "/ctcache/ef992ef/clang-tidy-cache", line 461, in __enter__
return self.acquire()
File "/ctcache/ef992ef/clang-tidy-cache", line 443, in acquire
raise RuntimeError(msg)
Root cause should be this:
Traceback (most recent call last):
File "C:\Users\pika1\code\OrangeDrive\ci-tools\ctcache\clang-tidy-cache", line 490, in update_stats
hits, misses = self.read_stats()
^^^^^^^^^^^^^^^^^
File "C:\Users\pika1\code\OrangeDrive\ci-tools\ctcache\clang-tidy-cache", line 477, in read_stats
with FileLock(self.stats_file() + ".lock") as _:
File "C:\Users\pika1\code\OrangeDrive\ci-tools\ctcache\clang-tidy-cache", line 467, in __exit__
self.release()
File "C:\Users\pika1\code\OrangeDrive\ci-tools\ctcache\clang-tidy-cache", line 456, in release
self._lock_handle.close()
^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'int' object has no attribute 'close'
https://github.com/matus-chochlik/ctcache/blob/ef992efb098292053683ac0a3e678d06f05111b8/clang-tidy-cache#L453
I think you should close the file by os.close(self._lock_handle)
just pushed a new commit that uses os.close
did the new changes help?
No, still getting errors. The other thing that is very worrying is that I'm seeing all these errors in the log but the compiler doesn't exit with an error. Is this intended? Are we 100% certain the compilation actually continued after this exception, because if not we should fail the compilation?
Timeout (3 seconds) exceeded while acquiring lock.
Traceback (most recent call last):
File "/ctcache/b2aa2b3/clang-tidy-cache", line 436, in acquire
self._lock_handle = os.open(self._lock_path, os.O_CREAT | os.O_EXCL)
FileExistsError: [Errno 17] File exists: '/external/5ll/ctcache/stats.lock'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/ctcache/b2aa2b3/clang-tidy-cache", line 487, in update_stats
hits, misses = self.read_stats()
File "/ctcache/b2aa2b3/clang-tidy-cache", line 474, in read_stats
with FileLock(self.stats_file() + ".lock") as _:
File "/ctcache/b2aa2b3/clang-tidy-cache", line 461, in __enter__
return self.acquire()
File "/ctcache/b2aa2b3/clang-tidy-cache", line 443, in acquire
raise RuntimeError(msg)
RuntimeError: Timeout (3 seconds) exceeded while acquiring lock.
Are we sure there's a hard logic error here? We don't get this error for every file, only when the build is highly parallel which would imply a contention issue rather than a bug per se? Looking at the code...
- When update_stats is called, it first calls read_stats, which locks the stats file, reads the stats and then unlocks the stats file. The update method then locks the stats files again and updates. This could be refactored to all happen within a single lock to reduce contention?
- The
os.path.isfile
conditional in update_stats could be moved outside the file lock to reduce time spent when the stats file is locked? - Could the code be refactored such that stats are accumulated in memory and only written to the stats file when the compilation is completed? Or maybe a better compromise is that we try to write to the stats file every update but suppress errors in general and only if the last stats file update of the compilation errors do we throw an exception?
okay, I'll have a closer look at this, probably during the weekend (if someone else doesn't come up with a PR sooner)
Did you remove the lock file before running new version? I recall I had to remove it to get it working though.
So far I wasn't able to reproduce it locally but I added an env. variable that disables the local stats:
export CTCACHE_NO_LOCAL_STATS=1
In case anyone is following this thread, a few PRs were merged that could help to reduce the contention.
Thanks for the ping @bartdesmet but unfortunately even with the latest changes we are still seeing issues.
Traceback (most recent call last):
File "/ctcache/a9095dc/clang-tidy-cache", line 492, in acquire
self._lock_handle = os.open(self._lock_path, os.O_CREAT | os.O_EXCL)
FileExistsError: [Errno 17] File exists: '/remote_cachell/ctcache/stats.lock'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/ctcache/a9095dc/clang-tidy-cache", line 563, in update
with MultiprocessLock(self.stats_file() + ".lock") as _:
File "/ctcache/a9095dc/clang-tidy-cache", line 518, in __enter__
return self.acquire()
File "/ctcache/a9095dc/clang-tidy-cache", line 498, in acquire
raise RuntimeError(msg)
RuntimeError: Timeout (3 seconds) exceeded while acquiring lock.
My guess is that /remote_cachell
is on some remote share?
My guess is that
/remote_cachell
is on some remote share?
Yes that's correct. However I would note that we use this remote storage for other caches (e.g. ccache) and have not noted any slowdown or contention.
My guess is that
/remote_cachell
is on some remote share?Yes that's correct. However I would note that we use this remote storage for other caches (e.g. ccache) and have not noted any slowdown or contention.
The reason I asked is because the lock is based on O_EXCL and that may be troublesome over NFS (which I assume your remote share is using).
I've pushed another PR in https://github.com/matus-chochlik/ctcache/pull/56 which uses a different approach to break down the lock into smaller ones.
We don't use a remote share but when a build had a lot of cache hits and a lot of other file system access going on at the same time, we occassionally see the contention. That's for a machine with 40 cores and using ninja, so we got 40 clang-tidy processes running concurrently.
With the sharding of the lock, I haven't seen it repro yet. So maybe you can give ~~that PR~~ latest main
a try as well?