
Frequent stats.lock contention

Open andrewkcorcoran opened this issue 1 year ago • 15 comments

Hi,

Since trying to move to the latest version of ctcache (with the stats added to local mode) we're getting the frequent errors below. It looks to be heavy contention on the stats file when running parallel builds. Interestingly, these errors don't cause any build failures... are the exceptions being ignored while the build has actually not completed, or are these errors benign and should therefore be turned into warnings instead of errors?

Timeout (3 seconds) exceeded while acquiring lock.
Traceback (most recent call last):
  File "/ctcache/dac994e/clang-tidy-cache", line 436, in acquire
    self.lock_handle = os.open(self.lockfile, os.O_CREAT | os.O_EXCL)
FileExistsError: [Errno 17] File exists: '/external/ctcache/stats.lock'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ctcache/dac994e/clang-tidy-cache", line 486, in update_stats
    hits, misses = self.read_stats()
  File "/ctcache/dac994e/clang-tidy-cache", line 473, in read_stats
    with FileLock(self.stats_file() + ".lock") as _:
  File "/ctcache/dac994e/clang-tidy-cache", line 460, in __enter__
    return self.acquire()
  File "/ctcache/dac994e/clang-tidy-cache", line 443, in acquire
    raise RuntimeError(msg)

andrewkcorcoran avatar Jan 08 '24 11:01 andrewkcorcoran

I also found this issue in #35, where the lock file is not closed before unlink, so the file still exists and is locked.

solarispika avatar Jan 09 '24 03:01 solarispika

I've added code that closes the file lock handle before unlinking it from the filesystem. Can you please test whether that solves your issue?

matus-chochlik avatar Jan 09 '24 05:01 matus-chochlik

Issue still persists.

Timeout (3 seconds) exceeded while acquiring lock.
Traceback (most recent call last):
  File "/ctcache/ef992ef/clang-tidy-cache", line 436, in acquire
    self._lock_handle = os.open(self._lock_path, os.O_CREAT | os.O_EXCL)
FileExistsError: [Errno 17] File exists: '/external/ctcache/stats.lock'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ctcache/ef992ef/clang-tidy-cache", line 487, in update_stats
    hits, misses = self.read_stats()
  File "/ctcache/ef992ef/clang-tidy-cache", line 474, in read_stats
    with FileLock(self.stats_file() + ".lock") as _:
  File "/ctcache/ef992ef/clang-tidy-cache", line 461, in __enter__
    return self.acquire()
  File "/ctcache/ef992ef/clang-tidy-cache", line 443, in acquire
    raise RuntimeError(msg)

andrewkcorcoran avatar Jan 09 '24 08:01 andrewkcorcoran

Root cause should be this:

Traceback (most recent call last):
  File "C:\Users\pika1\code\OrangeDrive\ci-tools\ctcache\clang-tidy-cache", line 490, in update_stats
    hits, misses = self.read_stats()
                   ^^^^^^^^^^^^^^^^^
  File "C:\Users\pika1\code\OrangeDrive\ci-tools\ctcache\clang-tidy-cache", line 477, in read_stats
    with FileLock(self.stats_file() + ".lock") as _:
  File "C:\Users\pika1\code\OrangeDrive\ci-tools\ctcache\clang-tidy-cache", line 467, in __exit__
    self.release()
  File "C:\Users\pika1\code\OrangeDrive\ci-tools\ctcache\clang-tidy-cache", line 456, in release
    self._lock_handle.close()
    ^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'int' object has no attribute 'close'

https://github.com/matus-chochlik/ctcache/blob/ef992efb098292053683ac0a3e678d06f05111b8/clang-tidy-cache#L453 I think you should close the file with os.close(self._lock_handle).
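
For reference, a minimal sketch of how the lock could look with that fix; the class and attribute names follow the tracebacks above, the timeout and retry details are just my guesses, not the actual ctcache code:

import os
import time

class FileLock:
    def __init__(self, lock_path, timeout=3.0):
        self._lock_path = lock_path
        self._timeout = timeout
        self._lock_handle = None

    def acquire(self):
        deadline = time.time() + self._timeout
        while True:
            try:
                # os.open returns an integer file descriptor, not a file
                # object, so it has no .close() method
                self._lock_handle = os.open(self._lock_path, os.O_CREAT | os.O_EXCL)
                return self
            except FileExistsError:
                if time.time() >= deadline:
                    raise RuntimeError(
                        "Timeout (%d seconds) exceeded while acquiring lock."
                        % self._timeout)
                time.sleep(0.05)

    def release(self):
        if self._lock_handle is not None:
            os.close(self._lock_handle)  # close the descriptor first ...
            self._lock_handle = None
            os.unlink(self._lock_path)   # ... then remove the lock file

    def __enter__(self):
        return self.acquire()

    def __exit__(self, exc_type, exc_value, traceback):
        self.release()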

solarispika avatar Jan 09 '24 10:01 solarispika

just pushed a new commit that uses os.close

matus-chochlik avatar Jan 09 '24 16:01 matus-chochlik

did the new changes help?

matus-chochlik avatar Jan 11 '24 05:01 matus-chochlik

No, still getting errors. The other thing that is very worrying is that I'm seeing all these errors in the log, but the compiler doesn't exit with an error. Is this intended? Are we 100% certain the compilation actually continued after this exception? Because if not, we should fail the compilation.

Timeout (3 seconds) exceeded while acquiring lock.
Traceback (most recent call last):
  File "/ctcache/b2aa2b3/clang-tidy-cache", line 436, in acquire
    self._lock_handle = os.open(self._lock_path, os.O_CREAT | os.O_EXCL)
FileExistsError: [Errno 17] File exists: '/external/5ll/ctcache/stats.lock'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ctcache/b2aa2b3/clang-tidy-cache", line 487, in update_stats
    hits, misses = self.read_stats()
  File "/ctcache/b2aa2b3/clang-tidy-cache", line 474, in read_stats
    with FileLock(self.stats_file() + ".lock") as _:
  File "/ctcache/b2aa2b3/clang-tidy-cache", line 461, in __enter__
    return self.acquire()
  File "/ctcache/b2aa2b3/clang-tidy-cache", line 443, in acquire
    raise RuntimeError(msg)
RuntimeError: Timeout (3 seconds) exceeded while acquiring lock.

Are we sure there's a hard logic error here? We don't get this error for every file, only when the build is highly parallel, which would imply a contention issue rather than a bug per se. Looking at the code...

  1. When update_stats is called, it first calls read_stats, which locks the stats file, reads the stats and then unlocks it. The update method then locks the stats file again and updates it. Could this be refactored so it all happens within a single lock, to reduce contention? (Rough sketch after this list.)
  2. Could the os.path.isfile conditional in update_stats be moved outside the file lock, to reduce the time spent holding the stats lock?
  3. Could the code be refactored so that stats are accumulated in memory and only written to the stats file when the compilation is completed? Or maybe a better compromise: try to write to the stats file on every update but suppress errors in general, and only throw an exception if the last stats file update of the compilation fails?
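
A rough sketch of what I mean in 1., reusing the FileLock context manager from the tracebacks; the "hits misses" file format and the update_stats signature are assumptions on my part, not the actual ctcache layout:

import os

def update_stats(self, is_hit):
    # Single lock around the whole read-modify-write, instead of locking
    # once in read_stats and then again for the update.
    with FileLock(self.stats_file() + ".lock"):
        hits, misses = 0, 0
        if os.path.isfile(self.stats_file()):
            with open(self.stats_file(), "r") as stats:
                # assumed format: two integers, "hits misses"
                hits, misses = (int(v) for v in stats.read().split())
        if is_hit:
            hits += 1
        else:
            misses += 1
        with open(self.stats_file(), "w") as stats:
            stats.write("%d %d" % (hits, misses))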

andrewkcorcoran avatar Jan 11 '24 05:01 andrewkcorcoran

okay, I'll have a closer look at this, probably during the weekend (if someone else doesn't come up with a PR sooner)

matus-chochlik avatar Jan 11 '24 06:01 matus-chochlik

Did you remove the lock file before running the new version? I recall I had to remove it to get it working.

solarispika avatar Jan 11 '24 12:01 solarispika

So far I wasn't able to reproduce it locally, but I added an environment variable that disables the local stats:

export CTCACHE_NO_LOCAL_STATS=1
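
When it's set, the local stats file (and therefore the stats.lock handling) is skipped. The exact check in clang-tidy-cache may differ; roughly, the intent is:

import os

def local_stats_enabled():
    # Simplified sketch, the real check may differ: any value of
    # CTCACHE_NO_LOCAL_STATS skips the local stats file and its lock.
    return os.getenv("CTCACHE_NO_LOCAL_STATS") is None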

matus-chochlik avatar Jan 13 '24 09:01 matus-chochlik

In case anyone is following this thread, a few PRs were merged that could help reduce the contention.

bartdesmet avatar Apr 12 '24 01:04 bartdesmet

Thanks for the ping @bartdesmet but unfortunately even with the latest changes we are still seeing issues.

Traceback (most recent call last):
  File "/ctcache/a9095dc/clang-tidy-cache", line 492, in acquire
    self._lock_handle = os.open(self._lock_path, os.O_CREAT | os.O_EXCL)
FileExistsError: [Errno 17] File exists: '/remote_cachell/ctcache/stats.lock'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ctcache/a9095dc/clang-tidy-cache", line 563, in update
    with MultiprocessLock(self.stats_file() + ".lock") as _:
  File "/ctcache/a9095dc/clang-tidy-cache", line 518, in __enter__
    return self.acquire()
  File "/ctcache/a9095dc/clang-tidy-cache", line 498, in acquire
    raise RuntimeError(msg)
RuntimeError: Timeout (3 seconds) exceeded while acquiring lock.

andrewkcorcoran avatar Apr 12 '24 04:04 andrewkcorcoran

My guess is that /remote_cachell is on some remote share?

bartdesmet avatar Apr 18 '24 19:04 bartdesmet

My guess is that /remote_cachell is on some remote share?

Yes that's correct. However I would note that we use this remote storage for other caches (e.g. ccache) and have not noted any slowdown or contention.

andrewkcorcoran avatar Apr 22 '24 12:04 andrewkcorcoran

My guess is that /remote_cachell is on some remote share?

Yes that's correct. However I would note that we use this remote storage for other caches (e.g. ccache) and have not noted any slowdown or contention.

The reason I asked is that the lock is based on O_EXCL, and that may be troublesome over NFS (which I assume your remote share is using).

I've pushed another PR, https://github.com/matus-chochlik/ctcache/pull/56, which uses a different approach that breaks the lock down into smaller ones.
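
Roughly, the idea is to split the single stats file into several shards, each with its own lock, and pick a shard per update. The shard count and the choice of hash key below are only illustrative, not necessarily what the PR does:

import os
import zlib

NUM_SHARDS = 16  # illustrative shard count

def shard_stats_path(cache_dir, key):
    # Hash some per-invocation key to one of NUM_SHARDS stats files;
    # each shard has its own "<name>.lock", so concurrent clang-tidy
    # processes rarely compete for the same lock file.
    shard = zlib.crc32(key.encode("utf-8")) % NUM_SHARDS
    return os.path.join(cache_dir, "stats.%02d" % shard)

def read_total_stats(cache_dir):
    # Totals are the sum over all shards.
    hits, misses = 0, 0
    for shard in range(NUM_SHARDS):
        path = os.path.join(cache_dir, "stats.%02d" % shard)
        if os.path.isfile(path):
            with open(path, "r") as stats:
                h, m = (int(v) for v in stats.read().split())
                hits, misses = hits + h, misses + m
    return hits, misses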

We don't use a remote share, but when a build has a lot of cache hits and a lot of other file system access going on at the same time, we occasionally see the contention. That's on a machine with 40 cores using ninja, so we get 40 clang-tidy processes running concurrently.

With the sharding of the lock, I haven't seen it repro yet. So maybe you can give ~~that PR~~ the latest main a try as well?

bartdesmet avatar Apr 24 '24 03:04 bartdesmet