Reduce excessive logging
Bazel-remote is pretty verbose, logging all access information.
In particular this becomes problematic for the findMissingBlobs call, which may have thousands of entries, each of which causes a single line to be logged here: https://github.com/buchgr/bazel-remote/blob/6aa66a2b6098fb38823990d16915f05b12d1bd18/server/grpc_cas.go#L45
Note that bazel calls this for every action it processes, producing a huge amount of logs; in our case it caused issues on Kubernetes when processing stalled.
The workaround was to simply remove the most problematic log statements, but the logging could probably be improved, e.g. in this case don't log every individual CAS HEAD request but rather a summary of them per findMissingBlobs call (see the sketch below).
More generally, maybe accessLogger could log into a limited-size list of entries, and that list could be made accessible on demand through some web endpoint instead of writing everything to stdout.
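To make the summary idea concrete, here is a minimal Go sketch of what a per-request summary line could look like. The `digest` type, the `contains` callback and the handler shape are simplified stand-ins for illustration, not bazel-remote's actual types from grpc_cas.go:

```go
package main

import (
	"log"
	"os"
)

// accessLogger mimics bazel-remote's access logger writing to stdout.
var accessLogger = log.New(os.Stdout, "", log.LstdFlags)

// digest is a simplified stand-in for the REAPI Digest message
// (hash + size); it is not the real proto type.
type digest struct {
	Hash string
	Size int64
}

// findMissingBlobs checks which of the requested blobs are absent from the
// cache. Instead of logging one line per blob (which can produce thousands
// of lines per call), it emits a single summary line per request.
func findMissingBlobs(contains func(hash string) bool, blobs []digest) []digest {
	var missing []digest
	for _, b := range blobs {
		if !contains(b.Hash) {
			missing = append(missing, b)
		}
	}
	// One summary line per FindMissingBlobs call instead of one per blob.
	accessLogger.Printf("GRPC CAS FindMissingBlobs: %d requested, %d missing",
		len(blobs), len(missing))
	return missing
}

func main() {
	cache := map[string]bool{"aa": true, "bb": true}
	contains := func(hash string) bool { return cache[hash] }

	blobs := []digest{{Hash: "aa", Size: 1}, {Hash: "cc", Size: 2}, {Hash: "dd", Size: 3}}
	findMissingBlobs(contains, blobs)
}
```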
Can you describe the issues you encountered? Was there a bottleneck in bazel-remote, or did it simply flood your log daemon?
It would be pretty simple to add a flag to disable non-error logging, or maybe we could switch to a more efficient logger (zap?).
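For what it's worth, such a flag could be as simple as redirecting the access logger to io.Discard. A rough sketch with a made-up flag name (--disable_access_logging is not an existing bazel-remote flag):

```go
package main

import (
	"flag"
	"io"
	"log"
	"os"
)

func main() {
	// Hypothetical flag name, used here only for illustration.
	quiet := flag.Bool("disable_access_logging", false,
		"if set, suppress per-request access logs (errors are still logged)")
	flag.Parse()

	// Error logs always go to stderr.
	errorLogger := log.New(os.Stderr, "ERROR: ", log.LstdFlags)

	// Access logs go to stdout, or are thrown away via io.Discard when the
	// flag is set. Note the message is still formatted before being
	// discarded; a real implementation might skip the log call entirely.
	var accessOut io.Writer = os.Stdout
	if *quiet {
		accessOut = io.Discard
	}
	accessLogger := log.New(accessOut, "", log.LstdFlags)

	accessLogger.Printf("GRPC CAS HEAD %s OK", "deadbeef")
	errorLogger.Printf("example error: %v", io.ErrUnexpectedEOF)
}
```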
The symptom was that bazel-remote started to work very slowly after a few hours of operation. Note that it is used as a backend in a Buildbarn setup. Another symptom was that the memory usage of bazel-remote skyrocketed to 100G within an hour, and it kept behaving like this after restarts. This may be a red herring though; unfortunately I enabled profiling only after the incident, so I have no more useful data.
The actual root cause is not well understood by me, i.e. whether bazel-remote just became very slow because of the logging, or whether it triggered some threshold on the Docker / Kubernetes side. One thing that was clear from checking the logs is that I was seeing at least 5M log lines per second; because of log trimming I could not see more than that, so it may well have been even more than 5M/sec. After disabling some of the logs it came back to normal and had no issues for the following days.
I missed one important detail: when investigating the issue, I found in bazel stack traces that the bazel client was hanging on findMissingBlobs.
Reproducing the RPC call manually also took very long: even a call reduced from 2k to 100 entries still took 20 seconds. This led to the investigation above.
Vanilla bazel-remote 1.3 was used (with no proxy).
While checking this, I also considered changing the locking around cache.Contains: currently the mutex is acquired for each call to Contains, but since each check should be fast, holding a single lock for the whole findMissingBlobs could also be beneficial. This is speculative at this point; it needs measurements, and one has to be careful not to call the proxy while holding the lock (see the sketch below).
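A speculative sketch of that locking change, using simplified stand-ins (diskCache, MissingBlobs) rather than bazel-remote's real cache interface. The point is just that the batch check acquires the lock once and must not touch the proxy while holding it:

```go
package main

import (
	"fmt"
	"sync"
)

// diskCache is a simplified stand-in for bazel-remote's cache type.
type diskCache struct {
	mu    sync.Mutex
	items map[string]struct{}
}

// Contains takes the lock once per call, which is what the current
// per-blob loop in findMissingBlobs effectively does.
func (c *diskCache) Contains(hash string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.items[hash]
	return ok
}

// MissingBlobs takes the lock once for the whole batch. The loop body must
// stay cheap (no proxy/backend calls under the lock), otherwise this blocks
// all other cache operations for the duration of the request.
func (c *diskCache) MissingBlobs(hashes []string) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	var missing []string
	for _, h := range hashes {
		if _, ok := c.items[h]; !ok {
			missing = append(missing, h)
		}
	}
	return missing
}

func main() {
	c := &diskCache{items: map[string]struct{}{"aa": {}, "bb": {}}}
	fmt.Println(c.Contains("aa"))                           // true
	fmt.Println(c.MissingBlobs([]string{"aa", "cc", "dd"})) // [cc dd]
}
```

Whether the single lock actually wins depends on contention and batch sizes, hence the need for measurements.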