
Reduce excessive logging

Open glukasiknuro opened this issue 3 years ago • 3 comments

Bazel-remote is pretty verbose, logging all access information.

In particular, this becomes problematic for the findMissingBlobs call, which may have thousands of entries, each of which causes a single line to be logged here: https://github.com/buchgr/bazel-remote/blob/6aa66a2b6098fb38823990d16915f05b12d1bd18/server/grpc_cas.go#L45

Note that bazel makes this call for every action it processes, producing a huge amount of logs - in our case this caused some issues on Kubernetes, where processing stalled.

The workaround was to simply remove the most problematic log statements, but the logging could possibly be improved - e.g. in this case, instead of logging every single CAS HEAD request, log one summary line per findMissingBlobs call.
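A minimal sketch of that idea - the function name and message format below are hypothetical, not part of bazel-remote; the point is just one log line per FindMissingBlobs call instead of one per digest:

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// summaryLine formats a single line summarizing a whole FindMissingBlobs
// call, replacing the per-digest "GRPC CAS HEAD" lines.
func summaryLine(requested, missing int) string {
	return fmt.Sprintf("GRPC CAS HEAD summary: %d of %d blobs missing", missing, requested)
}

func main() {
	logger := log.New(os.Stdout, "", 0)
	// One log line for a 2000-entry request, instead of 2000 lines.
	logger.Print(summaryLine(2000, 37))
}
```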

In general, maybe accessLogger could log into a bounded list of entries, and this list could be made accessible on demand through some web endpoint instead of writing everything to stdout.
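One way to bound the list is a ring buffer. Everything below is a hypothetical sketch (bazel-remote has no such type); a debug HTTP handler could serve Snapshot() on demand:

```go
package main

import (
	"fmt"
	"sync"
)

// ringLogger keeps only the most recent N access-log entries in memory,
// so they can be served from a debug endpoint instead of flooding stdout.
type ringLogger struct {
	mu      sync.Mutex
	entries []string
	next    int
	full    bool
}

func newRingLogger(size int) *ringLogger {
	return &ringLogger{entries: make([]string, size)}
}

// Printf records a formatted entry, overwriting the oldest when full.
func (r *ringLogger) Printf(format string, args ...interface{}) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.entries[r.next] = fmt.Sprintf(format, args...)
	r.next = (r.next + 1) % len(r.entries)
	if r.next == 0 {
		r.full = true
	}
}

// Snapshot returns the buffered entries, oldest first.
func (r *ringLogger) Snapshot() []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.full {
		return append([]string(nil), r.entries[:r.next]...)
	}
	out := append([]string(nil), r.entries[r.next:]...)
	return append(out, r.entries[:r.next]...)
}

func main() {
	rl := newRingLogger(3)
	for i := 1; i <= 5; i++ {
		rl.Printf("GRPC CAS HEAD %d OK", i)
	}
	// Only the 3 most recent entries survive.
	fmt.Println(rl.Snapshot())
}
```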

glukasiknuro avatar Mar 19 '21 17:03 glukasiknuro

Can you describe the issues you encountered? Was there a bottleneck in bazel-remote, or did it simply flood your log daemon?

It would be pretty simple to add a flag to disable non-error logging, or maybe we could switch to a more efficient logger (zap?).

mostynb avatar Mar 19 '21 17:03 mostynb

The symptom was that bazel-remote started to work very slowly after a few hours of running. Note that it's used as a backend in a Buildbarn setup. Another symptom was that the memory usage of bazel-remote skyrocketed to 100G in one hour, and it kept behaving like this after restarts. This may be a red herring, though - unfortunately I enabled profiling only after the incident, so I have no more useful data.

The actual root cause is not very well understood by me - whether bazel-remote simply became very slow because of logging, or whether it triggered some threshold on the docker / Kubernetes side. One thing that was clear: checking the logs, I was seeing at least 5M log lines per second - and because of log trimming I was not able to see more than that, so it could have been even more than 5M/sec. After disabling some logs, everything came back to normal, with no issues over the following days.

glukasiknuro avatar Mar 19 '21 17:03 glukasiknuro

Missed one important detail: when investigating the issue, I found in bazel stack traces that the bazel client was hanging on findMissingBlobs. Reproducing the RPC call manually was very slow - even a call reduced from 2k to 100 entries still took 20 seconds. This led to the investigation above.

Vanilla bazel-remote 1.3 was used (with no proxy).

While checking this, I also considered changing the locking around cache.Contains - currently the mutex is acquired for each Contains call, but since each call should be fast, holding a single lock for the whole findMissingBlobs could also be beneficial. This is speculative at this point; I need to do some measurements, and need to be careful not to call the proxy while holding the lock.
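A sketch of that locking change, with a toy cache standing in for the real one (all names below are hypothetical, not bazel-remote code): the batch is checked under one lock acquisition, and any proxy lookup would happen afterwards, outside the lock.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeCache is a stand-in for the real disk cache.
type fakeCache struct {
	mu    sync.Mutex
	items map[string]bool
}

// containsAll checks a whole batch of keys under a single lock
// acquisition (instead of locking once per key) and returns the keys
// that are missing locally.
func (c *fakeCache) containsAll(keys []string) (missing []string) {
	c.mu.Lock()
	for _, k := range keys {
		if !c.items[k] {
			missing = append(missing, k)
		}
	}
	c.mu.Unlock()
	// A proxy lookup for the missing keys would go here, after the
	// mutex is released, so slow network calls never hold the lock.
	return missing
}

func main() {
	c := &fakeCache{items: map[string]bool{"a": true, "c": true}}
	fmt.Println(c.containsAll([]string{"a", "b", "c", "d"}))
}
```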

glukasiknuro avatar Mar 19 '21 18:03 glukasiknuro