buildbot icon indicating copy to clipboard operation
buildbot copied to clipboard

process: defer log processing to thread

Open tdesveaux opened this issue 11 months ago • 5 comments

RemoteCommand log processing (eg. split to lines) is done on the main thread. On our instance that can run dozens of builds which each generate large logs concurrently, it causes the web UI to be overall sluggish.

Note that buildbot.test.integration.interop.test_commandmixin.CommandMixinMasterPB.test_commandmixin has a flaky failure introduce by the last commit (changes on RemoteCommand).

twisted.spread.pb.PBConnectionLost: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionDone'>: Connection was closed cleanly.

Looks more like an issue with tests not shutting something properly? Not sure where to look at this, if you have any pointer.

Contributor Checklist:

  • [n/a] I have updated the unit tests
  • [x] I have created a file in the newsfragments directory (and read the README.txt in that directory)
  • [n/a] I have updated the appropriate documentation

tdesveaux avatar Jan 28 '25 18:01 tdesveaux

In cases where master activity starts to affect web responsivity, it makes sense to start looking into multimaster setup. That would solve it properly with possibility to scale as Buildbot will no longer be bound by a single thread.

p12tic avatar Feb 07 '25 20:02 p12tic

In cases where master activity starts to affect web responsivity, it makes sense to start looking into multimaster setup. That would solve it properly with possibility to scale as Buildbot will no longer be bound by a single thread.

It's not that impactful and is only a bit annoying at times where lots of jobs are running, so it's mostly "side-work" for me. Also, we tried multi-master some years ago on a small instance, and encountered issues with the masters disconnecting from crossbar, requiring a restart to restore the connection. So I'm a bit reluctant to use it on an instance which has an heavier load.

When I looked into this and saw the issue of CPU intensive work being made on the main thread, and our 8-CPU server sits at a fairly low CPU usage (never more than ~35%). Profiling shown the cause this patch aim to change. Since log processing work is mostly done in regex code, which should release the GIL, so threading should help.

Regarding the failing tests here, I think it's due to remote_update now awaiting log processing. Anyway, in turn, delay the remote_complete call from the worker, now being called AFTER the interrupt was called. The fact I can't reproduce would then make sense as I'm less resource constrained locally.

I believe it's a good thing to not fire and forget log processing to prevent a potential overloading of the master by one worker generating logs faster than the master could process them. ~So I think catching the exception when the command was interrupted is the correct fix here~ (edit: it's not working at all, ConnectionDone can't be caught like that, I was too hasty).

tdesveaux avatar Feb 10 '25 17:02 tdesveaux

Sounds fair. I will think a bit about this PR.

p12tic avatar Feb 10 '25 22:02 p12tic

Don't worry about it too much. It's more a "this doesn't sit right with me" rather than an issue.

I might try out using msgpack workers since the split lines is done worker-side, whichi would solve this issue.

tdesveaux avatar Feb 12 '25 10:02 tdesveaux

Since I got another alert about this.

Another issue with this is that we use a simple HTTP query on / as a health check (maybe there is a better way?) and we get false negative some times as the master doesn't answer since it's doing this stuff on it's main thread.

tdesveaux avatar May 14 '25 12:05 tdesveaux