bazel icon indicating copy to clipboard operation
bazel copied to clipboard

Retain ackQueue as request event messages

Open werkt opened this issue 1 year ago • 2 comments

The queue of to-be-acknowledged events can grow indefinitely as a remote endpoint fails to deliver acknowledgements for event sequence numbers. This unbounded growth can leave references to resources associated with events in their rich form, particularly for file uploads (via futures), where Paths retain a reference to their FileSystem. Over the life of the build, this will prevent heap reclamation during garbage collection, and require heap retention relative to the number of actions' FileSystems, which will, in conjunction with Build without the Bytes remote execution, be RemoteActionFileSystems with unique maps of inputs.

This change reduces the footprint of the events ack queue strictly to the size of the message, which will at least by definition be less than the cost of retaining maps of all actions' inputs over the lifetime of the build, and ideally be a smaller footprint with less work to perform for ack sends of constructed messages on a requirement for retransmit.

werkt avatar Jun 11 '24 15:06 werkt

@bazelbuild/remote-execution Can anyone take a look at this?

werkt avatar Jul 18 '24 14:07 werkt

@tjgq per conversation

werkt avatar Oct 15 '24 18:10 werkt

This looks good to me. I would rename the new class (eg. SendSerializedBuildEventCommand) but the logic seems sound.

Unfortunately it doesn't merge cleanly at this point, so I'll have to merge it manually.

I want to stare at this a little bit more to make sure I'm not missing something obvious. But I think this is correct!

michaeledgar avatar Sep 25 '25 20:09 michaeledgar

A year's worth of changes has made this an extremely difficult exercise in a logical merge. I do not see an easy path to definitively reject any of the active runtime content without bleeding the proto details up into an area where they were clearly removed in that year. Maybe you see one that works more clearly to me @michaeledgar, but I'm concerned that any miss on severing the references that my original implementation was designed to avoid will just complicate the current code and leave the leaks effective. I don't relish the idea of having to find this again with the memory profile.

werkt avatar Sep 26 '25 03:09 werkt

I've done a rewrite of this with a rebase. I don't think it feels completely correct, or completely wrong: The serialized BuildEvent usage is ignored in BESUploader. I leave it to the client impl to provide a suitable StreamEvent for retry. The constructed message request is attached to it. The permission access to sealed classes is awkward - I didn't even think java would accept it as circularly as declared, but I didn't want to bleed the proto into the interface. I removed the future cancellation, the real problem with the leak, from the ackQueue entirely. Reason being that a regular message, the only one with an attached upload, can't make it into the ackQueue without already calling get() on the upload, making the cancellation meaningless. I pushed the lastEvent retry logic together with a tracker for its sequence number. Probably the cleanest part of this change.

werkt avatar Sep 26 '25 05:09 werkt

Sorry for not clarifying - I worked on my own merge, not suspecting you'd be available so quickly to discuss this change! I've pushed it to github here: https://github.com/bazelbuild/bazel/compare/master...michaeledgar:bazel:adgar-merge-22696

Our two merges are not that different, though I'm curious if you think the differences are important.

I've done a rewrite of this with a rebase. I don't think it feels completely correct, or completely wrong: The serialized BuildEvent usage is ignored in BESUploader. I leave it to the client impl to provide a suitable StreamEvent for retry. The constructed message request is attached to it.

The approach I went with was to just store the StreamEvent in the new Command implementation. AFAICT, the StreamEvent contains the serialized BuildEvent as a ByteString and then the only other interesting reference is the CommandContext - which has just a handful of Strings -- no references to large object graphs. This avoids adding another kind of StreamEvent - WDYT?

I removed the future cancellation, the real problem with the leak, from the ackQueue entirely. Reason being that a regular message, the only one with an attached upload, can't make it into the ackQueue without already calling get() on the upload, making the cancellation meaningless.

Good call!

michaeledgar avatar Sep 26 '25 14:09 michaeledgar

Sorry for not clarifying - I worked on my own merge, not suspecting you'd be available so quickly to discuss this change!

If I (or anyone) puts up a PR, it is reasonable to expect that they are interested enough in having it land to respond. In particular, putting up a post on the PR before expending any effort trying to manually merge them.

Our two merges are not that different, though I'm curious if you think the differences are important.

The approach I went with was to just store the StreamEvent in the new Command implementation. AFAICT, the StreamEvent contains the serialized BuildEvent as a ByteString and then the only other interesting reference is the CommandContext - which has just a handful of Strings -- no references to large object graphs. This avoids adding another kind of StreamEvent - WDYT?

I don't think that the differences are particularly important. Personally, I think the Any bazel_event payload and its interaction is odd, both in and out of this change, but it gets the job done.

The StreamEvent already has a sequenceNumber(), not sure why you're adding that to the Serialized Command - have it route through.

I removed the future cancellation, ...

Good call!

This is really the only value to this change and its effect on memory used. All of the other message retention stems from it. So long as we're not holding a reference to this PathConverter, the goal is achieved.

If you do go off on your own merge for a PR, please prepare a commensurate change to the 8.x branch, as some of us are going to be stuck with it for a good while longer.

werkt avatar Sep 26 '25 17:09 werkt

One more bug fixed in my merge: we need a way to cancel the pending file uploads for whatever command is in-flight (popped off the commandQueue) in the case of InterruptedException or LocalFileUploadException. This ensures we don't leave a single event's file uploads retrying in the background.

michaeledgar avatar Sep 29 '25 16:09 michaeledgar