OpenHands Refactor history/event stream

CHANGELOG Arch refactoring:

move most event filtering to the event stream
refactor how agents and agent helpers like prompt templates work with the event history
remove the ShortTermHistory class and move all its methods to the event stream or controller, as appropriate.

Tested with UI and main.py; new/old/restored sessions; delegates restore.

Note: I put the filtering delegates in the controller, although it could have been on the stream too. I don't know, I don't have a strong opinion about it, just it seems like adding dependencies on delegate events to the stream is a bit too detailed, and really the reason why delegates are filtered is very agent-related.

Fixes #3731

To run this PR locally, use the following command:

docker run -it --rm   -p 3000:3000   -v /var/run/docker.sock:/var/run/docker.sock   --add-host host.docker.internal:host-gateway   -e SANDBOX_RUNTIME_CONTAINER_IMAGE=ghcr.io/all-hands-ai/runtime:f85d485-nikolaik   --name openhands-app-f85d485   ghcr.io/all-hands-ai/runtime:f85d485

Sep 10 '24 17:09 enyst

This LGTM overall! Getting rid of STH seems like a good idea

Sep 10 '24 17:09 rbren

I'm not sure if we should let the agents access the event stream via State, or directly. The current draft does it directly, it has agents store a reference to the stream, and the effect seems like... a lot of passing around. It would be straightforward to give a ref to the state and be done, everyone would go from state.history -> state.event_stream. On the other hand, the event stream is a "permanent" fixture of the environment where agents run, our "source of truth", it's a little odd to store it in the runtime state (even if we wouldn't pickle it). What do you think? @rbren @neubig

Oct 02 '24 01:10 enyst

Giving the event_stream directly to the agent makes me a little nervous. There are lots of things it could hypothetically do (like publish events and subscribe to events) that really should be handled by the agent_controller.

The agent should only really need a static list of historical events. I wonder if there's a way to do that without duplicating state

Oct 02 '24 17:10 rbren

@rbren Maybe strictly subscribing isn't quite possible, but what is the purpose of the subscription mechanism, if only a few well behaved components respect it, and more than a dozen others access the source freely outside the mechanism?

The agent should only really need a static list of historical events. I wonder if there's a way to do that without duplicating state

We could get a minimally processed list of events, as an actual list with events, like this. What if we never save it, but re-compute it every step, does that make it better? 😅 Minimally processed might mean: basic filtering yes, not sure about delegates filters, summaries no, retrieval no. Then we need an intermediary, like History or the new PromptManager class, only we may want it to work from that list + State, to the elements that the agent needs in the prompt.
What if we don't send the agent a list of Events, but a list of Messages? Probably I'm missing something, but we have a Message class, as a helper serializing events for insertion in the prompt. As of now, it's not possible to go the other way around, but it could be: if we could convert it back to the Event... maybe we can send the agent a list of Messages. Hey, that's not a copy! 😅 Agent helpers, like PromptManager, might need to convert back and do what they may still need to do.
Are you sure it's a static list of events? It's almost fully static today. But an agent with memory will not have a static history. 🤔

Oct 02 '24 19:10 enyst

Some good questions here!

what is the purpose of the subscription mechanism, if only a few well behaved components respect it

I actually think it should be very open to anyone publishing/subscribing--but specifically agents should be excluded there, at least with the current centralized AgentController. But you have me thinking that maybe that's a mistake. Going to think more on this.

What if we don't send the agent a list of Events, but a list of Messages?

The event -> message conversion could definitely be a helpful utility, but we lose a lot of information there, so I do think we need to send the raw events

Are you sure it's a static list of events? It's almost fully static today. But an agent with memory will not have a static history.

IMO we should pass the complete static history to the agent in Python, and then let the agent decide when and where to condense before it passes it to the LLM. Probably we implement a reusable Memory that several agents use, but I'd like for different agents to be able to experiment with their own mem mgmt

Oct 03 '24 14:10 rbren

To clarify, it is not me supporting the idea of direct access for agents, this comes from @neubig in the linked issue:

Should we try to let all agents and the rest reference the event stream directly?

That seems fine to me?

Please consider this an experiment on the branch, for us to see what it looks like, see what it implies, think through what seems right or wrong about it.

Oct 03 '24 18:10 enyst

@rbren I think this is ready for another look. The controller took up a lot of the work, particularly the responsibility to set the agent's history.

It's still a bit rough on details, but I think it shaped up, and I'd appreciate an opinion: is it closer to what you had in mind? Do you see major issues?

My opinion overall is that we can work with this for a while. I admit I'm rather convinced that we will need/want another component that has access to both the event stream and runtime state, and is used by agents, but I've been wrong before! Maybe I'm just overthinking this part for some odd reason. 🤷

Oct 17 '24 18:10 enyst

This seems ready code-wise! I'm going to run an eval on it though just to make sure we don't regress

@rbren please do the run with log_completions=true in config.toml / llm_config if you can. I'd love some of those results! They can be very helpful easily.

Oct 30 '24 14:10 enyst

@mamoodi do we have log_completions support in the eval job?

Oct 30 '24 15:10 rbren

The workflow merged yesterday? It might be broken with function calling actually 😢 https://github.com/All-Hands-AI/OpenHands/pull/4489#issuecomment-2444422162

According to my local tests, DeepSeek doesn't work on main after the tool use PR, it expects a different format than Anthropic, and I don't know yet how well that works after that. This PR would disable its function calling until it works.

Oct 30 '24 15:10 enyst

ah no different workflow that Mahmoud has been working on internally :)

Oct 30 '24 15:10 rbren

@mamoodi do we have log_completions support in the eval job?

Not right now but can be added easily (just adding it to the config.toml) if this is something we need. Should the default be that that is set to true?

Oct 30 '24 19:10 mamoodi

I ran a subset of 93 tasks, the intersection between swe-bench lite and verified, and it can do 63/93. The best result is 64/93.

It looks safe to me. I don't know what is the variance after function calling, before it was up to 12/93 up or down on this subset, and 17/93 seen for 4o.

@rbren please feel free to run a full one if you'd like. I should probably fix the conflicts.

Nov 01 '24 14:11 enyst