OpenDevin WebSocket API

Open yimothysu opened this issue 2 months ago • 2 comments

It seems to me that the frontend is primarily displaying what OpenDevin is doing to the user for visibility. The actual agent is implemented on the backend.

We'll therefore want to stream a lot of information from the backend to the frontend via WebSockets and/or Server-Sent Events. Each module of OpenDevin should receive its own events.

Below is a draft of what the events for such a WebSocket API might look like.

Terminal

terminal writes to the terminal. terminal.write(...) is a function in xterm.js, so we can forward the terminal sequences directly from the backend to the frontend. The payload might look like

{
    "content": "\u001b[1;3;31mOpenDevin\u001b[0m $"
}
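As a minimal sketch of how the backend might serialize such a payload (the helper name terminal_event is hypothetical, not part of any existing OpenDevin code): JSON has no \x escape, so a standard serializer emits the escape character as \u001b, and the frontend can pass the decoded string straight to terminal.write(...).

```python
import json

ESC = "\x1b"  # the ANSI escape character that starts each control sequence


def terminal_event(content: str) -> str:
    """Serialize raw terminal output into the draft terminal payload.

    Hypothetical helper: only the "content" field comes from the draft above.
    """
    return json.dumps({"content": content})


# Bold, italic, red "OpenDevin", then reset attributes and print a prompt.
wire = terminal_event(f"{ESC}[1;3;31mOpenDevin{ESC}[0m $")

# Decoding on the receiving side restores the original escape sequences.
decoded = json.loads(wire)["content"]
```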

Planner

planner writes to the planner in Markdown format, which the frontend renders. We could reuse the same payload as the code endpoint below, since the planner state can be represented as a single .md file.

Code

code streams code, which the frontend renders syntax-highlighted in a code editor. The code may be stored in a string array, where each element is a line of code. The payload might look like

{
    "line": 109,
    "change": "INSERT",
    "content": [
        "with open(\"tmp.txt\") as f:",
        "\tcontent = f.read()"
    ]
}

line: the line number at which the code change begins
change: the type of change being made ("INSERT" or "DELETE")
content: the lines of code to insert
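A rough sketch of how a receiver might apply such an event to a buffer of lines. Two things here are assumptions, since the draft does not specify them: line numbers are treated as 1-based, and a DELETE removes as many lines as its content array holds. The name apply_change is hypothetical.

```python
from typing import List, Literal, TypedDict


class CodeChange(TypedDict):
    line: int                            # assumed 1-based
    change: Literal["INSERT", "DELETE"]
    content: List[str]


def apply_change(lines: List[str], event: CodeChange) -> List[str]:
    """Return a new list of lines with the change applied."""
    start = event["line"] - 1
    if not 0 <= start <= len(lines):
        raise ValueError(f"line {event['line']} is outside the buffer")

    if event["change"] == "INSERT":
        # Insert the new lines so the first one lands at position `line`.
        return lines[:start] + event["content"] + lines[start:]

    if event["change"] == "DELETE":
        # Assumption: the number of deleted lines equals len(content).
        count = len(event["content"])
        return lines[:start] + lines[start + count:]

    raise ValueError(f"unknown change type: {event['change']}")


buffer = ["import os", "print(os.getcwd())"]
buffer = apply_change(
    buffer,
    {"line": 2, "change": "INSERT",
     "content": ['with open("tmp.txt") as f:', "\tcontent = f.read()"]},
)
```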

Browser

navigate navigates to a URL and sends a screenshot every second (or on every page change). The frontend displays this URL and screenshot.

It's possible to render an <iframe />, but:

- this seems unnecessary because the backend already needs to access pages via Selenium
- this can have security/reliability issues (such as CORS)

The payload might look like

{
    "url": "https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html",
    "screenshot": "data:image/png;base64, ..."
}
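A small sketch of building this payload from raw PNG bytes, assuming the screenshot is delivered as a base64 data URL as the draft suggests. The helper name browser_event is hypothetical, and how the bytes are captured is left out.

```python
import base64
from typing import Dict


def browser_event(url: str, png_bytes: bytes) -> Dict[str, str]:
    """Build the draft browser payload from a URL and raw PNG bytes."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return {
        "url": url,
        "screenshot": f"data:image/png;base64,{encoded}",
    }


# Placeholder bytes: just the 8-byte PNG signature, not a real image.
fake_png = b"\x89PNG\r\n\x1a\n"
event = browser_event("https://pytorch.org/docs/stable/", fake_png)
```

The frontend can assign the screenshot value directly to an image source, since a data URL is self-describing.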

Mar 18 '24 04:03 yimothysu

What do you think about using the diff patch format for code changes? Since SWE-bench seems to require results in diff format, we could reuse the same format there. On the other hand, our frontend may handle the format you suggest more easily.

Mar 18 '24 22:03 enyst

Using the diff patch format is possible, but requires more preprocessing. We can also run SWE-bench headless and use git to generate diffs.
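For comparison, here is a sketch of producing a unified diff without git, using Python's standard difflib module. This is only an illustration of the preprocessing involved, not what the thread settled on; the function name to_unified_diff and the a/ and b/ path prefixes are assumptions.

```python
import difflib
from typing import List


def to_unified_diff(before: List[str], after: List[str], path: str) -> str:
    """Render the difference between two versions of a file as a unified diff."""
    diff = difflib.unified_diff(
        before,
        after,
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
        lineterm="",  # the input lines carry no trailing newlines
    )
    return "\n".join(diff)


before = ["x = 1"]
after = ["x = 1", "y = 2"]
patch = to_unified_diff(before, after, "tmp.py")
```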

Mar 19 '24 00:03 yimothysu

I like the idea of using code to do the markdown plan. The agent tries to write markdown sometimes anyways--if we can just tell it to always use DevinPlan.md, that will kill two birds with one stone.

Mar 19 '24 21:03 rbren

I also like the idea of file read/write going over the wire, instead of the agent editing files directly (which is currently what my agent implementation does).

For browse, IMO we'll get a lot more mileage by sending HTML instead of screenshots. 1 screenshot per second would be a lot to process.

Mar 19 '24 21:03 rbren

For code edits, we'll probably also want to be able to replace a range of lines. I.e. "replace lines 60-100 with this new code"

Mar 19 '24 21:03 rbren

FYI: I have an implementation of the websocket handshake here (but with zero of the operations above): https://github.com/OpenDevin/OpenDevin/pull/57

Mar 19 '24 21:03 rbren

@rbren

> For browse, IMO we'll get a lot more mileage by sending HTML instead of screenshots. 1 screenshot per second would be a lot to process.

Totally, I should have specified this is for server : frontend communication. The server (or perhaps agent) should spin up a Selenium instance. The HTML is sent from Selenium to the agent while a screenshot of the current webpage in Selenium should be sent to the frontend per page change.

> For code edits, we'll probably also want to be able to replace a range of lines. I.e. "replace lines 60-100 with this new code"

This is equivalent to a 40-line DELETE followed by an INSERT. Do you think we should have an explicit REPLACE change type?

Mar 19 '24 22:03 yimothysu

> Totally, I should have specified this is for server : frontend communication. The server (or perhaps agent) should spin up a Selenium instance. The HTML is sent from Selenium to the agent while a screenshot of the current webpage in Selenium should be sent to the frontend per page change.

Awesome, agree. pyppeteer could be worth investigating.

> This is equivalent to a 40-line DELETE followed by an INSERT. Do you think we should have an explicit REPLACE change type?

As far as what the LLM will do, it'll be much easier for it to REPLACE than do a DELETE followed by an INSERT--at least if we're limiting it to 1 action per prompt (currently the case, but up for discussion)

Though we could always "translate" an LLM replace command into a delete+insert
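A sketch of that translation, reusing the line/change/content fields from the draft code payload. The assumptions: the range is 1-based and inclusive, and a DELETE event carries the removed lines in content. The name translate_replace is hypothetical.

```python
from typing import Dict, List


def translate_replace(
    lines: List[str], start: int, end: int, new_lines: List[str]
) -> List[Dict]:
    """Turn "replace lines start..end with new_lines" into DELETE + INSERT."""
    if not 1 <= start <= end <= len(lines):
        raise ValueError(f"invalid range {start}-{end}")

    removed = lines[start - 1:end]  # inclusive range, 1-based
    return [
        {"line": start, "change": "DELETE", "content": removed},
        # After the delete, the new code goes in at the same position.
        {"line": start, "change": "INSERT", "content": new_lines},
    ]


events = translate_replace(["a", "b", "c", "d"], 2, 3, ["X"])
```

Both events share the same starting line because the INSERT is applied to the buffer as it stands after the DELETE.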

Mar 19 '24 22:03 rbren

Makes sense, either way seems reasonable to me!

Mar 19 '24 22:03 yimothysu

I have a first pass at a websocket API here: https://github.com/OpenDevin/OpenDevin/pull/97

The client opens a websocket, and then client and server pass JSON messages back and forth. Both client and server messages have the same format:

action: the action being taken (e.g. run command, write file)
args: any arguments being passed to that action
message: a human readable message.
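One way to model this shared format, as a sketch rather than the code in the linked PR. The class name Message is hypothetical, and every field is treated as optional because the examples in this thread include messages with no action and actions with no message.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Message:
    action: Optional[str] = None
    args: Dict[str, Any] = field(default_factory=dict)
    message: Optional[str] = None

    @classmethod
    def from_json(cls, raw: str) -> "Message":
        data = json.loads(raw)
        if not isinstance(data, dict):
            raise ValueError("a message must be a JSON object")
        return cls(
            action=data.get("action"),
            args=data.get("args") or {},
            message=data.get("message"),
        )

    def to_json(self) -> str:
        payload: Dict[str, Any] = {"args": self.args}
        # Leave out the optional fields when they are not set.
        if self.action is not None:
            payload["action"] = self.action
        if self.message is not None:
            payload["message"] = self.message
        return json.dumps(payload)


status = Message.from_json('{"message": "Starting new agent..."}')
command = Message(action="run", args={"command": "ls"})
```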

A typical flow would look like this:

User:

{"action": "start", "args": {"task": "write a bash script that prints hello"}}

Server:

{"message":"Starting new agent..."}
{"action":"run","message":"Running command: ls","args":{"command":"ls"}}
{"action":"output","message":"Got output.","args":{"output":"LICENSE\nOpenDevinLogo.jpg\nREADME.md\nagenthub\nenv_name\nevaluation\nfrontend\nhello.sh\nopendevin\nrequirements.txt\nserver\nworkspace\n"}}
{"action":"read","message":"Reading file: hello.sh","args":{"path":"hello.sh"}}
{"action":"output","message":"Got output.","args":{"output":"#!/bin/bash\necho \"hello\""}}
{"action":"run","message":"Running command: bash hello.sh","args":{"command":"bash hello.sh"}}
{"action":"output","message":"Got output.","args":{"output":"hello\n"}}
{"action":"think","message":"I've successfully executed the bash script hello.sh which printed 'hello'. My primary task is complete. It's time to finalize my work.","args":{"thought":"I've successfully executed the bash script hello.sh which printed 'hello'. My primary task is complete. It's time to finalize my work."}}
{"action":"finish","message":"Finished!","args":{}}
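A client could consume a stream like this with a simple loop that stops at finish. The sketch below works on any iterable of JSON strings, so it makes no assumption about the websocket library in use; the function name collect_until_finish is hypothetical.

```python
import json
from typing import Iterable, List, Optional, Tuple


def collect_until_finish(stream: Iterable[str]) -> List[Tuple[Optional[str], str]]:
    """Read server messages in order and return (action, message) pairs.

    Stops after the first message whose action is "finish".
    """
    seen: List[Tuple[Optional[str], str]] = []
    for raw in stream:
        event = json.loads(raw)
        action = event.get("action")  # absent on plain status messages
        seen.append((action, event.get("message", "")))
        if action == "finish":
            break
    return seen


sample = [
    '{"message":"Starting new agent..."}',
    '{"action":"run","message":"Running command: ls","args":{"command":"ls"}}',
    '{"action":"finish","message":"Finished!","args":{}}',
    '{"action":"run","message":"never reached","args":{}}',
]
transcript = collect_until_finish(sample)
```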

The user can also issue actions like

{"action":"run", "args":{"command":"git commit -a -m 'save work'"}}

Mar 23 '24 20:03 rbren

I think we can close this one now

Mar 25 '24 15:03 rbren
