OpenHands [NOT FOR MERGE] Adjust prompt to use view command

Give a summary of what the PR does, explaining any non-trivial design decisions

This is the prompt adjustment I used, with the purpose that the LLM uses more the view tool for directories, which is part of its file_editor tool, than other options (ls -R /workspace or ls -la /workspace).

This is used for the initial exploration of the repository a lot.

I think it would be interesting to eval this after Ryan's fix is merged in main.

Edited to add: in the event stream of CodeAct 2.2 swe-bench run, the agent uses ls -R /workspace a lot, and ls -la sometimes.

To run this PR locally, use the following command:

docker run -it --rm   -p 3000:3000   -v /var/run/docker.sock:/var/run/docker.sock   --add-host host.docker.internal:host-gateway   -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:44a6bd9-nikolaik   --name openhands-app-44a6bd9   docker.all-hands.dev/all-hands-ai/openhands:44a6bd9

Dec 10 '24 07:12 enyst

There we go:

THOUGHT: I'll help you implement the necessary changes to fix the issue with `django.utils.http.parse_http_date`. Let's follow the steps you outlined.

1. First, let's explore the repository structure:
CODE:
print(file_editor(**{'command': 'view', 'path': '/workspace/django__django__3.1'}))

OBSERVATION
[Agent Controller default] **IPythonRunCellObservation**
Here's the files and directories up to 2 levels deep in /workspace/django__django__3.1, excluding hidden items:
/workspace/django__django__3.1
/workspace/django__django__3.1/AUTHORS
/workspace/django__django__3.1/CONTRIBUTING.rst
/workspace/django__django__3.1/FUNDING.yml
/workspace/django__django__3.1/Gruntfile.js
/workspace/django__django__3.1/INSTALL
/workspace/django__django__3.1/LICENSE
/workspace/django__django__3.1/LICENSE.python
...

Dec 10 '24 14:12 enyst

I ran 13 instances that are unresolved (0/13) in the CodeAct 2.2 results. They're all on django, and all part of the intersection of Lite with Verified.

CodeAct2.2: 0/13 Branch: 1/13.

Too little to matter, but FWIW! @xingyaoww

Dec 10 '24 16:12 enyst

I'm thinking about whether we should still make this change in the prompt, as encouraging the agent to use view over ls -R can save us on tokens, hence allowing the agent to execute more steps before reaching the context limit 🤔

Dec 11 '24 14:12 ryanhoangt

Running evaluation on the PR. Once eval is done, the results will be posted.

Dec 13 '24 11:12 github-actions[bot]

Evaluation results: ## Summary

submitted instances: 30
empty patch instances: 12
resolved instances: 8
unresolved instances: 22
error instances: 0

Empty patches were from the litellm proxy error:

2024-12-13 11:47:01,561 - ERROR - [Agent Controller default] Error while running the agent: litellm.NotFoundError: NotFoundError: OpenAIException - Error code: 404 - {'error': {'message': 'litellm.NotFoundError: AnthropicException - {"type":"error","error":{"type":"not_found_error","message":"model: *"}}\nReceived Model Group=claude-3-5-sonnet-20241022.......
 'code': '404'}}

Dec 13 '24 12:12 openhands-agent

Haven't automated this part yet so here ya go: evaluation.zip

Dec 13 '24 13:12 mamoodi

@openhands-agent Your last attempt to fix the conflicts didn't work. Please do this again: pull main into this branch and fix the conflicts.

Jan 05 '25 02:01 enyst

OpenHands started fixing the pr! You can monitor the progress here.

Jan 05 '25 02:01 openhands-agent

@xingyaoww What are your thoughts on this one?

the 13 instances eval got a small improvement
the 30 instances is inconclusive (12 instances ran into an litellm error, bad day).

In regular use the past month, with the resolver, the llm asks for the view command quite regularly. But in the last official eval (CodeAct 2.2), it doesn't, it uses mostly ls -R /workspace, and I think view really should be better, at least on a large repo like django.

Jan 21 '25 14:01 enyst

@enyst hmm - i can probably run a larger-scale (100 instance) one later today?

Jan 21 '25 14:01 xingyaoww

@enyst hmm - i can probably run a larger-scale (100 instance) one later today?

OK, but I can do that, if the remote runtime cooperates today. Or can we sweet-talk Mamoodi to help? ❤️

Jan 21 '25 14:01 enyst

❤️ if it is easy, could you run one? :D LMK if you need more LLM credits and/or remote runtime concurrency. Otherwise let's see if @mamoodi have the bandwidth to help 🙏

Jan 21 '25 14:01 xingyaoww

I'll give it a go!

Jan 21 '25 14:01 enyst

This PR branch:

Summary

submitted instances: 100
empty patch instances: 5
resolved instances: 48
unresolved instances: 52
error instances: 0

Best from another PR:

Summary

submitted instances: 100
empty patch instances: 14
resolved instances: 43
unresolved instances: 57
error instances: 0

Last known main: 41 / 100

It looks good! @xingyaoww full archive is on slack

Jan 21 '25 21:01 enyst

@enyst are you running with max iteration of 100 or 30?

Jan 21 '25 21:01 xingyaoww

30: > claude-3-5-sonnet-20241022_maxiter_30_N_v0.20.0-no-hint-run_1

Jan 21 '25 21:01 enyst

Remove the NOT FOR MERGE before merging? :)

Jan 21 '25 21:01 mamoodi

Oh, indeed Django made the difference! It's by far the largest repo: (x = branch, y = main)

django:
Difference: 8 instances!
X resolved but Y failed: (12 instances)
  ['django__django-11066', 'django__django-11179', 'django__django-11265', 'django__django-11276', 'django__django-12155', 'django__django-12262', 'django__django-12276', 'django__django-12304', 'django__django-12708', 'django__django-12858', 'django__django-13028', 'django__django-13112']
Y resolved but X failed: (4 instances)
  ['django__django-11815', 'django__django-12039', 'django__django-12273', 'django__django-13033']

astropy:
Difference: 3 instances!
X resolved but Y failed: (4 instances)
  ['astropy__astropy-12907', 'astropy__astropy-14096', 'astropy__astropy-14539', 'astropy__astropy-14995']
Y resolved but X failed: (1 instances)
  ['astropy__astropy-14365']

Jan 21 '25 22:01 enyst

Very weird.. after merging this into one of my branch and running a full SWE-Bench verified (compared to our prev 53% run) -- it django actually got a lot of failed :(

I suspect it is because "view" only go up-to two level depth. And at two level, it didn't show the agent which folder is expandable or not.

I'd suggest we can probably show the type of file/folder in the output of view command:

/workspace/django__django__3.0/django/middleware # folder:
/workspace/django__django__3.0/django/shortcuts.py # file
/workspace/django__django__3.0/django/template/ # folder: X files under this directory

Jan 22 '25 21:01 xingyaoww

That is very weird, it doesn't list a directory? How exactly does it get confused? I would love to look into the llm_completions of the failed instances.

The closest I've seen in the previous run looked OK actually, when the LLM needed more depth it did something like this:

let's explore ... view /workspace/django
I see our problem is in <subdirectory>, so let's explore it ... view /workspace/django/subdirectory
Now I understand what happens.

Jan 23 '25 07:01 enyst