[NOT FOR MERGE] Adjust prompt to use view command
This is the prompt adjustment I used. Its purpose is to get the LLM to use the view command of its file_editor tool for directories more often, instead of other options (ls -R /workspace or ls -la /workspace).
Directory listing is used a lot during the initial exploration of the repository.
I think it would be interesting to eval this after Ryan's fix is merged in main.
Edited to add: in the event stream of the CodeAct 2.2 swe-bench run, the agent uses ls -R /workspace a lot, and ls -la sometimes.
To run this PR locally, use the following command:
docker run -it --rm -p 3000:3000 -v /var/run/docker.sock:/var/run/docker.sock --add-host host.docker.internal:host-gateway -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:44a6bd9-nikolaik --name openhands-app-44a6bd9 docker.all-hands.dev/all-hands-ai/openhands:44a6bd9
There we go:
THOUGHT: I'll help you implement the necessary changes to fix the issue with `django.utils.http.parse_http_date`. Let's follow the steps you outlined.
1. First, let's explore the repository structure:
CODE:
print(file_editor(**{'command': 'view', 'path': '/workspace/django__django__3.1'}))
OBSERVATION
[Agent Controller default] **IPythonRunCellObservation**
Here's the files and directories up to 2 levels deep in /workspace/django__django__3.1, excluding hidden items:
/workspace/django__django__3.1
/workspace/django__django__3.1/AUTHORS
/workspace/django__django__3.1/CONTRIBUTING.rst
/workspace/django__django__3.1/FUNDING.yml
/workspace/django__django__3.1/Gruntfile.js
/workspace/django__django__3.1/INSTALL
/workspace/django__django__3.1/LICENSE
/workspace/django__django__3.1/LICENSE.python
...
I ran 13 instances that are unresolved (0/13) in the CodeAct 2.2 results. They're all on django, and all part of the intersection of Lite with Verified.
CodeAct2.2: 0/13 Branch: 1/13.
Too little to matter, but FWIW! @xingyaoww
I'm thinking about whether we should still make this change in the prompt, as encouraging the agent to use view over ls -R can save us on tokens, hence allowing the agent to execute more steps before reaching the context limit 🤔
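A rough way to sanity-check the token argument is to compare how much text a full recursive listing produces against a listing capped at two levels. The sketch below is only an illustration: `listing` is a hypothetical helper, not the file_editor implementation, it only approximates what `ls -R` and view print, and character count is used as a crude stand-in for tokens.

```python
import os
import tempfile


def listing(root, max_depth=None):
    """Return non-hidden paths under root, optionally capped at max_depth levels."""
    root = os.path.abspath(root)
    base = root.count(os.sep)
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip hidden entries and keep a stable order.
        dirnames[:] = sorted(d for d in dirnames if not d.startswith("."))
        visible = dirnames + sorted(f for f in filenames if not f.startswith("."))
        lines.extend(os.path.join(dirpath, name) for name in visible)
        depth = dirpath.count(os.sep) - base  # 0 for root itself
        if max_depth is not None and depth + 1 >= max_depth:
            dirnames[:] = []  # stop os.walk from descending further
    return lines


# Toy tree (an assumption for the demo): pkg/top.py and pkg/a/b/c/f0.py .. f4.py
with tempfile.TemporaryDirectory() as tmp:
    deep = os.path.join(tmp, "pkg", "a", "b", "c")
    os.makedirs(deep)
    open(os.path.join(tmp, "pkg", "top.py"), "w").close()
    for i in range(5):
        open(os.path.join(deep, f"f{i}.py"), "w").close()

    full = listing(tmp)                  # everything, like a recursive listing
    shallow = listing(tmp, max_depth=2)  # two levels, like the view output above

print(len(full), "paths vs", len(shallow), "paths")
print(sum(len(p) + 1 for p in full), "chars vs", sum(len(p) + 1 for p in shallow), "chars")
```

On a repo the size of django the gap would be much larger than on this toy tree, but the actual savings depend on the tokenizer and on how the real tool formats its output.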
Running evaluation on the PR. Once eval is done, the results will be posted.
Evaluation results:
## Summary
- submitted instances: 30
- empty patch instances: 12
- resolved instances: 8
- unresolved instances: 22
- error instances: 0
Empty patches were from the litellm proxy error:
2024-12-13 11:47:01,561 - ERROR - [Agent Controller default] Error while running the agent: litellm.NotFoundError: NotFoundError: OpenAIException - Error code: 404 - {'error': {'message': 'litellm.NotFoundError: AnthropicException - {"type":"error","error":{"type":"not_found_error","message":"model: *"}}\nReceived Model Group=claude-3-5-sonnet-20241022.......
'code': '404'}}
Haven't automated this part yet so here ya go: evaluation.zip
@openhands-agent Your last attempt to fix the conflicts didn't work. Please do this again: pull main into this branch and fix the conflicts.
@xingyaoww What are your thoughts on this one?
- the 13-instance eval showed a small improvement (0/13 to 1/13)
- the 30-instance eval is inconclusive (12 instances ran into a litellm error, bad day).
In regular use over the past month, with the resolver, the LLM asks for the view command quite regularly. But in the last official eval (CodeAct 2.2) it doesn't: it mostly uses ls -R /workspace. I think view really should be better, at least on a large repo like django.
@enyst hmm - i can probably run a larger-scale (100 instance) one later today?
OK, but I can do that, if the remote runtime cooperates today. Or can we sweet-talk Mamoodi to help? ❤️
❤️ if it is easy, could you run one? :D LMK if you need more LLM credits and/or remote runtime concurrency. Otherwise let's see if @mamoodi has the bandwidth to help 🙏
I'll give it a go!
This PR branch:
Summary
- submitted instances: 100
- empty patch instances: 5
- resolved instances: 48
- unresolved instances: 52
- error instances: 0
Best from another PR:
Summary
- submitted instances: 100
- empty patch instances: 14
- resolved instances: 43
- unresolved instances: 57
- error instances: 0
Last known main: 41 / 100
It looks good! @xingyaoww full archive is on slack
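For comparing the two 100-instance summaries, it may help to look at the resolve rate both overall and among non-empty patches, since the runs differ in how many empty patches they produced. This is a minimal sketch using only the numbers posted above; the `resolve_rates` helper and the dictionary layout are made up for illustration.

```python
# Numbers copied from the two summaries above.
runs = {
    "this PR branch": {"submitted": 100, "empty": 5, "resolved": 48},
    "best from another PR": {"submitted": 100, "empty": 14, "resolved": 43},
}


def resolve_rates(run):
    """Return (overall rate, rate among instances with a non-empty patch)."""
    overall = run["resolved"] / run["submitted"]
    non_empty = run["submitted"] - run["empty"]
    return overall, run["resolved"] / non_empty


for name, run in runs.items():
    overall, among_non_empty = resolve_rates(run)
    print(f"{name}: {overall:.1%} overall, {among_non_empty:.1%} of non-empty patches")
```

The second number only normalizes for empty patches; it says nothing about why the patches were empty.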
@enyst are you running with max iteration of 100 or 30?
30: > claude-3-5-sonnet-20241022_maxiter_30_N_v0.20.0-no-hint-run_1
Remove the NOT FOR MERGE before merging? :)
Oh, indeed Django made the difference! It's by far the largest repo: (x = branch, y = main)
django:
Difference: 8 instances!
X resolved but Y failed: (12 instances)
['django__django-11066', 'django__django-11179', 'django__django-11265', 'django__django-11276', 'django__django-12155', 'django__django-12262', 'django__django-12276', 'django__django-12304', 'django__django-12708', 'django__django-12858', 'django__django-13028', 'django__django-13112']
Y resolved but X failed: (4 instances)
['django__django-11815', 'django__django-12039', 'django__django-12273', 'django__django-13033']
astropy:
Difference: 3 instances!
X resolved but Y failed: (4 instances)
['astropy__astropy-12907', 'astropy__astropy-14096', 'astropy__astropy-14539', 'astropy__astropy-14995']
Y resolved but X failed: (1 instances)
['astropy__astropy-14365']
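A per-repo breakdown like the one above can be produced with plain set differences over the resolved instance IDs of the two runs. The sketch below is an assumption about how such a comparison could be written, not the script that generated the output; `compare_resolved` is a hypothetical name, and the example sets contain only the astropy instances listed above (instances resolved by both runs are omitted, since they cancel out).

```python
from collections import defaultdict


def compare_resolved(x_resolved, y_resolved):
    """Group instances resolved by only one run by repo (prefix before '__')."""
    x_only = defaultdict(list)
    y_only = defaultdict(list)
    for instance_id in sorted(set(x_resolved) - set(y_resolved)):
        x_only[instance_id.split("__")[0]].append(instance_id)
    for instance_id in sorted(set(y_resolved) - set(x_resolved)):
        y_only[instance_id.split("__")[0]].append(instance_id)

    report = {}
    for repo in sorted(set(x_only) | set(y_only)):
        report[repo] = {
            "x_only": x_only[repo],
            "y_only": y_only[repo],
            "difference": len(x_only[repo]) - len(y_only[repo]),
        }
    return report


# x = branch, y = main; only the astropy instances listed above.
x = {"astropy__astropy-12907", "astropy__astropy-14096",
     "astropy__astropy-14539", "astropy__astropy-14995"}
y = {"astropy__astropy-14365"}

for repo, row in compare_resolved(x, y).items():
    print(f"{repo}: difference {row['difference']}")
    print(f"  X resolved but Y failed: ({len(row['x_only'])}) {row['x_only']}")
    print(f"  Y resolved but X failed: ({len(row['y_only'])}) {row['y_only']}")
```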
Very weird... after merging this into one of my branches and running a full SWE-Bench Verified eval (compared to our previous 53% run), a lot of django instances actually failed :(
I suspect it is because "view" only goes up to two levels deep, and at the second level it doesn't show the agent which entries are folders that can be expanded further.
I'd suggest we can probably show the type of file/folder in the output of view command:
/workspace/django__django__3.0/django/middleware # folder:
/workspace/django__django__3.0/django/shortcuts.py # file
/workspace/django__django__3.0/django/template/ # folder: X files under this directory
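To make the suggestion concrete, here is a minimal sketch of a two-level listing that tags each entry as a file or folder. It is not the actual view implementation in file_editor: `annotated_view` is a hypothetical function, the exact annotation text is an assumption, and it counts all non-hidden entries directly inside a folder rather than only files.

```python
import tempfile
from pathlib import Path


def annotated_view(root, max_depth=2):
    """List non-hidden entries up to max_depth levels, tagging files and folders."""
    root = Path(root)
    lines = [str(root)]

    def walk(directory, depth):
        for entry in sorted(directory.iterdir(), key=lambda p: p.name):
            if entry.name.startswith("."):
                continue
            if entry.is_dir():
                # Count visible direct children so the agent knows it can expand.
                n = sum(1 for c in entry.iterdir() if not c.name.startswith("."))
                lines.append(f"{entry}/  # folder: {n} items under this directory")
                if depth < max_depth:
                    walk(entry, depth + 1)
            else:
                lines.append(f"{entry}  # file")

    walk(root, 1)
    return lines


# Toy tree (an assumption for the demo): a/x.py, a/deep/y.py, b.py
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a" / "deep").mkdir(parents=True)
    (Path(tmp) / "a" / "x.py").touch()
    (Path(tmp) / "a" / "deep" / "y.py").touch()
    (Path(tmp) / "b.py").touch()
    lines = annotated_view(tmp)

print("\n".join(lines))
```

With the count shown next to each folder, a second-level directory such as `a/deep/` is visibly expandable even though its contents are not listed.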
That is very weird, it doesn't list a directory? How exactly does it get confused? I would love to look into the llm_completions of the failed instances.
The closest I've seen in the previous run looked OK actually, when the LLM needed more depth it did something like this:
let's explore ... view /workspace/django
I see our problem is in <subdirectory>, so let's explore it ... view /workspace/django/subdirectory
Now I understand what happens.