OpenHands
[Bug]: Agent does not crawl through all files
Is there an existing issue for the same bug?
- [x] I have checked the existing issues.
Describe the bug and reproduction steps
[This is not a bug, rather a general query!]
I have a use case which I am trying to solve using the OpenHands resolver. It is a modernisation journey where multiple repositories that need modernisation are modernised using OH. By modernisation I mean:
- Search for a specific CSS property
- Replace it with a (predefined) CSS variable
To make this happen, three things are crucial:
- All the specified files are visited and analysed
- All instances are replaced in all files
- Context (The usage of css within code) is important.
While this works great for a couple of files, these are my observations:
- Agent/LLM fails to crawl through all the files and make changes
- Agent/LLM does not analyse all instances of hardcoding within a file.
- Cross-file usage checks do not happen at all.
This makes me wonder whether this is even the right use case to solve with agents. I have tried playing around with the prompt, adding explicit instructions like "analyse ALL files", "check ALL instances", etc., but the behaviour has still been unpredictable.
Looking for
- Some recommendations to fix this issue
- Examples I can take inspiration from
Thanks !
OpenHands Installation
Docker command in README
OpenHands Version
0.23
Operating System
MacOS
Logs, Errors, Screenshots, and Additional Context
No response
Hi @Vikki123, I have encountered this "lazy Claude" behavior myself as well, it's a bit frustrating :) My suggestion would be:
Please make a checklist of all files that need to be checked.
Read through all the files, updating the checklist each time you finish checking a file.
When you are done, revisit the checklist and all the modified files, and make sure that you have actually completed all of them.
Thanks @neubig for the suggestion. Follow-up! What do you mean when you said
> Read through all the files, updating the checklist each time you finish checking a file.
Is that the prompt to the LLM, or are you suggesting some code modification on the resolver side for this?
Those three lines are the prompt that I would provide to the LLM.
I was actually thinking of creating a bunch of new tools like
```python
def getNextIssue(self):
    # Returns (fileName, lineNumber) of the next issue
    ...

def isIssueFinished(self, lineNumber, fileName):
    # Marks the issue as resolved and writes changes to file
    ...

def hasCompletedProcessingFile(self):
    # Returns True if all issues in the current file are resolved
    ...

def getContext(self, fileName, lineNumber):
    # Extracts and returns the relevant code snippet
    ...

def allDone(self):
    # Returns True once all files have been fully processed
    ...
```
etc., which could be accessed by the LLM to get step-by-step inputs. Do you think it's overkill?
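Concretely, the state behind those tools could be quite small. A minimal sketch, purely hypothetical (in-memory only; the real tools would also need to apply edits to disk):

```python
# Hypothetical sketch (names illustrative, not an OpenHands API): an
# in-memory tracker the tool functions above could wrap, so progress
# tracking lives outside the LLM's context window.
class IssueChecklist:
    def __init__(self, issues):
        # issues: dict mapping fileName -> iterable of lineNumbers to fix
        self.pending = {f: set(lines) for f, lines in issues.items()}

    def getNextIssue(self):
        # (fileName, lineNumber) of the next unresolved issue, or None.
        for fileName in sorted(self.pending):
            if self.pending[fileName]:
                return fileName, min(self.pending[fileName])
        return None

    def isIssueFinished(self, lineNumber, fileName):
        # Mark one issue as resolved.
        self.pending.get(fileName, set()).discard(lineNumber)

    def hasCompletedProcessingFile(self, fileName):
        # True once every issue in the given file is resolved.
        return not self.pending.get(fileName)

    def allDone(self):
        # True once every issue in every file is resolved.
        return all(not lines for lines in self.pending.values())


tracker = IssueChecklist({"src/app.css": [12, 40]})
assert tracker.getNextIssue() == ("src/app.css", 12)
```

The appeal of this design is that completion state is maintained deterministically outside the model, so it cannot "forget" unfinished files the way a purely prompt-based checklist can.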
> Those three lines are the prompt that I would provide to the LLM.
Got it, let me do a quick check. But in general, are the agents capable of traversing the whole repo? Also, can they efficiently handle cross-file usage analysis?
@Vikki123 I do something similar to what @neubig recommended. I usually do it in a couple of prompts: ask it to make the checklist, confirm it did it correctly, then tell it to do the task, check off the list after each task is done, and report back afterwards. It doesn't always work, especially if the checklist has groups of tasks; then it will finish a group and message me. What seems to work pretty well is to ask it to continue exactly the same way, without messaging you, until it completes them all.
Thanks @amirshawn. I did try the way @neubig mentioned and it did a good job for 15+ files. But for a big repo I see that it says completed even when the processing is not complete! Do you have an example prompt which I can refer to?
It also makes me wonder whether the history of messages being sent to the LLM might be causing the problem.
@amirshawn @neubig I have the prompt below for reference, where the issue still persists. If you have any suggestions from your prior experience, it would be of great help, thanks!
============================
Please fix the following issue for the repository in /workspace. An environment has been set up for you to start working. You may assume all necessary tools are installed.
Please make a checklist of all files and all line numbers within the file that need to be checked. Read through all the files and lines, updating the checklist each time you finish checking a file. When you are done, revisit the checklist and all the modified files and lines, and make sure that you have actually completed all of them.
CONTINUE exactly the same way WITHOUT messaging me until you complete them ALL.
# Problem Statement
You are a code refactoring agent tasked with improving the consistency and maintainability of a codebase by replacing hardcoded CSS values with design system tokens. Your analysis is limited to .css, .scss, .jsx, .tsx, .js, and .ts files. Exclude test files, mock files, and repository building blocks such as tsconfig, eslint, and files listed in .gitignore.
Crucial First Step: Token File Acquisition (Mandatory)
- Download Token File: The agent must download the CSS token file from "https://url.css" before attempting any code analysis or refactoring. This step is absolutely mandatory. If the download fails for any reason, the agent must terminate the process and report the error.
- Use Downloaded Tokens Only: The agent must use only the tokens from this downloaded file for refactoring. It must not attempt to create its own tokens or use any other token source.
- Cache: The agent should cache the downloaded token file as described in the previous prompt.
Crucial instruction: Files to be analysed
The agent MUST analyse only the files and lines below to perform the task mentioned in the next section. Make sure to analyse all the files. You CANNOT miss any file or line.
```
{
  fileName: [lineNumbers]
}
```
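For illustration only, a concrete instance of this mapping (paths hypothetical) would look like:

```json
{
  "src/components/Button.css": [12, 45, 78],
  "src/pages/Home.scss": [3, 91]
}
```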
Task: Replace Hardcoded CSS Values with Tokens
Search the files for hardcoded CSS values used with the following CSS properties.
border-radius
border-top-left-radius
border-top-right-radius
border-bottom-left-radius
background
border
border-top
border-right
border-bottom
border-left
text-shadow
Important note (Mandatory): Make sure to follow the next set of instructions for the COMPLETE FILE until there is NO hardcoded value remaining in the file for the properties below.
Instructions for Replacing Hardcoded Values:
For each instance of a hardcoded value found for the CSS properties listed above, perform the following steps:
- Analyze the Context: Determine the purpose and context of the CSS property usage. Consider the element the style is applied to, its surrounding elements, class names, IDs, and its usage across the codebase. This context is crucial for selecting the correct token. For example, a background color applied to a class named `page-container` likely has a different purpose than a background color applied to a button.
- Find the Closest Matching Token: Use only the tokens defined in the downloaded CSS token file. Search for a token that best matches the context and the normalized hardcoded value. The token names follow a structured convention (Element-Prominence-Purpose-Attribute-State). Prioritize tokens that closely align with the element, prominence, purpose, attribute, and state of the hardcoded value's usage in code (including other files). If an exact match isn't found, select the closest available match based on context, value, and code usage.
- Replace the Hardcoded Value: Replace the hardcoded CSS value with the chosen CSS variable (token). Ensure the syntax is correct (e.g., `background-color: var(--color-page-background-primary);`). Preserve any existing whitespace or formatting in the surrounding code as much as possible.
Example:
```css
.container {
  background-color: #ffffff;
}
```
In this example, the hardcoded value is `#ffffff`. Search the CSS variable list for the nearest value. The nearest value might be `--color-container-background-primary: #ffffff;`. So after replacement:
```css
.container {
  background-color: var(--color-container-background-primary);
}
```
DO NOT ATTEMPT REPLACEMENT IF THE VALUE IS ALREADY A CSS VARIABLE FROM THE LIST.
Flexible Matching: Do not rely on verbatim string matching and replacement.
Processing Strategy:
- File-Level Chunking: Process the codebase file by file.
- Complete Coverage: The agent must process all eligible files (those mentioned in the Files to be analysed section). Do not stop processing until all files have been analyzed and refactored as needed.
- Context is Paramount: The agent must carefully analyze the context of each CSS property usage to select the best replacements.
> etc., which could be accessed by the LLM to get step-by-step inputs. Do you think it's overkill?
You could try! I'm not optimistic about that, but it might be only me: the problem seems to be that it doesn't keep track of its plan, which this doesn't seem to solve? Also, I don't know, IMHO many tools could, after some point, confuse it a bit. But I could be wrong! LLMs don't always follow our intuitions about what should work.
You use Sonnet?
Just a few thoughts:
- I had a surprise with Gemini 2.0 Flash Thinking: it was able to make plans for itself and follow up on them. I've been a bit impressed by how it followed the execution flow file by file, on a large repo, to track variables, BUT the files needed weren't as many as you suggest, so who knows.
- R1 might be better, but I haven't tried it on this kind of scenario
- What if you also tell the LLM to make itself a .md file to track its status? To keep the list / plan in the file, and add or remove a file when it's done with it? Maybe, if the file is correct, it could restart from there even if it finishes prematurely. For example, such a status file might look like this (file names purely illustrative):
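```markdown
# Refactoring plan / status
- [x] src/components/Button.css (3 values replaced)
- [x] src/components/Card.scss (1 value replaced)
- [ ] src/pages/Home.scss
- [ ] src/pages/Settings.tsx
```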
Thanks @enyst!
> which this doesn't seem to solve?
The idea is to not even let the LLM keep a checklist; rather, we maintain the checklist and provide it with methods to interact with it (get, update, isFinished, etc.). Let me try this out and update here if it makes any difference.
> You use Sonnet?
I have tried with Sonnet and gpt-4o. Let me try with Gemini as you suggested.
> What if you also tell the LLM to make itself a .md file to track its status?
Let me try this and update here.
Are you sending that all as one prompt? I would split it up. I would ask it to create a checklist file with all the steps, and tell it the file name and the location. Then I would work with it to create instruction files for each step. Then I'd ask it to open that step, read the whole file, and follow it exactly; once it's complete, read it over to confirm everything has been done, and when you're sure, check off the checklist and move on to the next step. It sometimes takes a few prompts to get it to not stop in between steps. A lot of times it will make it through a step and then ask if you'd like it to continue, at which point I say something like: continue exactly how you just did, in sequence through all the steps, in silence; once you complete all steps, give me a progress report. It seems to stay on task pretty well when I do something like this. It really depends on how much data you are dealing with. If it's large files, you will lose context very quickly. If it's lots of small files, then it can stay on task for hours.
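For example (structure and names just illustrative), the kind of setup being described might look like:

```text
workspace/
├── CHECKLIST.md                    # one checkbox per step
└── steps/
    ├── step-01-download-tokens.md  # full instructions for step 1
    ├── step-02-refactor-css.md
    └── step-03-verify.md
```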
@amirshawn yes. Just to give some context, I am just using the OH resolver with the instructions mentioned here. I am not sure how to pass multiple prompts to the resolver, though. I read through the code, and I am guessing I need to make changes here and invoke run_controller with multiple sub-prompts. Is that right, or is there a better way of doing this?
This worked like a charm. Thank you! Closing the issue for now.
> What if you also tell the LLM to make itself a .md file to track its status? To keep the list / plan in the file, and add or remove a file when it's done with it? Maybe, if the file is correct, it could restart from there even if it finishes prematurely.
https://github.com/All-Hands-AI/OpenHands/issues/6713#issuecomment-2661008472
Thank you for the follow-up! I also hit a point where I find myself doing the same quite a bit 😅
Maybe we can integrate that checklist somehow into microagents of type 'TASK', which have been started in the codebase but are not actually implemented / usable yet.