[Experimental] Integrate repomap
End-user friendly description of the problem this fixes or functionality that this introduces
- [ ] Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below
Give a summary of what the PR does, explaining any non-trivial design decisions
This PR is to:
- [x] Integrate aider's RepoMap implementation, at max 3072 tokens. I put it into a separate repo here to make it easier to play around, experiment and test more thoroughly.
- [ ] Run evaluation to see how it affects
swe-benchscore compared to baseline (@xingyaoww not sure if we can use the nightly evaluation pipeline for this, to try on a small subset, my api key easily reaches token rate limit with 1 worker only so it's a bit slow to do on my side)
When adding repomap, there're 2 phases to execute for every iteration in the agentic loop:
- first is to get repomap content from the repo, which I'm trying to implement as an agentskill
- second is to put the content inside the message history, before the environment reminder
I played around with it for a bit and my first impression is the personalization for mentioned files/identifiers still doesn't work really well yet. I'm trying to do a bit of of tuning on the ranking, and perhaps combining with static ranking instead of the built-in pagerank support. If you have some ideas/alternatives regarding this, it'd be great to discuss here. Also I'm not sure if there're any relevant benchmark so we can test various strategies of localization. (Cc: @neubig if you have some ideas and suggestions to share on these)
Some notes:
- Aider has a concept of "files added to chat", which can help the agent to focus on particular files in a session. These files are weighted more in the pagerank score calculation, while the current integration only parses exact relative filepaths users mention from message history. If there're no files/identifiers (e.g. class names, method names) mentioned, the repomap will likely retrieve irrelevant context.
Link of any specific issues this addresses
Fixes https://github.com/All-Hands-AI/OpenHands/issues/2185
Cc: @enyst if you can have a look and suggest some better ways to do integration.
PR Reviewer Comments
After analyzing the pull request, here are my findings:
Overall Feedback:
The integration of the RepoMap functionality appears to be a significant enhancement to the project, allowing for better context retrieval based on user interactions. However, there are several areas where the code can be improved for better clarity, performance, and error handling.
Score: 85/100
Labels: Enhancement, Bug fix
Code Suggestions:
-
File:
openhands/agenthub/codeact_agent/codeact_agent.py- Suggestion Content: Ensure that the
actionvariable is always initialized before being returned to avoid potential reference errors. - Relevant Line:
+ action = IPythonRunCellAction(...) - Existing Code:
action = IPythonRunCellAction( code=get_repomap_code, thought='', kernel_init_code='from agentskills import *', ) - Improved Code:
action = IPythonRunCellAction( code=get_repomap_code, thought='', kernel_init_code='from agentskills import *', ) if action is not None else DefaultAction()
- Suggestion Content: Ensure that the
-
File:
openhands/runtime/plugins/agent_skills/file_ops/file_ops.py- Suggestion Content: Consider adding error handling in the
get_repomapfunction to manage potential issues when accessing the repository or generating the map. - Relevant Line:
+ map_result = REPO_MAP.get_map(message_history=message_history) - Existing Code:
map_result = REPO_MAP.get_map(message_history=message_history) - Improved Code:
try: map_result = REPO_MAP.get_map(message_history=message_history) except Exception as e: print(f"Error retrieving RepoMap: {e}") return None
- Suggestion Content: Consider adding error handling in the
-
File:
openhands/runtime/plugins/agent_skills/file_ops/file_ops.py- Suggestion Content: Use a more descriptive name for the
get_repomapfunction to clarify its purpose. - Relevant Line:
+def get_repomap(repo_path: str = './', message_history='') -> None: - Existing Code:
def get_repomap(repo_path: str = './', message_history='') -> None: - Improved Code:
def retrieve_repo_map(repo_path: str = './', message_history='') -> None:
- Suggestion Content: Use a more descriptive name for the
The integration of RepoMap is a valuable enhancement! Here are some suggestions:
- Ensure the
actionvariable is always initialized before being returned to avoid potential reference errors. - Add error handling in
get_repomapto manage potential issues when accessing the repository. - Consider renaming
get_repomaptoretrieve_repo_mapfor clarity.
This is awesome! I especially like the fact you made it a separate repo - maybe we can simply install it inside runtime and make it available to the agent via Bash 😀
Regarding better ways to use these - we could also introduce some RAG mechanism somewhere, so we can rerank files and display them based on previous user instruction / histories.
I'm trying to get https://github.com/All-Hands-AI/OpenHands/pull/4537 in first (so we have a really strong baseline for a capable LLM). Once that's finished- I'd be happy to help here to run eval ;)
Yep sounds good! Thanks for the idea, I'll have a closer look.
Let's close this one for now, since we will play around with this in the https://github.com/All-Hands-AI/openhands-aci repo.
RepoMap is now in openhands-aci, which is included in OH.
But does OH actually use/query the RepoMap to maintain existing software?
I always see OH running poor man's grep commands to find relevant code.
This critical TreeSitter feature seems a little abandoned. Can you please update?