Ryan H. Tran
> Hey, thanks a bunch for this @ryanhoangt ! > > I browsed through the code, and I think it's implemented quite well. Personally I think the next step could...
> It might be in the paper(s), but I don't quite like that the prompts now talk of `agent`, while anywhere else it is `assistant`. 🤔 Makes sense, tho i...
@neubig Hi Prof., so far I've tested on a few (13) SWE-bench instances that are common to both `swe-bench-lite` and `swe-bench-verified`, using the same maximum of 30 turns: - CoAct can resolve 8/13...
After running the eval on a subset of 93 instances, it's quite disappointing to see that the performance is pretty bad for now 😢. CoAct resolved only 25/93 while CodeAct...
> From the outside, it is a bit surprising that it is not at least equal in performance, imo. It's quite unexpected to me as well. I will upload the...
@ketan1741 @tobitege I uploaded the trajectory to my visualization space [here](https://huggingface.co/spaces/ryanhoangt/evaluation?filepaths=outputs%2Fswe_bench_lite%2FCoActPlannerAgent%2Fclaude-3-5-sonnet%4020240620_maxiter_40_N_v1.0-no-hint%2Foutput.jsonl), since uploading an eval of only a subset to OpenHands's official space may confuse people. Maybe we can have a look to...
@enyst thanks for the quick fix and the insightful comments. Re the `django__django-10914` instance, it's interesting to know there are some unclear specs here, I don't know which behavior I should...
Sounds good, I can do the debugging in the next few days and try running a new eval to obtain the trajectory of the executor. In the previous eval, bugs such...
> I think it's visible when you look at the trajectories linked above, I'm looking now at the first of those 2, and step 9 is like: Re the json...
Yeah, that makes sense. I can try doing that in the next run.