
AgentTuning 7B evaluated on held-in (HH) tasks does not match the paper's results

Open Dhaizei opened this issue 2 years ago • 17 comments

I tried https://huggingface.co/THUDM/agentlm-7b, but the result is far below 84% on alfworld-std. Is the uploaded model wrong?

Dhaizei avatar Nov 06 '23 07:11 Dhaizei

{ "total": 50, "validation": { "running": 0.0, "completed": 0.1, "agent context limit": 0.0, "agent validation failed": 0.0, "agent invalid action": 0.62, "task limit reached": 0.28, "unknown": 0.0, "task error": 0.0, "average_history_length": 62.22, "max_history_length": 91, "min_history_length": 20 }, "custom": { "overall": { "total": 50, "pass": 5, "wrong": 45, "success_rate": 0.1 } } }

Dhaizei avatar Nov 06 '23 08:11 Dhaizei

Your output suggests a mismatch in the evaluation setup. Please ensure that you're using the evaluation code from ./AgentBench.old as mentioned in the README, not the latest THUDM/AgentBench repo. Could you kindly provide your trajectories for a thorough review?

lr-tsinghua11 avatar Nov 06 '23 11:11 lr-tsinghua11

Yes, I was using the latest version. Where should I send the trajectory information?

Dhaizei avatar Nov 13 '23 03:11 Dhaizei

But I can reach 0.84 with GPT-4:

{ "total": 50, "validation": { "running": 0.0, "completed": 0.84, "agent context limit": 0.0, "agent validation failed": 0.0, "agent invalid action": 0.04, "task limit reached": 0.12, "unknown": 0.0, "task error": 0.0, "average_history_length": 50.56, "max_history_length": 91, "min_history_length": 21 }, "custom": { "overall": { "total": 50, "pass": 42, "wrong": 8, "success_rate": 0.84 } } }

Dhaizei avatar Nov 13 '23 03:11 Dhaizei

Here are my trajectories for a thorough review on the held-in (HH) tasks.
Link: https://pan.baidu.com/s/1Np291cysxDQDozzr4RiJDQ?pwd=1ijk (extraction code: 1ijk)

Dhaizei avatar Nov 13 '23 05:11 Dhaizei

As mentioned in https://github.com/THUDM/AgentTuning#held-in-tasks

> The 6 held-in tasks are selected from AgentBench. However, since AgentBench is still under active development, the results from the latest branch might not fully reproduce the results reported in the paper. The evaluation code of this project is located in ./AgentBench.old.

Please use the ./AgentBench.old directory for agent task evaluation.

lr-tsinghua11 avatar Nov 16 '23 10:11 lr-tsinghua11

But the score is far below even the latest AgentBench test, which is a bit unexpected. Please make sure the uploaded model is okay.

Dhaizei avatar Nov 17 '23 02:11 Dhaizei

How many epochs did you train for?

Dhaizei avatar Nov 17 '23 03:11 Dhaizei

> How many epochs did you train for?

The models are trained for 2k steps, batch size 64, sequence length 4096 with packing.
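Back-of-the-envelope, that configuration works out to roughly half a billion training tokens (assuming fully packed 4096-token sequences):

```python
steps, batch_size, seq_len = 2_000, 64, 4_096
print(f"{steps * batch_size * seq_len:,} tokens")  # 524,288,000 ~= 0.5B tokens
```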

Btlmd avatar Nov 19 '23 17:11 Btlmd

I used FastChat to fine-tune Llama 2, but the results were not ideal. Can fine-tuning with FastChat reproduce the results in the paper? Admittedly my batch size is small (only 2), but the improvement in task completion after fine-tuning is not significant. Do you have any suggestions? In addition, ChatGLM3-6B can reach 64% on the held-in tasks, which also demonstrates the effectiveness of AgentTuning.
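One thing worth checking in such a setup: with a per-device batch size of 2, the effective batch only reaches the 64 used above if GPU count and gradient accumulation make up the difference. A sketch of the arithmetic (GPU counts are placeholders):

```python
def effective_batch(per_device, n_gpus, grad_accum):
    """Effective batch size = per-device batch x GPUs x accumulation steps."""
    return per_device * n_gpus * grad_accum

# Accumulation steps needed to match a batch size of 64 with per-device batch 2:
for n_gpus in (1, 4, 8):
    grad_accum = 64 // (2 * n_gpus)
    assert effective_batch(2, n_gpus, grad_accum) == 64
    print(f"{n_gpus} GPU(s): gradient_accumulation_steps={grad_accum}")
```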

Dhaizei avatar Nov 20 '23 05:11 Dhaizei

In addition, one of the AgentInstruct data points is invalid:

```json
{
  "conversations": [
    {
      "from": "human",
      "loss": false,
      "value": "'''\nContact\n'''\n\nBased on the HTML webpage above, try to complete the following task:\nTask: Schedule a demo drive for Model Y for Roy Adams with phone number 123-999-0000, email address [email protected] and zip code 90001 in the United States.\nPrevious actions:\n[link] Demo Drive -> CLICK\n[button] Model Y -> CLICK\n[textbox] Last Name -> TYPE: Adams\n[textbox] First Name -> TYPE: Roy\n[textbox] Phone Number -> TYPE: 123-999-0000\nWhat should be the next action? Please select from the following choices (If the correct action is not in the page above, please select A. 'None of the above'):\n\nA. None of the above\nB. \nC. \nD. \nE. \nF. Contact "
    },
    { "from": "gpt", "loss": true, "value": "" }
  ],
  "id": "mind2web_60"
}
```
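Records like this can be filtered out before training; a minimal sketch, assuming the data is loaded as a list of such records (the file name is a placeholder):

```python
import json

def is_valid(record):
    """Keep only records where every conversation turn has non-empty text."""
    return all(turn["value"].strip() for turn in record["conversations"])

with open("agentinstruct_mind2web.json") as f:  # placeholder file name
    records = json.load(f)

clean = [r for r in records if is_valid(r)]
print(f"kept {len(clean)}/{len(records)} records")
```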

Dhaizei avatar Nov 20 '23 08:11 Dhaizei

Since I achieved poor results after fine-tuning with FastChat, I intend to improve the model further by increasing the dataset size: adding the training data from the ALFWorld dataset and then evaluating, as sketched below. Can this approach be effective? Could you provide some advice?
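If that route is taken, the ALFWorld trajectories would need converting into the same conversation format as AgentInstruct, with loss only on the agent turns; a hypothetical sketch (`trajectory_to_record` and the sample turn are illustrative, not from the repo):

```python
def trajectory_to_record(task_id, turns):
    """Turn a list of (observation, action) pairs into a training record.

    Mirrors the AgentInstruct layout above: "loss" is false on the
    environment/human turns so only the agent's actions are trained on.
    """
    conversations = []
    for observation, action in turns:
        conversations.append({"from": "human", "loss": False, "value": observation})
        conversations.append({"from": "gpt", "loss": True, "value": action})
    return {"conversations": conversations, "id": task_id}

record = trajectory_to_record(
    "alfworld_train_0",
    [("You are in the middle of a room. Looking around, you see a shelf 1...",
      "THOUGHT: I should check shelf 1 first.\nACTION: go to shelf 1")],
)
```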

Dhaizei avatar Nov 21 '23 07:11 Dhaizei

Is the ALFWorld prompt "alfworld_multiturn_new.json" better than "alfworld_multiturn_react.json"?

Dhaizei avatar Nov 21 '23 07:11 Dhaizei

@Dhaizei Hello, I've also been looking at ALFWorld recently. Running agentlm-7b directly on the original ALFWorld (https://github.com/alfworld/alfworld), only 1 of the 134 eval tasks succeeded, which is far from the paper's results. I saw you asked the authors similar questions before; could I ask what results you managed to reproduce? Many thanks.

YSLIU627 avatar Jan 24 '25 15:01 YSLIU627

You probably need to check the prompts and adapt them. I haven't tested this in a long time, but I'd guess GLM or Qwen models would now do much better than the results they published back then. Many models now have strong reasoning and planning abilities, e.g., DeepSeek-R1.

Dhaizei avatar Jan 25 '25 03:01 Dhaizei

@Dhaizei I tested everything from Qwen2.5-7B-Instruct to the 7B R1-distill version; on the original ALFWorld, all 134 eval tasks failed (agentlm-7b at least got one right). Thanks, I'll go look at the prompts.

YSLIU627 avatar Jan 25 '25 03:01 YSLIU627

I looked at the interaction traces before; most failures were obvious reasoning mistakes. Having tried R1 and Qwen2.5, I'd expect these tasks to be easy wins 😂. Look at what the specific errors are, and record the interaction traces to see whether the failures come from the model itself or from an ill-suited prompt.
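A minimal sketch for logging each turn to a JSONL trace for later review (the file name and fields are placeholders):

```python
import json, time

def log_turn(path, task_id, role, text):
    """Append one interaction turn (env observation or agent action) to a JSONL trace."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(
            {"ts": time.time(), "task": task_id, "role": role, "text": text},
            ensure_ascii=False) + "\n")

log_turn("alfworld_traces.jsonl", "eval_task_0", "agent", "ACTION: go to shelf 1")
```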

Dhaizei avatar Jan 25 '25 03:01 Dhaizei