Could I do multi-thread evaluation?

Open Hodge931 opened this issue 1 year ago • 5 comments

To speed up the evaluation, I would like to evaluate, say, 64 examples in parallel with multiple threads. Would this affect the correctness of the evaluation? Thanks a lot!

Hodge931, Jul 21 '24 14:07

That may affect the results. The reason is that we deliberately designed the order of the examples so that earlier examples won't affect later ones.

This is the script for 4 parallel runs. You can also reset the environment more frequently to avoid inter-example influence.

shuyanzhou, Jul 22 '24 19:07

Thanks a lot for the reply!

  1. In my understanding, once the environment is reset, the evaluation of each example is correct. So I could set up two AWS instances and evaluate, say, examples 1-406 on instance 1 and examples 407-812 on instance 2. Would such an evaluation be correct?
  2. Sometimes errors occur midway. For example, if the evaluation of the 10th example crashes, could I just continue with the 11th example, without re-evaluating the first 10 examples and without resetting the environments?
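
To make the two scenarios above concrete, here is a minimal bookkeeping sketch in Python. It is not part of the WebArena codebase: `shard_ranges` and `remaining_tasks` are hypothetical helper names, the 0-based half-open index convention is an assumption, and the sketch says nothing about whether skipping a reset is safe, which is exactly the open question.

```python
def shard_ranges(n_tasks: int, n_workers: int) -> list[tuple[int, int]]:
    """Split task indices 0..n_tasks-1 into contiguous half-open ranges.

    Hypothetical helper: each (start, end) pair covers start..end-1, and
    the first `n_tasks % n_workers` shards get one extra task.
    """
    base, extra = divmod(n_tasks, n_workers)
    ranges = []
    start = 0
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def remaining_tasks(shard: tuple[int, int], finished: set[int]) -> list[int]:
    """List the task indices in a shard that have no recorded result yet."""
    start, end = shard
    return [i for i in range(start, end) if i not in finished]


# Two instances over 812 examples: indices 0-405 and 406-811,
# i.e. examples 1-406 and 407-812 when counted from 1.
print(shard_ranges(812, 2))

# Resuming a shard of 12 tasks after the first 9 finished and the 10th crashed.
print(remaining_tasks((0, 12), finished=set(range(9))))
```

Contiguous ranges keep the original example order within each instance, which matters if that order was chosen deliberately.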

Your kind suggestions are highly appreciated!

Hodge931, Jul 22 '24 21:07

Hello! Do you mind elaborating on how the earlier tasks are dependent on later tasks? Is there any way to launch separate sites for each new task that we're evaluating, so that we can run multiple agents at the same time? How often should the environment resets happen? Thanks for your help :)

dryingpaint, Aug 05 '24 23:08

Hello, do you have any advice on how to set up multiple Docker containers for the same website? For example, we could set up 10 shopping websites on different ports so that we can evaluate in parallel. Thank you!
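
The idea of one site replica per port can be sketched as a simple worker-to-URL mapping. This is only an illustration under stated assumptions: the host `localhost`, the base port `7770`, and the helper names `replica_urls` and `url_for_worker` are hypothetical, and starting the containers themselves is not shown.

```python
def replica_urls(host: str, base_port: int, n_replicas: int) -> list[str]:
    """Build one base URL per site replica, on consecutive ports.

    Assumes (hypothetically) that replica i is already listening on
    base_port + i; this function only computes the addresses.
    """
    return [f"http://{host}:{base_port + i}" for i in range(n_replicas)]


def url_for_worker(worker_id: int, urls: list[str]) -> str:
    """Pin each evaluation worker to a single replica."""
    return urls[worker_id % len(urls)]


# Ten shopping-site replicas on ports 7770-7779 (assumed values).
urls = replica_urls("localhost", 7770, 10)
for worker_id in range(3):
    print(worker_id, url_for_worker(worker_id, urls))
```

With as many replicas as workers, no two workers share a site, so one worker's actions cannot change the state another worker sees.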

junleiz, Aug 23 '24 13:08

@shuyanzhou Hi, if I use the parallel script you provided, can I be sure that there is no inter-example influence? Thanks for your clarification!

lrel7, Mar 08 '25 01:03