
Selfplay?

Open · akashsara opened this issue 2 years ago · 3 comments

Hi! So, given the changes to poke-env, the old way of doing self-play (examples/experimental-self-play.py) no longer works. I'm a little confused about how to do it now, since there's no way to accept a challenge in openai_api.py.

I essentially just want to be able to battle two agents inheriting from EnvPlayer against each other. If I understand @MatteoH2O1999's comments on #306 correctly, this isn't possible without tying it to a specific ML library. What would some generic boilerplate code for this look like?

akashsara · Aug 03 '22 21:08

Hey @akashsara, in general you should have something like

challenge_task = asyncio.ensure_future(
    player1.agent.accept_challenges(player2.username, N_CHALLENGES)
)
for _ in range(N_CHALLENGES):
    state = player1.reset()
    done = False
    while not done:
        action = model1.action(state)
        state, reward, done, info = player1.step(action)
await challenge_task

on one thread and

challenge_task = asyncio.ensure_future(
    player2.agent.send_challenges(player1.username, N_CHALLENGES)
)
for _ in range(N_CHALLENGES):
    state = player2.reset()
    done = False
    while not done:
        action = model2.action(state)
        state, reward, done, info = player2.step(action)
await challenge_task

on another.
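
To make that concrete, here is a rough sketch of one way to drive the two loops. It assumes each snippet above is wrapped in an async function (run_player_1 and run_player_2 here, both placeholder names, not part of poke-env) and gives each one its own thread and event loop, so the blocking reset()/step() calls of one player don't stall the other:

import asyncio
from threading import Thread

def run_in_thread(async_fn):
    # Each thread gets its own event loop to drive that player's coroutine.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    loop.run_until_complete(async_fn())
    loop.close()

# run_player_1 accepts challenges; run_player_2 sends them (placeholder names).
t1 = Thread(target=run_in_thread, args=(run_player_1,))
t2 = Thread(target=run_in_thread, args=(run_player_2,))
t1.start()
t2.start()
t1.join()
t2.join()

Depending on the poke-env version, the players may need to be created on the same event loop that later drives them, so treat this only as a starting point; the fuller example further down keeps the networking on the main thread instead.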

MatteoH2O1999 · Aug 04 '22 01:08

Hi,

Can you guys add an example with this? It would be very helpful for those of us who don't have much coding skill but would still like to enjoy making some agents :)

mancho2000 · Aug 05 '22 16:08

Thanks @MatteoH2O1999!

@mancho2000 this is what I ended up using for my implementation of self-play. Note that I coded my own implementation of DQN/PPO, so those bits might not match 1:1 with an existing library. It's also a bit different from what Matteo sent above, since I wanted to train for a set number of steps rather than a set number of battles, but it shouldn't be too much work to change if you want to:

import asyncio
from threading import Thread

async def battle_handler(player1, player2, num_challenges):
    await asyncio.gather(
        player1.agent.accept_challenges(player2.username, num_challenges),
        player2.agent.send_challenges(player1.username, num_challenges),
    )

def training_function(player, model, model_kwargs):
    # Fit (train) model as necessary.
    model.fit(player, **model_kwargs)
    player.done_training = True
    # Play out the remaining battles so both fit() functions complete
    # We use 99 to give the agent an invalid option so it's forced
    # to take a random legal action
    while player.current_battle and not player.current_battle.finished:
        _ = player.step(99)

if __name__ == "__main__":
    ...
    player1 = SimpleRLPlayer(
        battle_format="gen8randombattle",
        log_level=30,
        opponent="placeholder",
        start_challenging=False,
    )
    player2 = SimpleRLPlayer(
        battle_format="gen8randombattle",
        log_level=30,
        opponent="placeholder",
        start_challenging=False,
    )
    ...
    # Self-Play bits
    player1.done_training = False
    player2.done_training = False
    # Get event loop
    loop = asyncio.get_event_loop()
    # Make two threads: one per player and each runs model.fit()
    t1 = Thread(target=lambda: training_function(player1, ppo, p1_env_kwargs))
    t1.start()

    t2 = Thread(target=lambda: training_function(player2, ppo, p2_env_kwargs))
    t2.start()
    # On the network side, keep sending & accepting battles
    while not player1.done_training or not player2.done_training:
        loop.run_until_complete(battle_handler(player1, player2, 1))
    # Wait for thread completion
    t1.join()
    t2.join()

    player1.close(purge=False)
    player2.close(purge=False)
    ...

akashsara · Aug 10 '22 20:08