Q-learning is a loser?
Hello, I'm trying to create a strong Mancala bot. I chose Q-learning:

```python
# Let's do independent Q-learning in Mancala, and play it against random.
# RL setup is based on python/examples/independent_tabular_qlearning.py.
from open_spiel.python import rl_environment
from open_spiel.python import rl_tools
from open_spiel.python.algorithms import tabular_qlearner

# Create the environment.
env = rl_environment.Environment("mancala")
num_players = env.num_players
num_actions = env.action_spec()["num_actions"]

# Create the agents.
agents = [
    tabular_qlearner.QLearner(player_id=idx, num_actions=num_actions)
    for idx in range(num_players)
]

# Train the Q-learning agents in self-play.
for cur_episode in range(100000):
    if cur_episode % 1000 == 0:
        print(f"Episodes: {cur_episode}")
    time_step = env.reset()
    while not time_step.last():
        player_id = time_step.observations["current_player"]
        agent_output = agents[player_id].step(time_step)
        time_step = env.step([agent_output.action])
    # Episode is over, step all agents with final info state.
    for agent in agents:
        agent.step(time_step)
```
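As an aside, `rl_tools` is imported above but never used. It provides exploration schedules, so a decaying epsilon could be wired into the learners; this is only a sketch, assuming `QLearner` accepts an `epsilon_schedule` argument and `rl_tools.LinearSchedule` exists as in current OpenSpiel:

```python
# Sketch only: anneal exploration from 0.5 down to 0.05 over the 100000
# training episodes, instead of the default constant epsilon. The names
# epsilon_schedule and LinearSchedule are assumed from current OpenSpiel.
agents = [
    tabular_qlearner.QLearner(
        player_id=idx,
        num_actions=num_actions,
        epsilon_schedule=rl_tools.LinearSchedule(0.5, 0.05, 100000),
    )
    for idx in range(num_players)
]
```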
And then a game against a random agent:

```python
# Evaluate the Q-learning agent against a random agent.
from open_spiel.python.algorithms import random_agent

eval_agents = [
    agents[0],
    random_agent.RandomAgent(1, num_actions, "Entropy Master 2000"),
]

time_step = env.reset()
while not time_step.last():
    print("")
    print(env.get_state)
    player_id = time_step.observations["current_player"]
    # Note the evaluation flag. A Q-learner will set epsilon=0 here.
    agent_output = eval_agents[player_id].step(time_step, is_evaluation=True)
    print(f"Agent {player_id} chooses "
          f"{env.get_state.action_to_string(agent_output.action)}")
    time_step = env.step([agent_output.action])

print("")
print(env.get_state)
print(time_step.rewards)
```
What really surprised me is that the TRAINED agent loses to a random agent. How come? Could anyone explain this to me, please?