Tensorflow-DeepMind-Atari-Deep-Q-Learner-2Player

Pong2Player environment usage

Open wweichn opened this issue 6 years ago • 12 comments

wweichn avatar Apr 14 '18 08:04 wweichn

Hey, sorry for the late reply. We are still working on this project, so it won't be complete for another month or so. I saw that you are interested in an API for the 2-player Pong game. You can actually look at the API of Xitari2Player, which I ported over into Python at https://github.com/choo8/Xitari2Player. The API calls there should let you give actions to both players in Pong.
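For a rough idea, a single two-player step might look like the sketch below. Only ale_getRewardA, ale_getRewardB and ale_isGameOver are the calls discussed later in this thread; the act call name is a placeholder, so please check the actual bindings in the Xitari2Player repository.

```python
# Hypothetical sketch of one two-player step with the Xitari2Player Python bindings.
# 'ale' is assumed to be an already-initialised Xitari2Player interface with the
# Pong2Player rom loaded; only the reward/game-over calls are from this thread.
action_a = 3   # one of player A's action codes
action_b = 23  # the corresponding code for player B

ale.ale_act(action_a, action_b)   # placeholder name: apply both players' actions
reward_a = ale.ale_getRewardA()   # reward for player A on this step
reward_b = ale.ale_getRewardB()   # reward for player B on this step
done = ale.ale_isGameOver()       # True once the current game has ended
```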

choo8 avatar Apr 17 '18 06:04 choo8

Hi, thanks for your response. I tried to run main_2.py on the training branch, and it runs fine if you edit the code in 'history.py': just change the else branch of the get function to return np.transpose(self.history, (1, 2, 0)).
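For anyone hitting the same issue, a minimal sketch of the edited get function is below. The surrounding class structure and the cnn_format attribute are assumptions based on the DQN codebases this project follows; only the np.transpose line is the actual change.

```python
import numpy as np

# inside history.py (class and attribute names assumed from similar DQN code)
def get(self):
    if self.cnn_format == 'NCHW':
        return self.history
    else:
        # the fix: move the history/channel axis last for NHWC-format input
        return np.transpose(self.history, (1, 2, 0))
```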

wweichn avatar Apr 19 '18 02:04 wweichn

Ok, let me know if you have more questions on getting the Pong2Player code running.

choo8 avatar Apr 19 '18 02:04 choo8

Thanks, I am trying to rewrite a version based on your work because I am not familiar with the use of ALE, etc. Also, I can't understand your definition of the reward in agent.py: def observe(self, screen, reward, action, terminal): reward = max(self.min_reward, min(self.max_reward, reward)). Do you mind explaining this to me? In my view, reward = reward would be okay.

wweichn avatar Apr 22 '18 13:04 wweichn

I believe this is useful if you want to perform clipping of rewards. You could also do reward = reward; that should work as well.
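As a standalone sketch, the line just clamps the raw reward into [min_reward, max_reward]; the default bounds of -1 and 1 here are an assumption based on the usual DQN reward-clipping setup:

```python
def clip_reward(reward, min_reward=-1.0, max_reward=1.0):
    """Clamp the raw game reward into [min_reward, max_reward] (DQN-style clipping)."""
    return max(min_reward, min(max_reward, reward))

clip_reward(5)    # -> 1.0
clip_reward(-3)   # -> -1.0
clip_reward(0.5)  # -> 0.5
```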

choo8 avatar Apr 23 '18 01:04 choo8

The reward is obtained via ale.ale_getRewardA() or ale.ale_getRewardB(). Can you tell me the range of the reward? I have read

Multiagent cooperation and competition with deep reinforcement learning

In this paper the reward is between -1 and 1. Thanks a lot.

wweichn avatar Apr 23 '18 01:04 wweichn

Yes, I believe it is within the range of -1 and 1. The values, which are determined by the ROM used, should be as described in the paper "Multiagent Cooperation and Competition with Deep Reinforcement Learning".

choo8 avatar Apr 23 '18 06:04 choo8

Sorry to bother you again. Why are the actions among [0, 1, 3, 4] for agent 1 and [20, 21, 23, 24] for agent 2? I know there are four possible actions [None, fire, up, down]. Is this defined in roms/Pong2Player025.bin? Thanks.

wweichn avatar Apr 24 '18 08:04 wweichn

This is actually defined by the Xitari2Player environment. You can see the full list of actions at https://github.com/choo8/Xitari2Player/blob/master/ale_interface.hpp. I only included the 4 relevant actions in the training script.
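Concretely, the training script just restricts each agent to its 4 Pong-relevant codes, roughly like the sketch below (variable names are illustrative; the exact enum names are in ale_interface.hpp):

```python
# Action codes as discussed in this thread; see ale_interface.hpp for the full enum.
AGENT1_ACTIONS = [0, 1, 3, 4]      # noop, fire, and the two paddle moves for player A
AGENT2_ACTIONS = [20, 21, 23, 24]  # the corresponding player-B codes
```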

choo8 avatar Apr 24 '18 18:04 choo8

Thanks. Also, where can I find the definition of ale.ale_isGameOver? I find it hard to reach a state where ale.ale_isGameOver is true. From my observation, is it right that one side scoring 20 points first means the game is over for that epoch?

wweichn avatar Apr 29 '18 12:04 wweichn

According to the paper, a game of Pong ends when 21 points are scored by either agent. Epochs are determined by the number of iterations, where 250,000 iterations equal one epoch. These hyperparameters are also the ones used in the original paper.
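To make the distinction concrete, a rough sketch of how games and epochs interact in the training loop is below; ale_isGameOver, ale_getRewardA and ale_getRewardB come from this thread, while the act/reset call names are placeholders for whatever the bindings actually expose:

```python
EPOCH_ITERS = 250000  # iterations per epoch, as in the original paper

action_a, action_b = 0, 20  # noop for both players, just for illustration
step = 0
while step < EPOCH_ITERS:
    if ale.ale_isGameOver():
        # a single game of Pong ends when either player reaches 21 points
        ale.ale_resetGame()          # placeholder name for the reset call
    ale.ale_act(action_a, action_b)  # placeholder name: one joint step
    reward_a = ale.ale_getRewardA()
    reward_b = ale.ale_getRewardB()
    step += 1
# many games fit into one epoch; the epoch boundary is just the step count
```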

choo8 avatar Apr 29 '18 17:04 choo8

Thanks a lot. I think there might be some mistakes in your code regarding actions. Taking agent 2 as an example, the legal actions are [20, 21, 23, 24], but the output of the network is in [0, 1, 2, 3], so it needs to be mapped from [0, 1, 2, 3] to [20, 21, 23, 24]; using [0, 1, 2, 3] as indices into [20, 21, 23, 24] works. Also, the exploration rate changes with agent.step, but at the beginning of a new epoch agent.step starts from 0 again; the exploration rate shouldn't restart from ep_start, it should keep decaying from the value reached at the last step of the previous epoch.
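A sketch of both fixes, assuming the usual linear epsilon-annealing schedule (ep_start, ep_end and the annealing length here are illustrative and may not match the actual hyperparameters):

```python
AGENT2_ACTIONS = [20, 21, 23, 24]

# 1) map the network's output index (0-3) onto the real Xitari2Player action code
q_values = [0.1, 0.7, 0.05, 0.15]               # e.g. the 4 Q-values for agent 2
network_output = q_values.index(max(q_values))  # argmax -> 1
action = AGENT2_ACTIONS[network_output]         # -> 21

# 2) anneal exploration with a global step that is never reset between epochs
def epsilon(global_step, ep_start=1.0, ep_end=0.1, anneal_steps=1000000):
    frac = min(global_step / float(anneal_steps), 1.0)
    return ep_start + frac * (ep_end - ep_start)

global_step = 0  # keep incrementing across epochs instead of restarting agent.step
```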

wweichn avatar May 02 '18 03:05 wweichn