Competition_3v3snakes

A bug in the training code breaks the greedy policy

Open · COMoER opened this issue · 1 comment

We found a bug in rl_trainer/main.py while using this repository's code to train our own model:

    # …… 58-66
    state_to_training = state[0]  # you define state_to_training here
    # …… 68-78
    while True:
        # …… 80-86
        actions = logits_greedy(state_to_training, logits, height, width)  # state_to_training is used here to generate the greedy policy
        # …… 87-90
        next_state, reward, done, _, info = env.step(env.encode(actions))
        next_state_to_training = next_state[0]  # a new variable next_state_to_training is created
        next_obs = get_observations(next_state_to_training, ctrl_agent_index, obs_dim, height, width)
        # …… 90-116
        model.replay_buffer.push(obs, logits, step_reward, next_obs, done)

        model.update()

        obs = next_obs
        step += 1
        # …… 123-146

state_to_training is defined above the episode loop, and during an episode the greedy policy uses state_to_training as its observation. However, the loop never assigns the updated state next_state_to_training back to state_to_training, so the greedy policy keeps observing the state from the very beginning of the episode. This does not affect the training of our own model, because get_observations is called with next_state_to_training, but it breaks the greedy opponent, which may end up playing worse than a random policy.
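To make the effect concrete, here is a minimal, self-contained sketch of the same pattern. It is not the repository's code: the counter "environment" and the greedy_action/env_step/rollout helpers are hypothetical stand-ins, chosen so that the greedy action is only correct when computed from the current state.

    # Toy setup: the state is a step counter; the "greedy" action is correct
    # only when it is computed from the current state.
    def greedy_action(state):
        return state % 4

    def env_step(state, action):
        reward = 1.0 if action == state % 4 else 0.0
        return state + 1, reward  # next_state, reward

    def rollout(update_state_to_training):
        state = 0
        state_to_training = state  # defined once before the loop, as in main.py
        total_reward = 0.0
        for _ in range(20):
            action = greedy_action(state_to_training)
            state, reward = env_step(state, action)
            total_reward += reward
            if update_state_to_training:
                state_to_training = state  # the one-line fix
        return total_reward

    print("stale observation:", rollout(False))  # 5.0  (chance level, 1 in 4)
    print("fresh observation:", rollout(True))   # 20.0 (always correct)

The stale variant keeps acting on the initial state and scores no better than random guessing, which is exactly what happens to the greedy-controlled snakes in main.py: logits_greedy keeps receiving the episode's initial board.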

So state_to_training should also be updated whenever obs is updated. The proposed fix is as follows:

    # …… 58-66
    state_to_training = state[0]  # you define state_to_training here
    # …… 68-78
    while True:
        # …… 80-86
        actions = logits_greedy(state_to_training, logits, height, width)  # state_to_training is used here to generate the greedy policy
        # …… 87-90
        next_state, reward, done, _, info = env.step(env.encode(actions))
        next_state_to_training = next_state[0]  # a new variable next_state_to_training is created
        next_obs = get_observations(next_state_to_training, ctrl_agent_index, obs_dim, height, width)
        # …… 90-116
        model.replay_buffer.push(obs, logits, step_reward, next_obs, done)

        model.update()

        obs = next_obs
        state_to_training = next_state_to_training  # the fix: keep the greedy policy's observation up to date
        step += 1
        # …… 123-146

COMoER · Dec 19, 2021

Hello there,

Thanks for raising this. Your proposed fix is indeed correct; feel free to open a pull request directly. ^^

Thanks again. Yutong

Yutongamber · Dec 21, 2021