
Two problems about the update function

Open YuanyeMa opened this issue 5 years ago • 5 comments

I have two questions about the update function after reading your code. Could anyone explain them to me? I would appreciate it very much. Firstly, I can't understand what role the variable "num_sample" plays when training the q network:

# train q network
num_sample = 1
target_q = 0.0
for i in range(num_sample):
    target_act_next_n = [agents[i].p_debug['target_act'](obs_next_n[i]) for i in range(self.n)]
    target_q_next = self.q_debug['target_q_values'](*(obs_next_n + target_act_next_n))
    target_q += rew + self.args.gamma * (1.0 - done) * target_q_next
target_q /= num_sample

Secondly, why is the loss of p defined as loss = pg_loss + p_reg * 1e-3, and what role does p_reg play in the loss?

YuanyeMa, May 30 '19 07:05

p_reg regularizes the p values (the raw outputs of the policy network) obtained from the following:
p = p_func(p_input, int(act_pdtype_n[p_index].param_shape()[0]), scope="p_func", num_units=num_units). Note that the same p values are also used to compute pg_loss.
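To make the role of the regularizer concrete, here is a minimal sketch in plain Python. It assumes that p_reg is the mean of the squared policy outputs and that pg_loss is the negative mean Q value; both are assumptions for illustration, not something confirmed in this thread. The function name policy_loss and its arguments are hypothetical.

```python
def mean(values):
    """Arithmetic mean of an iterable of floats."""
    values = list(values)
    return sum(values) / len(values)


def policy_loss(q_values, policy_outputs, reg_coef=1e-3):
    """Return (loss, pg_loss, p_reg) for one batch.

    q_values:       critic estimates for actions produced by the policy
    policy_outputs: flat list of raw policy-network outputs (the p values)
    reg_coef:       weight of the regularizer (1e-3 in the question)
    """
    # Policy-gradient term: maximizing Q is minimizing -Q.
    pg_loss = -mean(q_values)
    # ASSUMPTION: the regularizer is the mean squared policy output,
    # which penalizes large-magnitude outputs.
    p_reg = mean(x * x for x in policy_outputs)
    return pg_loss + reg_coef * p_reg, pg_loss, p_reg


loss, pg_loss, p_reg = policy_loss([1.0, 3.0], [2.0, -2.0, 4.0, 0.0])
```

With the small coefficient, the regularizer barely changes the loss value, but its gradient still discourages the policy outputs from growing without bound.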

I don't yet know why they use num_sample, because setting it to 1 does not seem to do anything useful.

Ah31, Aug 06 '19 13:08

@kevin-y-ma @Ah31 Could you reproduce the results? Also, it seems num_sample = 1 is a bug? I think it should be num_sample = len(obs_next_n).

KK666-AI, Nov 14 '19 06:11

I think it's there to take the expectation of the Bellman target: you need to marginalize over the next actions when evaluating the next Q value, and averaging over num_sample draws approximates that expectation.
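A rough sketch of that idea, assuming the target policies are stochastic so that each call draws different next actions. The helper names (estimate_target_q, sample_next_actions, target_q_fn) and the toy critic are hypothetical and only mirror the structure of the loop in the question.

```python
import random


def estimate_target_q(rew, done, gamma, sample_next_actions, target_q_fn,
                      num_sample=1):
    """Monte Carlo estimate of rew + gamma * (1 - done) * E[Q'(next actions)]."""
    target_q = 0.0
    for _ in range(num_sample):
        # Each draw samples next actions from the (assumed stochastic) target policies.
        next_actions = sample_next_actions()
        target_q += rew + gamma * (1.0 - done) * target_q_fn(next_actions)
    return target_q / num_sample


rng = random.Random(0)


def sample_next_actions():
    # Toy stand-in for two agents' target policies: zero-mean Gaussian actions.
    return [rng.gauss(0.0, 1.0) for _ in range(2)]


def target_q_fn(actions):
    # Toy stand-in for the target critic.
    return sum(actions)


single = estimate_target_q(1.0, 0.0, 0.95, sample_next_actions, target_q_fn, num_sample=1)
averaged = estimate_target_q(1.0, 0.0, 0.95, sample_next_actions, target_q_fn, num_sample=1000)
terminal = estimate_target_q(1.0, 1.0, 0.95, sample_next_actions, target_q_fn, num_sample=5)
fixed = estimate_target_q(1.0, 0.0, 0.95, lambda: [1.0, 2.0], target_q_fn, num_sample=4)
```

With num_sample = 1 the estimate is a single noisy draw; larger values reduce the variance at the cost of more forward passes. If the target policy is deterministic, every draw is identical and num_sample has no effect on the result.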

Justin-Yuan, Jan 15 '20 07:01

> @kevin-y-ma @Ah31 Could you reproduce the results? Also, it seems num_sample = 1 is a bug? I think it should be num_sample = len(obs_next_n).

Sorry, I missed this when I first read it. Now I see the problem: in this code, two different variables share the name i (the loop index of the num_sample loop and the index inside the list comprehension over agents). num_sample = 1 means that only one sample is used.
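As a side note on the shared name: in Python 3 a list comprehension has its own scope, so the inner i does not overwrite the loop index. The sketch below uses a hypothetical function, count_outer_iterations, to illustrate that the reuse is confusing to read but does not change how many times the outer loop runs.

```python
def count_outer_iterations(num_sample, n_agents):
    """Mimic the structure of the loop in the question with two variables named i."""
    outer_seen = []
    inner = []
    for i in range(num_sample):
        # This i belongs to the comprehension's own scope (Python 3).
        inner = [i * 10 for i in range(n_agents)]
        # Here i is still the outer loop index.
        outer_seen.append(i)
    return outer_seen, inner


outer_seen, inner = count_outer_iterations(num_sample=3, n_agents=2)
```

Renaming the outer index (for example to `_`, since it is unused in the loop body) would remove the ambiguity without changing behavior.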

YuanBoXie, Mar 03 '22 12:03

> p_reg regularizes the p values obtained from the following: p = p_func(p_input, int(act_pdtype_n[p_index].param_shape()[0]), scope="p_func", num_units=num_units). Note that the same p values are also used to compute pg_loss.
>
> I don't yet know why they use num_sample, because setting it to 1 does not seem to do anything useful.

There are two variables, both named i.

YuanBoXie, Mar 03 '22 14:03