drl-5g-scheduler
small command issue
Hi there! I'm trying to reproduce your code and found a small issue in the offline training setup. Hope it's helpful. The command should be `PYTHONPATH=./ python3 ./sim_script_example/ka.py` instead of `PYTHONPATH=./ python3 ./sim_sript_example/ka.py`.
By the way, I find it a little confusing when the thread is involved, especially with `asynchronization=False`. Do you have any suggestions for debugging the program with breakpoints? Thank you very much.
Hi Hanghoo, the thread (or setting `asynchronization = True`) is only useful in online experiments. When it is set to False in the offline experiment, the algorithm behaves sequentially, repeating "generate one transition, save the transition, run one training update, generate the next transition, ...", which is achieved with mutex locks (this follows the algorithm flow in our paper).
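As an illustration only (a minimal sketch; `env`, `agent`, `replay`, and their methods are placeholders, not the repo's actual classes), the offline behaviour can be pictured as two threads taking strict turns, which collapses to the sequential "generate one transition, save it, run one update" loop described above:

```python
import threading

step_turn = threading.Event()   # env side may produce one transition
train_turn = threading.Event()  # trainer side may run one update
step_turn.set()                 # the env side goes first

def env_worker(env, agent, replay, n_steps):
    s = env.reset()
    for _ in range(n_steps):
        step_turn.wait()
        step_turn.clear()
        a = agent.act(s)
        s2, r, done = env.step(a)
        replay.add(s, a, r, s2, done)
        s = env.reset() if done else s2
        train_turn.set()        # hand the turn to the trainer

def train_worker(agent, replay, n_steps):
    for _ in range(n_steps):
        train_turn.wait()
        train_turn.clear()
        agent.update(replay.sample())
        step_turn.set()         # hand the turn back to the env side

# threading.Thread(target=env_worker, args=(env, agent, replay, 1000)).start()
# threading.Thread(target=train_worker, args=(agent, replay, 1000)).start()
```

In a sketch like this, a breakpoint in either worker effectively pauses the whole pipeline, because the other side blocks while waiting for its turn.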
Hi Zhouyou, thank you very much for your response. Yes, I found the mutex locks around `step()` and `sample()`. May I ask what `_per_w_multiplier()` does? Also, if you could give me some suggestions about the multi-head critic implementation, that would be very helpful. Thank you very much.
Hi hanghoo, `_per_w_multiplier()` adjusts the weight of each sample (each transition) according to the delay of each user's queue. This is how the importance sampling is implemented; you can find the math expressions in our paper. I am not sure whether queue delay is a state feature in your application, so it can be configured according to the state features you use. As for the multi-head critic, you can view it as several critics in parallel, where each user has one critic. You can find the expression in our paper as well.
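For anyone picturing the "several critics in parallel" idea, here is a minimal sketch in PyTorch. The shared torso and layer sizes are assumptions made for illustration, not the repo's exact architecture:

```python
import torch
import torch.nn as nn

class MultiHeadCritic(nn.Module):
    """A shared torso with one scalar Q-value head per user."""
    def __init__(self, state_dim, action_dim, n_users, hidden=256):
        super().__init__()
        self.torso = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_users)])

    def forward(self, state, action):
        h = self.torso(torch.cat([state, action], dim=-1))
        # one Q-value per user, shape [batch, n_users]
        return torch.cat([head(h) for head in self.heads], dim=-1)
```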
Hi Zhouyou, thank you very much for your detailed explanation. That's very helpful. I have read your paper.
On the other hand, I guess that `weights` in the line `l_critic = torch.mul(l_critic_per_batch, weights)` is the current importance-sampling weight, and that `_per_w_multiplier()` adjusts the importance-sampling weight for the following batch calculations, right? Thank you.
Yes, that's correct. Or, more precisely, `weights` in `l_critic = torch.mul(l_critic_per_batch, weights)` and `l_actor = torch.mul(l_actor_per_batch, weights)` corrects the bias caused by importance sampling, while `ret_per_e = to_numpy(l_critic); ret_per_e = ret_per_e * self._per_w_multiplier(batch)` sets the weight of each transition for the following iterations. Details can be found in our paper. The terms (or variable names) may not be well linked between the code and the paper.
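To make the two roles concrete, here is a hedged sketch of the critic-side update (helper names such as `replay.update_priorities` and the argument names are placeholders, not the repo's API): `weights` corrects the sampling bias when the loss is reduced, and the weighted per-sample loss scaled by the delay-dependent multiplier becomes each transition's weight for later sampling.

```python
import torch

def critic_update(l_critic_per_batch, weights, per_w_mult, replay, indices, optimizer):
    # bias correction: per-sample loss times the importance-sampling weights
    l_critic = torch.mul(l_critic_per_batch, weights)
    optimizer.zero_grad()
    l_critic.mean().backward()
    optimizer.step()

    # per-transition weight for the following iterations: weighted loss scaled
    # by the delay-dependent multiplier (the role _per_w_multiplier plays above)
    ret_per_e = l_critic.detach().cpu().numpy() * per_w_mult
    replay.update_priorities(indices, ret_per_e)   # placeholder replay API
```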
Thank you very much for your answers.
- Do you think the multi-head critic architecture can be transplanted to the auto-entropy SAC? If you can share any references about the multi-head critic, that would be very helpful.
- Comparing `class DDPG` and `class MultiHeadCriticDDPG_NEW_PER`, the difference is whether `_per_w_multiplier` is considered. Therefore, with the normal replay memory, is there any difference between single and multiple heads? Thank you very much.
Hi hanghoo. For 1, I have not used SAC before, so I do not know about it. For 2, there is no difference.
Hi @zhouyou-gu, thank you very much for all your help.
Hi @zhouyou-gu, a quick question. Is there any reference that supports the multi-head critic architecture?