on-policy
Shape of buffered log_probs
Hi,
I've found something odd, and I'd like to know whether I'm missing something or whether this is expected behavior.
In the buffers, you define action_log_probs with "act_shape" as its last dimension (https://github.com/marlbenchmark/on-policy/blob/d53c4902cf2c291c93ced2c42c621371982ca2eb/onpolicy/utils/shared_buffer.py#L79C9-L80C100). With continuous actions, this means the last dimension of action_log_probs equals the dimension of the action. But the actual log probability of an action is a single scalar: the model outputs one value per action when actions are evaluated, and storing that into an array of shape (ep_len, n_rollouts, n_agents, act_dim) broadcasts the single value across act_dim.
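To make the broadcast concrete, here is a minimal numpy sketch of what I believe happens (the shapes are hypothetical, chosen only for illustration): a single summed log-prob per action is assigned into a buffer whose last dimension is act_dim, so the scalar is simply replicated along that axis.

```python
import numpy as np

# Hypothetical shapes, for illustration only.
ep_len, n_rollouts, n_agents, act_dim = 3, 2, 2, 4

# One summed log-prob per action, as a diagonal-Gaussian policy returns
# after summing over action dimensions: shape (..., 1).
log_probs = np.random.randn(ep_len, n_rollouts, n_agents, 1)

# Buffer allocated with act_dim as the last dimension, as in shared_buffer.py.
buffer = np.zeros((ep_len, n_rollouts, n_agents, act_dim))
buffer[:] = log_probs  # numpy broadcasts the single value across act_dim

# Every slice along the last axis now holds the same value.
assert np.allclose(buffer, buffer[..., :1])
print(buffer.shape)  # (3, 2, 2, 4)
```

Since every entry along the last axis is identical, the extra dimension wastes memory but doesn't change the loss, which would explain why training is unaffected.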
Now, this doesn't actually cause any problem during training, so I guess the shape was chosen to fit the needs of other action spaces (multi-discrete, perhaps?). And I guess that, for continuous actions only, I could replace "act_shape" with "1" in the dimensions of action_log_probs in the buffer.
Have I understood this correctly, or is there something I'm missing?
Thank you!