
Shape of buffered log_probs

Open Maxtoq opened this issue 4 months ago • 0 comments

Hi,

I noticed something odd, and I'd like to know whether I'm missing something or whether this is intentional.

In the buffers, `action_log_probs` is defined with `act_shape` as its last dimension (https://github.com/marlbenchmark/on-policy/blob/d53c4902cf2c291c93ced2c42c621371982ca2eb/onpolicy/utils/shared_buffer.py#L79C9-L80C100). With continuous actions, this means the last dimension of `action_log_probs` equals the dimension of the action. But the log probability of an action is a single value: when actions are evaluated, the model outputs one value per action, and storing it in an array of shape `(ep_len, n_rollouts, n_agents, act_dim)` broadcasts that single value across `act_dim`.
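To illustrate what I mean, here is a minimal sketch (not the repo's actual code, just plain `torch.distributions`) of how a diagonal-Gaussian policy produces one scalar log-probability per action, and how assigning that scalar into a slot of width `act_dim` silently broadcasts it:

```python
import torch
from torch.distributions import Normal

act_dim = 3

# A diagonal Gaussian over a continuous action of dimension act_dim.
dist = Normal(torch.zeros(act_dim), torch.ones(act_dim))
action = dist.sample()  # shape (act_dim,)

# The joint log-probability of the action is the sum over the per-dimension
# log-probabilities: a single scalar, kept as shape (1,) here.
log_prob = dist.log_prob(action).sum(-1, keepdim=True)  # shape (1,)

# A buffer slot of width act_dim (as in the shared buffer's last dimension)
# broadcasts that single value across all act_dim entries on assignment.
buffer_slot = torch.zeros(act_dim)
buffer_slot[:] = log_prob  # the same scalar repeated act_dim times
```

So every entry along the last dimension of the buffer holds the same value, which is why training is unaffected but the extra width looks redundant for continuous actions.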

Now, this doesn't actually cause any problem during training, so I guess the shape was chosen to fit the needs of other action spaces (MultiDiscrete, perhaps?). And presumably, for continuous actions only, I could replace `act_shape` with `1` in the dimensions of `action_log_probs` in the buffer.

Have I understood this correctly? Or is there something I'm missing?

Thank you!

Maxtoq — Feb 19 '24