
Modifying Continuous Action Output with Softmax in MAPPO/HAPPO

Open georgewanglz2019 opened this issue 8 months ago • 8 comments

Hello,

Thank you for sharing your code, it has been incredibly useful!

I am currently trying to use your MAPPO or HAPPO implementations on my own tasks, where the actions are n-dimensional continuous vectors. These actions must sum to 1, with each component between 0 and 1 inclusive (i.e., they lie on the probability simplex).

To achieve this, I modified your continuous-action code by applying a softmax in the last layer. Specifically, in the forward function of act.py, I added one line after the action is sampled:

```python
actions = (
    action_distribution.mode()
    if deterministic
    else action_distribution.sample()
)
# Added line:
actions = torch.softmax(actions, dim=-1)
```

However, after training for several epochs, the network's learning goes wrong and it starts outputting large numbers of NaNs, which terminates the program. From what I found online, this might be gradient explosion or a related issue. Changing the activation function to tanh didn't help much; it only delayed the NaNs by a few more epochs.
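One likely contributor to the instability, beyond gradient explosion: squashing the action *after* sampling means the log-probability PPO uses still refers to the pre-softmax Gaussian sample, so the importance ratio in the PPO objective is computed for a different variable than the one the environment sees. A minimal sketch (standalone PyTorch, with a stand-in `Normal` head in place of the actual distribution in act.py) illustrating the mismatch:

```python
import torch

torch.manual_seed(0)

# Stand-in for the policy's Gaussian head (hypothetical; the real code
# builds this inside act.py's forward function).
dist = torch.distributions.Normal(torch.zeros(4), torch.ones(4))

raw = dist.sample()
actions = torch.softmax(raw, dim=-1)

# The squashed actions do satisfy the simplex constraint...
print(actions.sum())            # ~1.0
print((actions >= 0).all())     # True

# ...but the log-prob fed to the PPO ratio is still the density of the
# *pre-softmax* sample `raw`, with no change-of-variables correction
# for the softmax, so the surrogate objective is inconsistent.
print(dist.log_prob(raw).sum())
```

This mismatch alone will not immediately produce NaNs, but it makes the policy-gradient signal incorrect, which can push the Gaussian's parameters into regions where the loss blows up.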

Based on the above problem, I would like to seek your advice on how to modify the code to achieve the desired continuous actions. If you are interested, I would greatly appreciate your time and assistance or any ideas you could share.
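For anyone hitting the same problem: one common way to get simplex-constrained actions with a consistent log-probability is to replace the Gaussian head with a Dirichlet distribution, whose samples lie on the simplex by construction. A minimal standalone sketch (the variable `logits` is a hypothetical stand-in for the actor network's output; this is not the HARL authors' recommended fix):

```python
import torch
from torch.distributions import Dirichlet

# `logits` stands in for the actor's final-layer output.
logits = torch.zeros(4, requires_grad=True)

# Map to strictly positive concentration parameters.
concentration = torch.nn.functional.softplus(logits) + 1e-3

dist = Dirichlet(concentration)
action = dist.rsample()           # on the simplex: >= 0, sums to 1
log_prob = dist.log_prob(action)  # density of the action PPO actually uses
log_prob.backward()               # gradients flow back into `logits`
```

Because the distribution itself is defined on the simplex, no post-hoc squashing is needed, and the log-probability used in the PPO ratio matches the executed action.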

Thank you very much for your help!

Best regards, George Wang

georgewanglz2019 avatar Jul 01 '24 09:07 georgewanglz2019