Bug in Reward function
I found a bug in the reward function in `./env/starcraft2/starcraft2.py`, in the function `reward_battle(self)` at line 729: when the enemies heal or regenerate shield, the allies receive a reward.

At line 729, `delta_enemy` is negative when the enemy heals or regenerates shield. However, the line applies `abs()` to `delta_enemy + delta_deaths`, which converts that negative sum into a positive reward. The allies are therefore rewarded whenever the enemy heals.
One consequence of this problem is that the allies may learn only to injure the enemies, never kill them, so that they keep collecting rewards as the enemies heal afterwards. Such a policy can keep increasing its reward without ever winning the game.
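To make the failure mode concrete, here is a minimal standalone sketch (not the actual SMAC code; the variable values are illustrative) of what the original line computes when the enemy regenerates shield and nothing dies:

```python
# Minimal standalone illustration of the bug (not the actual SMAC code).
# delta_enemy: net damage dealt to enemies this step; it is negative
# when the enemies heal or regenerate shield.
delta_enemy = -10.0   # enemy regenerated 10 shield points this step
delta_deaths = 0.0    # no enemy units died

# Original line 729: abs() flips the negative sum into a positive value,
# so the allies are rewarded for the enemy's regeneration.
buggy_reward = abs(delta_enemy + delta_deaths)
print(buggy_reward)  # 10.0 -- a positive reward for enemy healing
```

So every point of shield the enemy regenerates is paid back to the allies as reward, which is exactly what makes the "injure, retreat, repeat" policy profitable.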
Just to be clear, the problem you mention happens when `delta_enemy` is negative with a larger magnitude than `delta_deaths`, which would mean that agents are rewarded for injuring enemies while not killing them. In this scenario, `delta_enemy + delta_deaths` would be negative, but its absolute value would be positive.
Since `self.reward_only_positive` means we want to ignore negative rewards, I believe the best solution is to take the maximum of 0 and `delta_enemy + delta_deaths`. It would behave more closely like the reward when `self.reward_only_positive == False`, but never be negative, and it would ignore ally damage and deaths.
```python
if self.reward_only_positive:
    reward = max(0, delta_enemy + delta_deaths)  # shield regeneration
else:
    reward = delta_enemy + delta_deaths - delta_ally
```
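A quick standalone comparison (illustrative values, not the actual SMAC code) shows that the `max(0, ...)` variant agrees with `abs()` whenever the allies deal net damage, and only differs in the regeneration case, where it clamps the reward to zero:

```python
# Standalone comparison of abs() vs max(0, ...) for the positive-only reward.
def old_reward(delta_enemy, delta_deaths):
    return abs(delta_enemy + delta_deaths)     # buggy: rewards enemy healing

def new_reward(delta_enemy, delta_deaths):
    return max(0, delta_enemy + delta_deaths)  # proposed fix

# Enemy takes 10 damage, no deaths: both versions agree.
print(old_reward(10.0, 0.0), new_reward(10.0, 0.0))    # 10.0 10.0

# Enemy regenerates 10 shield, no deaths: only the fix gives zero reward.
print(old_reward(-10.0, 0.0), new_reward(-10.0, 0.0))  # 10.0 0
```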
Yes! And this situation happens especially in maps with Protoss units, which can regenerate shields. In the map "3s5z_vs_3s6z" I found that the allies can learn a strategy that increases the reward without winning the game: they injure the enemies a little, immediately run away and hide in a corner, wait for the enemies to recover, and then repeat. The problem is solved by modifying the reward function to this:

```python
if self.reward_only_positive:
    reward = max(0, delta_enemy + delta_deaths)  # shield regeneration
else:
    reward = delta_enemy + delta_deaths - delta_ally
```
I sent a quick PR with this tiny change. Please check whether it solves the issues you have seen in your experiments.
Thanks! I think this will do. I will check this again and get back to you soon.
Thanks both for pointing this out and for sending the PR https://github.com/oxwhirl/smac/pull/76. We're going to see how to best integrate this fix in the upcoming SMAC versions to avoid confusion. One issue we've noticed is that some people compare results using different SMAC/StarCraftII versions and report unfair comparisons between methods in their work.
I have just observed a similar behavior in maps that have Medivacs (MMM and MMM2). Medivacs can heal other units, and my agents learned to wait for the Medivac to heal the last enemy unit before killing it, sometimes at the expense of winning the match.
Yes! I've also noticed this situation in MMM and MMM2. But your solution of changing the positive-only reward to `reward = max(0, delta_enemy + delta_deaths)` fixes this issue as well.