Bug in Reward function
I found a bug in the reward function in `./env/starcraft2/starcraft2.py`, in the function `reward_battle(self)` at line 729: when the enemies heal or regenerate shield, the allies receive a reward.

At line 729, `delta_enemy` is negative when the enemy heals or regenerates shield. However, the line applies `abs()` to `delta_enemy + delta_deaths`, which converts that negative sum into a positive reward. The allies are therefore rewarded whenever the enemy heals.
One consequence of this problem is that the allies may learn only to injure the enemies, never kill them, so that they keep collecting rewards as the enemies heal afterwards. Such a policy can keep increasing its reward without ever winning the game.
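To make the failure mode concrete, here is a minimal standalone sketch (not the actual SMAC code; the variable values are illustrative) of what the original line computes when the enemy regenerates shield and nothing dies:

```python
# Minimal standalone illustration of the bug (not the actual SMAC code).
# delta_enemy: net damage dealt to enemies this step; it is negative
# when the enemies heal or regenerate shield.
delta_enemy = -10.0   # enemy regenerated 10 shield points this step
delta_deaths = 0.0    # no enemy units died

# Original line 729: abs() flips the negative sum into a positive value,
# so the allies are rewarded for the enemy's regeneration.
buggy_reward = abs(delta_enemy + delta_deaths)
print(buggy_reward)  # 10.0 -- a positive reward for enemy healing
```

So every point of shield the enemy regenerates is paid back to the allies as reward, which is exactly what makes the "injure, retreat, repeat" policy profitable.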
Just to be clear, the problem you mention happens when `delta_enemy` is negative with a larger magnitude than `delta_deaths`, which would mean that agents are rewarded for injuring enemies while not killing them. In this scenario, `delta_enemy + delta_deaths` would be negative, but its absolute value would be positive.
Since `self.reward_only_positive` means we want to ignore negative rewards, I believe the best solution is to take the maximum of 0 and `delta_enemy + delta_deaths`. It would behave more closely like the reward when `self.reward_only_positive == False`, but never be negative, and it would ignore ally damage and deaths.
```python
if self.reward_only_positive:
    reward = max(0, delta_enemy + delta_deaths)  # shield regeneration
else:
    reward = delta_enemy + delta_deaths - delta_ally
```
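A quick standalone comparison (illustrative values, not the actual SMAC code) shows that the `max(0, ...)` variant agrees with `abs()` whenever the allies deal net damage, and only differs in the regeneration case, where it clamps the reward to zero:

```python
# Standalone comparison of abs() vs max(0, ...) for the positive-only reward.
def old_reward(delta_enemy, delta_deaths):
    return abs(delta_enemy + delta_deaths)     # buggy: rewards enemy healing

def new_reward(delta_enemy, delta_deaths):
    return max(0, delta_enemy + delta_deaths)  # proposed fix

# Enemy takes 10 damage, no deaths: both versions agree.
print(old_reward(10.0, 0.0), new_reward(10.0, 0.0))    # 10.0 10.0

# Enemy regenerates 10 shield, no deaths: only the fix gives zero reward.
print(old_reward(-10.0, 0.0), new_reward(-10.0, 0.0))  # 10.0 0
```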
Yes! And this situation happens especially in maps with Protoss units, which can regenerate shields. In the map "3s5z_vs_3s6z" I found that the allies can learn a strategy that increases the reward without winning the game: they injure the enemies a little, immediately run away and hide in a corner, wait for the enemies to recover, and then repeat. The problem is solved by modifying the reward function to this:

```python
if self.reward_only_positive:
    reward = max(0, delta_enemy + delta_deaths)  # shield regeneration
else:
    reward = delta_enemy + delta_deaths - delta_ally
```
I sent a quick PR with this tiny change. Please check whether it solves the issues you have seen in your experiments.
Thanks! I think this will do. I will check this again and get back to you soon.
Thanks both for pointing this out and for sending the PR https://github.com/oxwhirl/smac/pull/76. We're going to see how to best integrate this fix in the upcoming SMAC versions to avoid confusion. One issue we've noticed is that some people compare results using different SMAC/StarCraftII versions and report unfair comparisons between methods in their work.
I have just observed a similar behavior in maps that have Medivacs (MMM and MMM2). Medivacs can heal other units, and my agents learned to wait for the Medivac to heal the last enemy unit before killing it, sometimes at the expense of winning the match.
Yes! I've also noticed this situation in MMM and MMM2. But your solution of changing the positive-only reward to `reward = max(0, delta_enemy + delta_deaths)` fixes this issue as well.