# New offensive gameplay architecture and Q-learning

## Description
This PR overhauls our offensive gameplay architecture. The goal of the architecture redesign is to make our AI more dynamic and adaptive to the enemy we play against.
I recommend taking a look at the Gameplay Architecture RFC. Although it is very outdated in terms of the planned implementation and architecture, the goals and user stories provide some good background on the project.
## Gameplay overview
```mermaid
classDiagram
direction TB
namespace Plays {
    class DynamicPlay
    class OffensePlay
}
<<abstract>> DynamicPlay
namespace Tactics {
    class AttackerTactic
    class SupportTactic
}
namespace Skills {
    class ShootSkill
    class KeepAwaySkill
    class KickPassSkill
    class ChipPassSkill
}
<<abstract>> SupportTactic
DynamicPlay <|-- OffensePlay : inherits
OffensePlay --> AttackerTactic
OffensePlay --> "many" SupportTactic
AttackerTactic --> ShootSkill
AttackerTactic --> KeepAwaySkill
AttackerTactic --> KickPassSkill
AttackerTactic --> ChipPassSkill
```
- `DynamicPlay` is a base Play that selects `SupportTactic`s to assign. Support tactics play supporting roles on the field (e.g. going out to receiver positions, faking out enemy robots, etc.). Over time, `DynamicPlay` is supposed to learn the best support tactics to select given the state of the game. The implementation of `DynamicPlay` and its support tactic selection algorithm may change in the future, and we currently only have one support tactic (`ReceiverTactic`), so I would not focus on reviewing these changes.
- `OffensePlay` is a `DynamicPlay` that is run when we have possession. It assigns defensive tactics selected by `DefensePlay`, support tactics selected by `DynamicPlay`, and an `AttackerTactic` that is the main ball handler during the play.
- There is now a clearer separation between Skills and Tactics. A Skill is smaller in scope and completes a single action (e.g. kick, chip, pass, dribble), while a Tactic has a greater set of responsibilities and objectives to complete. Skills have a similar interface to Tactics and are also implemented using FSMs that yield primitives. A Tactic can "execute" a Skill by forwarding the Skill's `updatePrimitive` result in its own `updatePrimitive` (see the sketch below).
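As a rough illustration of that Skill/Tactic relationship (the class shapes and method signatures here are simplified assumptions for the sketch, not the exact interfaces in the codebase):

```cpp
#include <memory>

// Simplified stand-ins for the real types
struct World {};
struct Robot {};
struct PrimitiveMsg {};

// A Skill completes a single action and yields primitives, mirroring the
// Tactic interface at a smaller scope
class Skill
{
   public:
    virtual ~Skill() = default;
    virtual PrimitiveMsg updatePrimitive(const Robot &robot,
                                         const World &world) = 0;
};

class ShootSkill : public Skill
{
   public:
    PrimitiveMsg updatePrimitive(const Robot &robot,
                                 const World &world) override
    {
        // Run the shooting FSM and yield the next primitive
        return PrimitiveMsg{};
    }
};

// A Tactic "executes" a Skill by forwarding the Skill's updatePrimitive
// result in its own updatePrimitive
class AttackerTactic
{
   public:
    PrimitiveMsg updatePrimitive(const Robot &robot, const World &world)
    {
        // current_skill_ is whatever skill the selection policy chose
        return current_skill_->updatePrimitive(robot, world);
    }

   private:
    std::unique_ptr<Skill> current_skill_ = std::make_unique<ShootSkill>();
};
```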
## Q-learning for attacker skill selection
The attacker uses a reinforcement learning algorithm called Q-learning to select, and learn which of, the skills (actions) to execute given the state of the `World`.
In Q-learning, the agent decides which action to take based on a Q-function $Q(s, a)$ that returns the expected reward for taking action $a$ in state $s$. After selecting some action $a$, we observe a reward $r$ and enter a new state $s'$; these are used to update the Q-function, adjusting the Q-value for taking action $a$ in state $s$. Since our state space is extremely large, we use linear Q-function approximation, which lets us estimate $Q(s, a)$ even for state-action pairs we have never visited before.
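Concretely, with linear Q-function approximation each action $a$ has a weight vector $\mathbf{w}_a$, and the Q-value is a dot product with a feature vector $\boldsymbol{\phi}(s)$ extracted from the state. The standard temporal-difference update is shown below; the learning rate $\alpha$ and discount factor $\gamma$ are the usual Q-learning hyperparameters, and the exact names used in our code may differ:

$$Q(s, a) = \mathbf{w}_a^\top \boldsymbol{\phi}(s)$$

$$\mathbf{w}_a \leftarrow \mathbf{w}_a + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right) \boldsymbol{\phi}(s)$$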
The Q-function weights are logged as protobufs and displayed in a new Q-learning widget in Thunderscope. Each weight in the table is associated with a feature and an action. Columns represent the features in the order they are initialized in `AttackerMdpFeatureExtractor`, and rows represent the actions in the order they are defined in `AttackerMdpAction`.
We can save the weights to a CSV file (they are also automatically written to a CSV under `/tmp/tbots`), and we can load an initial set of weights when starting the AI (`attacker_mdp_q_function_weights.csv`).
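For illustration, a weights CSV in this layout might look roughly like the following. The header row, feature columns, and every value here are made up for the example; the real ordering comes from `AttackerMdpFeatureExtractor` and `AttackerMdpAction`:

```csv
action,pass_rating,nearby_enemy_threats
Shoot,0.42,-0.31
KeepAway,-0.05,0.58
KickPass,0.66,0.12
ChipPass,0.23,0.04
```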
## Other changes
- Changed most Tactics and all Plays to accept a shared instance of a `Strategy` class instead of `TbotsProto::AiConfig`. The `Strategy` contains shared gameplay calculations and has a `getAiConfig()` method that returns the latest `TbotsProto::AiConfig`.
- Updated `SensorFusion` to track the distance that the ball has been continuously dribbled by the friendly team. This dribble distance is output in the `World`s that `SensorFusion` produces. We use this information to limit how far `DribbleSkill` can dribble, so that even if multiple dribbling skills are executed sequentially, we avoid exceeding the max dribble distance (a rough sketch of this tracking follows the list).
- Changed `PossessionTracker` to match the original CMDragons possession algorithm more closely. There are now 4 types of possession (FRIENDLY, ENEMY, IN_CONTEST, LOOSE). To make our gameplay more aggressive, `DefensePlay` is only run when we are in ENEMY possession; otherwise, `OffensePlay` is run.
- Probably more changes that I can't remember...
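As a rough sketch of the dribble distance tracking idea (this is not the actual `SensorFusion` code; the class name, the `Point` stand-in, and the reset condition are assumptions for illustration):

```cpp
#include <cmath>
#include <optional>

// Minimal stand-in for the codebase's geom Point
struct Point
{
    double x = 0.0;
    double y = 0.0;
};

static double distanceBetween(const Point &a, const Point &b)
{
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Accumulates how far the ball has been continuously dribbled by the
// friendly team, resetting whenever the dribbling streak is broken
class DribbleDistanceTracker
{
   public:
    // Call once per vision frame
    double update(const Point &ball_position, bool ball_in_friendly_dribbler)
    {
        if (!ball_in_friendly_dribbler)
        {
            // Streak broken: reset the accumulated distance
            distance_dribbled_ = 0.0;
            last_ball_position_.reset();
        }
        else
        {
            if (last_ball_position_)
            {
                distance_dribbled_ +=
                    distanceBetween(ball_position, *last_ball_position_);
            }
            last_ball_position_ = ball_position;
        }
        return distance_dribbled_;
    }

   private:
    double distance_dribbled_ = 0.0;
    std::optional<Point> last_ball_position_;
};
```

Because the accumulated distance lives in the `World` rather than inside any one skill, sequential dribbling skills all see the same running total.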
## Testing Done
- Based on the eye test, everything works and looks OK. We ran the new gameplay for an extended period of time during the scrimmage, and I don't think there were any major hitches or crashes.
- Existing simulated gameplay tests have been updated and all pass
- Tactics converted to Skills have had their tests updated and all pass
## Resolved Issues
Resolves #3080 Resolves #3079 Resolves #3078 Resolves #3077 Resolves #3076 Resolves #3074 Resolves #3073 Resolves #3071 Resolves #3070 Resolves #3069 Resolves #3065 Resolves #2514 Resolves #3083 Resolves #3081 Resolves #3072 Resolves #3219 Resolves #3203 Resolves #3216 Resolves #3156 Resolves #3233 Resolves #2930 Resolves #2868 Resolves #2643 Resolves #3098 Resolves #3097 Resolves #3096 Resolves #3082 Resolves #2134
## Length Justification and Key Files to Review
- `software/ai/evaluation/q_learning`
- `software/ai/hl/stp/skill`
- `software/ai/hl/stp/play/dynamic_plays`
- `software/ai/hl/stp/tactic/attacker`
## Review Checklist
It is the reviewer's responsibility to also make sure every item here has been covered.
- [ ] Function & Class comments: All function definitions (usually in the `.h` file) should have a javadoc style comment at the start of them. For examples, see the functions defined in `thunderbots/software/geom`. Similarly, all classes should have an associated Javadoc comment explaining the purpose of the class.
- [ ] Remove all commented out code
- [ ] Remove extra print statements: for example, those just used for testing
- [ ] Resolve all TODO's: All `TODO` (or similar) statements should either be completed or associated with a github issue
I think I've addressed all of the comments so far. Spent most of today fixing merge conflicts.
- Added a bar graph to the Q-learning widget that shows the softmax probability distribution over the actions, with the selected action highlighted in blue (see the softmax note after this list)
- Swapped the `KickPass` and `ChipPass` weights associated with the pass rating feature (the `ChipPass` weight was larger before), so now we chip less frequently
- `PassSkill` and `ShootSkill` now abort bad passes and shots, which seems to work well
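For reference, the probabilities in that bar graph come from a softmax over the Q-values. Assuming the standard form (the temperature $\tau$ is my assumption; the implementation may simply use $\tau = 1$):

$$\pi(a \mid s) = \frac{\exp\left(Q(s, a) / \tau\right)}{\sum_{a'} \exp\left(Q(s, a') / \tau\right)}$$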
Just gave the recent changes a quick try, and the probability bar chart looks awesome!! I'm amazed by how often and how significantly it changes between episodes; I was expecting the changes to be minor. On first glance the gameplay does seem even better, with the passes and shots being more consistent. Though every now and then I do notice the attacker trying to find a pass and not being able to because of the enemy pass defenders. In these scenarios, it might make sense to run keep away to move to a better passing position (poor naming, it's not really "keeping away") while looking for a pass. Not sure if this is achievable given that Keep Away is a separate skill from passing, though I guess we could have the passing skill FSMs look for a better pass if they don't take a pass within x seconds.
I will try to spend some time in the morning looking through the code as well.
Some food for thought: the features we're currently using give a very limited perspective on how the enemy robots are playing and how they are positioned relative to us. I wonder if we should include other features, such as the max or sum of the enemy threat ratings. With this, the offensive robot may decide to run keep away instead of passing/shooting. Though to be fair, I'm not really sure how the threat ratings work and whether their values make sense from the perspective of the attacker deciding what skill to choose.
- Disabled `ChipPassSkill` and `DribbleShootSkill`
- Added `nearbyEnemyThreatsFeature` that represents the number of nearby enemy robots that could steal the ball
- `PassSkillFSM` now dribbles the ball to a better passing position using `findKeepAwayTargetPoint` if it cannot find a "perfect" pass (a rough sketch of this fallback follows the list)
- Fixed bug in dribble distance tracker in `SensorFusion` where `isNearDribbler` would flicker on and off even if the ball was in the dribbler
- Tuned initial weights for 5v5
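A minimal sketch of that fallback decision, assuming made-up types and a hypothetical rating threshold (`findKeepAwayTargetPoint` exists in the codebase, but the signature and stub body here are illustrative, not the real implementation):

```cpp
// Illustrative stand-ins for the codebase's real types
struct Point
{
    double x = 0.0;
    double y = 0.0;
};
struct Pass
{
    Point receiver_point;
    double rating;  // higher is better
};
struct World
{
};

// Hypothetical threshold for a "perfect" pass
constexpr double kIdealPassRating = 0.7;

// Stub: the real findKeepAwayTargetPoint weighs enemy proximity and open
// space; here we just return the candidate receiver point
Point findKeepAwayTargetPoint(const World &, const Pass &best_pass)
{
    return best_pass.receiver_point;
}

// Decide where the attacker should dribble while searching for a pass:
// commit to the pass if it's good enough, otherwise reposition with the
// keep-away heuristic and keep re-evaluating
Point chooseDribbleTarget(const World &world, const Pass &best_pass,
                          bool &commit_to_pass)
{
    commit_to_pass = best_pass.rating >= kIdealPassRating;
    if (commit_to_pass)
    {
        return best_pass.receiver_point;
    }
    return findKeepAwayTargetPoint(world, best_pass);
}
```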
The diff will be pretty unreadable since I merged in #3292
I also merged in some stuff from RoboCup, including @mkhlb's new ball placement play, which required a lot of changes to work with the new gameplay. So I suggest we close #3243 and have any further changes to ball placement be committed directly to this `new_gameplay_staging` branch.
Closing #3243 is good with me. I think there are some stray comments in that PR that we can make tickets for.
Succeeded by #3415