# New offensive gameplay architecture and Q-learning

## Description
This PR overhauls our offensive gameplay architecture. The goal of the architecture redesign is to make our AI more dynamic and adaptive to the enemy we play against.
I recommend taking a look at the Gameplay Architecture RFC. Although it is very outdated in terms of the planned implementation and architecture, the goals and user stories provide some good background on the project.
## Gameplay overview
```mermaid
classDiagram
direction TB
namespace Plays {
    class DynamicPlay
    class OffensePlay
}
<<abstract>> DynamicPlay
namespace Tactics {
    class AttackerTactic
    class SupportTactic
}
namespace Skills {
    class ShootSkill
    class KeepAwaySkill
    class KickPassSkill
    class ChipPassSkill
}
<<abstract>> SupportTactic
DynamicPlay <|-- OffensePlay : inherits
OffensePlay --> AttackerTactic
OffensePlay --> "many" SupportTactic
AttackerTactic --> ShootSkill
AttackerTactic --> KeepAwaySkill
AttackerTactic --> KickPassSkill
AttackerTactic --> ChipPassSkill
```
- `DynamicPlay` is a base Play that selects `SupportTactic`s to assign. Support tactics play supporting roles on the field (e.g. going out to receiver positions, faking out enemy robots, etc.). Over time, `DynamicPlay` is supposed to learn the best support tactics to select given the state of the game. The implementation of `DynamicPlay` and its support tactic selection algorithm may change in the future, and we currently only have one support tactic (`ReceiverTactic`), so I would not focus on reviewing these changes.
- `OffensePlay` is a `DynamicPlay` that is run when we have possession. It assigns defensive tactics selected by `DefensePlay`, support tactics selected by `DynamicPlay`, and an `AttackerTactic` that is the main ball handler during the play.
- There is now a clearer separation between Skills and Tactics. A Skill is smaller in scope and completes a single action (e.g. kick, chip, pass, dribble), while a Tactic has a greater set of responsibilities and objectives to complete. Skills have a similar interface to Tactics and are also implemented using FSMs that yield primitives. A Tactic can "execute" a Skill by forwarding the Skill's `updatePrimitive` result in its own `updatePrimitive` (see the sketch below).
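As a rough illustration of that Skill/Tactic relationship (the class shapes and method signatures here are simplified assumptions for the sketch, not the exact interfaces in the codebase):

```cpp
#include <memory>

// Simplified stand-ins for the real types
struct World {};
struct Robot {};
struct PrimitiveMsg {};

// A Skill completes a single action and yields primitives, mirroring the
// Tactic interface at a smaller scope
class Skill
{
   public:
    virtual ~Skill() = default;
    virtual PrimitiveMsg updatePrimitive(const Robot &robot,
                                         const World &world) = 0;
};

class ShootSkill : public Skill
{
   public:
    PrimitiveMsg updatePrimitive(const Robot &robot,
                                 const World &world) override
    {
        // Run the shooting FSM and yield the next primitive
        return PrimitiveMsg{};
    }
};

// A Tactic "executes" a Skill by forwarding the Skill's updatePrimitive
// result in its own updatePrimitive
class AttackerTactic
{
   public:
    PrimitiveMsg updatePrimitive(const Robot &robot, const World &world)
    {
        // current_skill_ is whatever skill the selection policy chose
        return current_skill_->updatePrimitive(robot, world);
    }

   private:
    std::unique_ptr<Skill> current_skill_ = std::make_unique<ShootSkill>();
};
```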
## Q-learning for attacker skill selection
The attacker uses a reinforcement learning algorithm called Q-learning to select, and learn which of, the skills (actions) to execute given the state of the `World`.
In Q-learning, the agent decides which action to take based on a Q-function $Q(s, a)$ that returns the expected reward for taking action $a$ in state $s$. After selecting some action $a$, we observe a reward $r$ and enter a new state $s'$; these are used to update the Q-function, adjusting the Q-value for taking action $a$ in state $s$. Since our state space is extremely large, we use linear Q-function approximation, which lets us estimate $Q(s, a)$ even for state-action pairs we have never visited before.
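Concretely, with linear Q-function approximation each action $a$ has a weight vector $\mathbf{w}_a$, and the Q-value is a dot product with a feature vector $\boldsymbol{\phi}(s)$ extracted from the state. The standard temporal-difference update is shown below; the learning rate $\alpha$ and discount factor $\gamma$ are the usual Q-learning hyperparameters, and the exact names used in our code may differ:

$$Q(s, a) = \mathbf{w}_a^\top \boldsymbol{\phi}(s)$$

$$\mathbf{w}_a \leftarrow \mathbf{w}_a + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right) \boldsymbol{\phi}(s)$$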
The Q-function weights are logged as protobufs and displayed in a new Q-learning widget in Thunderscope. Each weight in the table is associated with a feature and an action. Columns represent the features in the order they are initialized in `AttackerMdpFeatureExtractor`, and rows represent the actions in the order they are defined in `AttackerMdpAction`.
We can save the weights to a CSV file (they are also automatically written to a CSV under `/tmp/tbots`), and we can load an initial set of weights when starting the AI (`attacker_mdp_q_function_weights.csv`).
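For illustration, a weights CSV in this layout might look roughly like the following. The header row, feature columns, and every value here are made up for the example; the real ordering comes from `AttackerMdpFeatureExtractor` and `AttackerMdpAction`:

```csv
action,pass_rating,nearby_enemy_threats
Shoot,0.42,-0.31
KeepAway,-0.05,0.58
KickPass,0.66,0.12
ChipPass,0.23,0.04
```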
## Other changes
- Changed most Tactics and all Plays to accept a shared instance of a `Strategy` class instead of `TbotsProto::AiConfig`. The `Strategy` contains shared gameplay calculations and has a `getAiConfig()` method that returns the latest `TbotsProto::AiConfig`.
- Updated `SensorFusion` to track the distance that the ball has been continuously dribbled by the friendly team. This dribble distance is output in the `World`s that `SensorFusion` produces. We use this information to limit how far `DribbleSkill` can dribble, so that even if multiple dribbling skills are executed sequentially, we avoid exceeding the max dribble distance (a rough sketch of this tracking follows the list).
- Changed `PossessionTracker` to match the original CMDragons possession algorithm more closely. There are now 4 types of possession (FRIENDLY, ENEMY, IN_CONTEST, LOOSE). To make our gameplay more aggressive, `DefensePlay` is only run when we are in ENEMY possession; otherwise, `OffensePlay` is run.
- Probably more changes that I can't remember...
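As a rough sketch of the dribble distance tracking idea (this is not the actual `SensorFusion` code; the class name, the `Point` stand-in, and the reset condition are assumptions for illustration):

```cpp
#include <cmath>
#include <optional>

// Minimal stand-in for the codebase's geom Point
struct Point
{
    double x = 0.0;
    double y = 0.0;
};

static double distanceBetween(const Point &a, const Point &b)
{
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Accumulates how far the ball has been continuously dribbled by the
// friendly team, resetting whenever the dribbling streak is broken
class DribbleDistanceTracker
{
   public:
    // Call once per vision frame
    double update(const Point &ball_position, bool ball_in_friendly_dribbler)
    {
        if (!ball_in_friendly_dribbler)
        {
            // Streak broken: reset the accumulated distance
            distance_dribbled_ = 0.0;
            last_ball_position_.reset();
        }
        else
        {
            if (last_ball_position_)
            {
                distance_dribbled_ +=
                    distanceBetween(ball_position, *last_ball_position_);
            }
            last_ball_position_ = ball_position;
        }
        return distance_dribbled_;
    }

   private:
    double distance_dribbled_ = 0.0;
    std::optional<Point> last_ball_position_;
};
```

Because the accumulated distance lives in the `World` rather than inside any one skill, sequential dribbling skills all see the same running total.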
## Testing Done
- Based on the eye test, everything works and looks OK. We ran the new gameplay for an extended period of time during the scrimmage, and I don't think there were any major hitches or crashes.
- Existing simulated gameplay tests have been updated and all pass
- Tactics converted to Skills have had their tests updated and all pass
## Resolved Issues
Resolves #3080 Resolves #3079 Resolves #3078 Resolves #3077 Resolves #3076 Resolves #3074 Resolves #3073 Resolves #3071 Resolves #3070 Resolves #3069 Resolves #3065 Resolves #2514 Resolves #3083 Resolves #3081 Resolves #3072 Resolves #3219 Resolves #3203 Resolves #3216 Resolves #3156 Resolves #3233 Resolves #2930 Resolves #2868 Resolves #2643 Resolves #3098 Resolves #3097 Resolves #3096 Resolves #3082 Resolves #2134
## Length Justification and Key Files to Review
- `software/ai/evaluation/q_learning`
- `software/ai/hl/stp/skill`
- `software/ai/hl/stp/play/dynamic_plays`
- `software/ai/hl/stp/tactic/attacker`
## Review Checklist
It is the reviewer's responsibility to also make sure every item here has been covered.
- [ ] Function & Class comments: All function definitions (usually in the `.h` file) should have a javadoc style comment at the start of them. For examples, see the functions defined in `thunderbots/software/geom`. Similarly, all classes should have an associated Javadoc comment explaining the purpose of the class.
- [ ] Remove all commented out code
- [ ] Remove extra print statements: for example, those just used for testing
- [ ] Resolve all TODO's: All `TODO` (or similar) statements should either be completed or associated with a github issue
I think I've addressed all of the comments so far. Spent most of today fixing merge conflicts.
- Added a bar graph to the Q-learning widget that shows the softmax probability distribution over the actions, with the selected action highlighted in blue (see the softmax note after this list)
- Swapped the `KickPass` and `ChipPass` weights associated with the pass rating feature (the `ChipPass` weight was larger before), so now we chip less frequently
- `PassSkill` and `ShootSkill` now abort bad passes and shots, which seems to work well
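For reference, the probabilities in that bar graph come from a softmax over the Q-values. Assuming the standard form (the temperature $\tau$ is my assumption; the implementation may simply use $\tau = 1$):

$$\pi(a \mid s) = \frac{\exp\left(Q(s, a) / \tau\right)}{\sum_{a'} \exp\left(Q(s, a') / \tau\right)}$$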
Just gave the recent changes a quick try, and the probability bar chart looks awesome!! I'm amazed by how often and how significantly it changes between episodes; I was expecting the changes to be minor. On first glance the gameplay does seem even better, with the passes and shots being more consistent. Though every now and then I do notice the attacker trying to find a pass and not being able to because of the enemy pass defenders. In these scenarios, it might make sense to run keep away to move to a better passing position (poor naming, it's not really "keeping away") while looking for a pass. Not sure if this is achievable given that Keep Away is a separate skill from passing, though I guess we could have the passing skill FSMs look for a better pass if they don't take a pass within x seconds.
I will try to spend some time in the morning looking through the code as well.
Some food for thought: the features we're currently using give a very limited perspective on how the enemy robots are playing and how they are positioned relative to us. I wonder if we should include other features, such as the max or sum of the enemy threat ratings. With this, the offensive robot may decide to run keep away instead of passing/shooting. Though to be fair, I'm not really sure how the threat ratings work and whether their values make sense from the perspective of the attacker deciding what skill to choose.
- Disabled `ChipPassSkill` and `DribbleShootSkill`
- Added `nearbyEnemyThreatsFeature` that represents the number of nearby enemy robots that could steal the ball
- `PassSkillFSM` now dribbles the ball to a better passing position using `findKeepAwayTargetPoint` if it cannot find a "perfect" pass (a rough sketch of this fallback follows the list)
- Fixed bug in dribble distance tracker in `SensorFusion` where `isNearDribbler` would flicker on and off even if the ball was in the dribbler
- Tuned initial weights for 5v5
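A minimal sketch of that fallback decision, assuming made-up types and a hypothetical rating threshold (`findKeepAwayTargetPoint` exists in the codebase, but the signature and stub body here are illustrative, not the real implementation):

```cpp
// Illustrative stand-ins for the codebase's real types
struct Point
{
    double x = 0.0;
    double y = 0.0;
};
struct Pass
{
    Point receiver_point;
    double rating;  // higher is better
};
struct World
{
};

// Hypothetical threshold for a "perfect" pass
constexpr double kIdealPassRating = 0.7;

// Stub: the real findKeepAwayTargetPoint weighs enemy proximity and open
// space; here we just return the candidate receiver point
Point findKeepAwayTargetPoint(const World &, const Pass &best_pass)
{
    return best_pass.receiver_point;
}

// Decide where the attacker should dribble while searching for a pass:
// commit to the pass if it's good enough, otherwise reposition with the
// keep-away heuristic and keep re-evaluating
Point chooseDribbleTarget(const World &world, const Pass &best_pass,
                          bool &commit_to_pass)
{
    commit_to_pass = best_pass.rating >= kIdealPassRating;
    if (commit_to_pass)
    {
        return best_pass.receiver_point;
    }
    return findKeepAwayTargetPoint(world, best_pass);
}
```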
The diff will be pretty unreadable since I merged in #3292
I also merged in some stuff from RoboCup, including @mkhlb's new ball placement play, which required a lot of changes to work with the new gameplay. So I suggest we close #3243 and have any further changes to ball placement be committed directly to this `new_gameplay_staging` branch.
Closing #3243 is good with me. I think there are some stray comments in that PR that we can make tickets for.
Succeeded by #3415