Poker_CFR
Counterfactual Regret Minimization for poker games.
CFR Algorithm
In essence, CFR is a regret-matching procedure applied to minimize a quantity called immediate counterfactual regret.
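To make the regret-matching part concrete, here is a minimal Python sketch (the function name and data layout are our own, not this repository's API): it turns accumulated regrets into a strategy by normalizing their positive parts.

```python
# Minimal regret-matching sketch (illustrative only; not this repository's API).

def regret_matching(cumulative_regrets):
    """Turn per-action cumulative regrets into a strategy.

    Actions with positive cumulative regret get probability proportional
    to that regret; if no regret is positive, play uniformly at random.
    """
    positive = [max(r, 0.0) for r in cumulative_regrets]
    total = sum(positive)
    if total > 0.0:
        return [p / total for p in positive]
    n = len(cumulative_regrets)
    return [1.0 / n for _ in range(n)]

# Example: cumulative regrets for (fold, call, raise)
print(regret_matching([-1.0, 3.0, 1.0]))  # -> [0.0, 0.75, 0.25]
```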
Average overall regret
Let $\pi^{\sigma}(h)$ be the probability of history $h$ occurring if players choose actions according to $\sigma$. The overall value to player $i$ of a strategy profile is then the expected payoff of the resulting terminal node:

$$u_i(\sigma) = \sum_{h \in Z} u_i(h)\,\pi^{\sigma}(h)$$
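As a tiny worked sketch (the game and numbers below are hypothetical, chosen only to illustrate the sum), the overall value is just each terminal payoff weighted by its reach probability under $\sigma$:

```python
# Sketch: expected payoff of a strategy profile as a reach-weighted sum
# over terminal histories. The histories and numbers are hypothetical.

# terminal history -> (reach probability under sigma, payoff to player i)
terminals = {
    "check-check":     (0.25,  1.0),
    "check-bet-fold":  (0.25, -1.0),
    "bet-call":        (0.30,  2.0),
    "bet-fold":        (0.20,  1.0),
}

u_i = sum(reach * payoff for reach, payoff in terminals.values())
print(u_i)  # 0.25*1 - 0.25*1 + 0.30*2 + 0.20*1 = 0.8
```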
Let $\sigma_i^t$ be the strategy used by player $i$ on round $t$. The average overall regret of player $i$ at time $T$ is:

$$R_i^T = \frac{1}{T}\,\max_{\sigma_i^{*} \in \Sigma_i} \sum_{t=1}^{T} \left( u_i(\sigma_i^{*}, \sigma_{-i}^{t}) - u_i(\sigma^{t}) \right)$$
Moreover, define $\bar{\sigma}_i^t$ to be the average strategy for player $i$ from time 1 to $t$.
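In implementations, this average strategy is usually maintained incrementally: each iteration's strategy at an information set is added to a running sum, weighted by the player's own reach probability, and normalized when queried. A minimal sketch (class and field names are assumptions for illustration):

```python
# Sketch of maintaining the average strategy at one information set.
# Names and structure are illustrative, not this repository's API.

class InfoSetNode:
    def __init__(self, num_actions):
        self.strategy_sum = [0.0] * num_actions

    def accumulate(self, strategy, my_reach):
        # Weight the current strategy by player i's own reach probability.
        for a, p in enumerate(strategy):
            self.strategy_sum[a] += my_reach * p

    def average_strategy(self):
        total = sum(self.strategy_sum)
        if total > 0.0:
            return [s / total for s in self.strategy_sum]
        n = len(self.strategy_sum)
        return [1.0 / n for _ in range(n)]

node = InfoSetNode(2)
node.accumulate([0.9, 0.1], my_reach=1.0)
node.accumulate([0.5, 0.5], my_reach=0.5)
print(node.average_strategy())  # [(0.9 + 0.25) / 1.5, (0.1 + 0.25) / 1.5]
```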
Counterfactual utility
For every opponent’s hand (game state $h$), we use the probability of reaching $h$ assuming we wanted to get to $h$. So instead of using our regular strategy from strategy profile $\sigma$, we modify it a bit so that it always tries to reach the current game state $h$: for each information set prior to the currently assumed game state, we pretend we always played the pure behavioral strategy that placed the whole probability mass on the action that was actually played and led to the currently assumed state. This is counterfactual, in opposition to the facts, because we really played according to $\sigma$. In practice, then, we just consider our opponents’ contribution to the probability of reaching the currently assumed game state, $\pi_{-i}^{\sigma}(h)$.

Formally, counterfactual utility for information set $I$, player $i$, and strategy $\sigma$ is given by:

$$u_i(\sigma, I) = \frac{\sum_{h \in I}\sum_{z \in Z} \pi_{-i}^{\sigma}(h)\,\pi^{\sigma}(h, z)\,u_i(z)}{\sum_{h \in I} \pi_{-i}^{\sigma}(h)}$$

The denominator is the sum of our counterfactual weights – a normalizing constant.
You may also find the unnormalized form of the above – that is fine, and let’s have it too, as it will come in handy later:

$$\tilde{u}_i(\sigma, I) = \sum_{h \in I}\sum_{z \in Z} \pi_{-i}^{\sigma}(h)\,\pi^{\sigma}(h, z)\,u_i(z)$$
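A rough sketch of how the unnormalized form can be computed for one information set, assuming the reach probabilities and terminal payoffs are already available (in a real solver they come from a game-tree traversal; all names here are hypothetical):

```python
# Sketch: unnormalized counterfactual utility of an information set.
# The inputs are hypothetical placeholders.

def counterfactual_utility(states):
    """states: list of (opp_reach, continuations) where
    opp_reach     = pi_{-i}(h), the opponents' (and chance's) contribution
                    to reaching h,
    continuations = list of (pi(h, z), u_i(z)) over terminal states z
                    reachable from h.
    Returns the unnormalized counterfactual utility of the info set."""
    total = 0.0
    for opp_reach, continuations in states:
        for reach_h_to_z, payoff in continuations:
            total += opp_reach * reach_h_to_z * payoff
    return total

# Two possible opponent hands, i.e. two states in the information set:
states = [
    (0.5, [(1.0,  2.0)]),  # opponent holds hand A, we win 2
    (0.5, [(1.0, -1.0)]),  # opponent holds hand B, we lose 1
]
print(counterfactual_utility(states))  # 0.5*2 - 0.5*1 = 0.5
```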
Immediate Counterfactual Regret
To introduce counterfactual regret minimization, we need to look at the poker game from a specific angle. First of all, we will be looking at a single information set, a single decision point. We will consider acting in this information set repeatedly over time, with the goal of acting in the best possible way with respect to a certain reward measure.
Assuming we are playing as player $i$, let’s agree that the reward for playing action $a$ is the unnormalized counterfactual utility under the assumption that we played action $a$ (let’s just assume this is how the environment rewards us). The quantity in question is then defined as:

$$\tilde{u}_i(\sigma|_{I \to a}, I) = \sum_{h \in I}\sum_{z \in Z} \pi_{-i}^{\sigma}(h)\,\pi^{\sigma}(h \cdot a, z)\,u_i(z)$$

where $h \cdot a$ is the game state implied by playing action $a$ in game state $h$. We can do this because we assume we played $a$ with probability 1.
We can define regret in our setting to be:

$$R_{i,\mathrm{imm}}^{T}(I) = \frac{1}{T}\,\max_{a \in A(I)} \sum_{t=1}^{T} \left( \tilde{u}_i(\sigma^{t}|_{I \to a}, I) - \tilde{u}_i(\sigma^{t}, I) \right)$$

which is called Immediate Counterfactual Regret.
Similarly, the Immediate Counterfactual Regret of not playing action $a$ is given by:

$$R_{i,\mathrm{imm}}^{T}(I, a) = \frac{1}{T} \sum_{t=1}^{T} \left( \tilde{u}_i(\sigma^{t}|_{I \to a}, I) - \tilde{u}_i(\sigma^{t}, I) \right)$$
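Putting the pieces together, a CFR implementation typically accumulates, per information set and per action, the difference between the counterfactual utility of always playing that action and the counterfactual utility of the current strategy, then feeds the accumulated regrets back into regret matching to produce the next strategy. A hedged sketch under those assumptions (names are illustrative, not this repository's code):

```python
# Sketch: accumulating per-action immediate counterfactual regret at one
# information set and deriving the next strategy by regret matching.
# Everything here is illustrative, not this repository's API.

def update_node(cumulative_regrets, action_utils, strategy):
    """action_utils[a] = unnormalized counterfactual utility of playing a,
    strategy          = current strategy at this information set."""
    # Counterfactual utility of the current (mixed) strategy.
    node_util = sum(p * u for p, u in zip(strategy, action_utils))
    # Regret of not having played each action.
    for a, u in enumerate(action_utils):
        cumulative_regrets[a] += u - node_util
    # Next strategy via regret matching on the positive parts.
    positive = [max(r, 0.0) for r in cumulative_regrets]
    total = sum(positive)
    if total > 0.0:
        return [p / total for p in positive]
    n = len(strategy)
    return [1.0 / n for _ in range(n)]

regrets = [0.0, 0.0]
strategy = [0.5, 0.5]
strategy = update_node(regrets, action_utils=[1.0, -1.0], strategy=strategy)
print(strategy)  # action 0 accumulated positive regret -> [1.0, 0.0]
```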
For more information about these concepts, you can refer to the original paper and this post.