Poker_CFR
Counterfactual Regret Minimization for poker games.
CFR Algorithm
In essence, CFR is a regret-matching procedure applied to minimize a quantity called immediate counterfactual regret.
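To make the regret-matching part concrete, here is a minimal Python sketch (the function name and data layout are our own, not this repository's API): it turns accumulated regrets into a strategy by normalizing their positive parts.

```python
# Minimal regret-matching sketch (illustrative only; not this repository's API).

def regret_matching(cumulative_regrets):
    """Turn per-action cumulative regrets into a strategy.

    Actions with positive cumulative regret get probability proportional
    to that regret; if no regret is positive, play uniformly at random.
    """
    positive = [max(r, 0.0) for r in cumulative_regrets]
    total = sum(positive)
    if total > 0.0:
        return [p / total for p in positive]
    n = len(cumulative_regrets)
    return [1.0 / n for _ in range(n)]

# Example: cumulative regrets for (fold, call, raise)
print(regret_matching([-1.0, 3.0, 1.0]))  # -> [0.0, 0.75, 0.25]
```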
Average overall regret
Let $\pi^{\sigma}(h)$ be the probability of history $h$ occurring if players choose actions according to $\sigma$. The overall value to player $i$ of a strategy profile is then the expected payoff of the resulting terminal node:

$$u_i(\sigma) = \sum_{h \in Z} u_i(h)\,\pi^{\sigma}(h)$$
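As a tiny worked sketch (the game and numbers below are hypothetical, chosen only to illustrate the sum), the overall value is just each terminal payoff weighted by its reach probability under $\sigma$:

```python
# Sketch: expected payoff of a strategy profile as a reach-weighted sum
# over terminal histories. The histories and numbers are hypothetical.

# terminal history -> (reach probability under sigma, payoff to player i)
terminals = {
    "check-check":     (0.25,  1.0),
    "check-bet-fold":  (0.25, -1.0),
    "bet-call":        (0.30,  2.0),
    "bet-fold":        (0.20,  1.0),
}

u_i = sum(reach * payoff for reach, payoff in terminals.values())
print(u_i)  # 0.25*1 - 0.25*1 + 0.30*2 + 0.20*1 = 0.8
```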
Let $\sigma_i^t$ be the strategy used by player $i$ on round $t$. The average overall regret of player $i$ at time $T$ is:

$$R_i^T = \frac{1}{T}\,\max_{\sigma_i^{*} \in \Sigma_i} \sum_{t=1}^{T} \left( u_i(\sigma_i^{*}, \sigma_{-i}^{t}) - u_i(\sigma^{t}) \right)$$
Moreover, define $\bar{\sigma}_i^t$ to be the average strategy for player $i$ from time 1 to $t$.
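In implementations, this average strategy is usually maintained incrementally: each iteration's strategy at an information set is added to a running sum, weighted by the player's own reach probability, and normalized when queried. A minimal sketch (class and field names are assumptions for illustration):

```python
# Sketch of maintaining the average strategy at one information set.
# Names and structure are illustrative, not this repository's API.

class InfoSetNode:
    def __init__(self, num_actions):
        self.strategy_sum = [0.0] * num_actions

    def accumulate(self, strategy, my_reach):
        # Weight the current strategy by player i's own reach probability.
        for a, p in enumerate(strategy):
            self.strategy_sum[a] += my_reach * p

    def average_strategy(self):
        total = sum(self.strategy_sum)
        if total > 0.0:
            return [s / total for s in self.strategy_sum]
        n = len(self.strategy_sum)
        return [1.0 / n for _ in range(n)]

node = InfoSetNode(2)
node.accumulate([0.9, 0.1], my_reach=1.0)
node.accumulate([0.5, 0.5], my_reach=0.5)
print(node.average_strategy())  # [(0.9 + 0.25) / 1.5, (0.1 + 0.25) / 1.5]
```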
Counterfactual utility
For every opponent’s hand (game state $h$), we use the probability of reaching $h$ assuming we wanted to get to $h$. So instead of using our regular strategy from strategy profile $\sigma$, we modify it a bit so that it always tries to reach the current game state $h$: for each information set prior to the currently assumed game state, we pretend we always played the pure behavioral strategy that placed the whole probability mass on the action that was actually played and led to the currently assumed state. This is counterfactual, in opposition to the facts, because we really played according to $\sigma$. In practice, then, we just consider our opponents’ contribution to the probability of reaching the currently assumed game state, $\pi_{-i}^{\sigma}(h)$.

Formally, counterfactual utility for information set $I$, player $i$, and strategy $\sigma$ is given by:

$$u_i(\sigma, I) = \frac{\sum_{h \in I}\sum_{z \in Z} \pi_{-i}^{\sigma}(h)\,\pi^{\sigma}(h, z)\,u_i(z)}{\sum_{h \in I} \pi_{-i}^{\sigma}(h)}$$

The denominator is the sum of our counterfactual weights – a normalizing constant.
You may also find the unnormalized form of the above – that is fine, and let’s have it too, as it will come in handy later:

$$\tilde{u}_i(\sigma, I) = \sum_{h \in I}\sum_{z \in Z} \pi_{-i}^{\sigma}(h)\,\pi^{\sigma}(h, z)\,u_i(z)$$
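A rough sketch of how the unnormalized form can be computed for one information set, assuming the reach probabilities and terminal payoffs are already available (in a real solver they come from a game-tree traversal; all names here are hypothetical):

```python
# Sketch: unnormalized counterfactual utility of an information set.
# The inputs are hypothetical placeholders.

def counterfactual_utility(states):
    """states: list of (opp_reach, continuations) where
    opp_reach     = pi_{-i}(h), the opponents' (and chance's) contribution
                    to reaching h,
    continuations = list of (pi(h, z), u_i(z)) over terminal states z
                    reachable from h.
    Returns the unnormalized counterfactual utility of the info set."""
    total = 0.0
    for opp_reach, continuations in states:
        for reach_h_to_z, payoff in continuations:
            total += opp_reach * reach_h_to_z * payoff
    return total

# Two possible opponent hands, i.e. two states in the information set:
states = [
    (0.5, [(1.0,  2.0)]),  # opponent holds hand A, we win 2
    (0.5, [(1.0, -1.0)]),  # opponent holds hand B, we lose 1
]
print(counterfactual_utility(states))  # 0.5*2 - 0.5*1 = 0.5
```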
Immediate Counterfactual Regret
To introduce counterfactual regret minimization, we need to look at the poker game from a specific angle. First of all, we will be looking at a single information set, a single decision point. We will consider acting in this information set repeatedly over time, with the goal of acting in the best possible way with respect to a certain reward measure.
Assuming we are playing as player $i$, let’s agree that the reward for playing action $a$ is the unnormalized counterfactual utility under the assumption that we played action $a$ (let’s just assume this is how the environment rewards us). The quantity in question is then defined as:

$$\tilde{u}_i(\sigma|_{I \to a}, I) = \sum_{h \in I}\sum_{z \in Z} \pi_{-i}^{\sigma}(h)\,\pi^{\sigma}(h \cdot a, z)\,u_i(z)$$

where $h \cdot a$ is the game state implied by playing action $a$ in game state $h$. We can do this because we assume we played $a$ with probability 1.
We can define regret in our setting to be:

$$R_{i,\mathrm{imm}}^{T}(I) = \frac{1}{T}\,\max_{a \in A(I)} \sum_{t=1}^{T} \left( \tilde{u}_i(\sigma^{t}|_{I \to a}, I) - \tilde{u}_i(\sigma^{t}, I) \right)$$

which is called Immediate Counterfactual Regret.
Similarly, the Immediate Counterfactual Regret of not playing action $a$ is given by:

$$R_{i,\mathrm{imm}}^{T}(I, a) = \frac{1}{T} \sum_{t=1}^{T} \left( \tilde{u}_i(\sigma^{t}|_{I \to a}, I) - \tilde{u}_i(\sigma^{t}, I) \right)$$
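Putting the pieces together, a CFR implementation typically accumulates, per information set and per action, the difference between the counterfactual utility of always playing that action and the counterfactual utility of the current strategy, then feeds the accumulated regrets back into regret matching to produce the next strategy. A hedged sketch under those assumptions (names are illustrative, not this repository's code):

```python
# Sketch: accumulating per-action immediate counterfactual regret at one
# information set and deriving the next strategy by regret matching.
# Everything here is illustrative, not this repository's API.

def update_node(cumulative_regrets, action_utils, strategy):
    """action_utils[a] = unnormalized counterfactual utility of playing a,
    strategy          = current strategy at this information set."""
    # Counterfactual utility of the current (mixed) strategy.
    node_util = sum(p * u for p, u in zip(strategy, action_utils))
    # Regret of not having played each action.
    for a, u in enumerate(action_utils):
        cumulative_regrets[a] += u - node_util
    # Next strategy via regret matching on the positive parts.
    positive = [max(r, 0.0) for r in cumulative_regrets]
    total = sum(positive)
    if total > 0.0:
        return [p / total for p in positive]
    n = len(strategy)
    return [1.0 / n for _ in range(n)]

regrets = [0.0, 0.0]
strategy = [0.5, 0.5]
strategy = update_node(regrets, action_utils=[1.0, -1.0], strategy=strategy)
print(strategy)  # action 0 accumulated positive regret -> [1.0, 0.0]
```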
For more information about these concepts, you can refer to the original paper and this post.