Reinforcement learning (RL) agents achieve remarkable performance but remain far less learning-efficient than humans. While RL agents require extensive self-play games to extract useful signals, humans often need only a few games, improving rapidly by repeatedly revisiting states where mistakes occurred. This idea, known as search control, aims to restart from valuable states rather than always from the initial state. In AlphaZero, prior work Go-Exploit applies this idea by sampling past states from self-play or search trees, but it treats all states equally, regardless of their learning potential. We propose Regret-Guided Search Control (RGSC), which extends AlphaZero with a regret network that learns to identify high-regret states, where the agent's evaluation diverges most from the actual outcome. These states are collected from both self-play trajectories and MCTS nodes, stored in a prioritized regret buffer, and reused as new starting positions. Across 9x9 Go, 10x10 Othello, and 11x11 Hex, RGSC outperforms AlphaZero and Go-Exploit by an average of 77 and 89 Elo, respectively. When training on a well-trained 9x9 Go model, RGSC further improves the win rate against KataGo from 69.3% to 78.2%, while both baselines show no improvement. These results demonstrate that RGSC provides an effective mechanism for search control, improving both efficiency and robustness of AlphaZero training. Our code is available at https://rlg.iis.sinica.edu.tw/papers/rgsc.
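The core mechanism described above, scoring states by the gap between the agent's evaluation and the actual outcome, keeping the highest-regret ones in a prioritized buffer, and sampling them as new starting positions, can be sketched as follows. This is a minimal illustration under assumed definitions (regret as the absolute value-outcome gap, regret-proportional sampling); class and method names are ours, not the paper's implementation.

```python
import heapq
import itertools

class PrioritizedRegretBuffer:
    """Fixed-capacity buffer that keeps the highest-regret states seen so far.

    Regret here is taken to be the absolute gap between the agent's value
    estimate for a state and the actual game outcome (an assumed definition
    for this sketch).
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []                   # min-heap of (regret, tiebreak, state)
        self._count = itertools.count()   # tie-breaker so states are never compared

    def add(self, state, value_estimate, outcome):
        regret = abs(value_estimate - outcome)
        entry = (regret, next(self._count), state)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif regret > self._heap[0][0]:
            # Evict the current lowest-regret entry to make room.
            heapq.heapreplace(self._heap, entry)

    def sample_start_state(self, rng):
        """Sample a stored state with probability proportional to its regret."""
        total = sum(r for r, _, _ in self._heap)
        pick = rng.uniform(0, total)
        acc = 0.0
        for regret, _, state in self._heap:
            acc += regret
            if acc >= pick:
                return state
        return self._heap[-1][2]
```

States from self-play trajectories and MCTS nodes would be pushed through `add`, and each new self-play game would call `sample_start_state` (e.g. with a `random.Random` instance) to pick its opening position.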
RGSC extends AlphaZero by identifying and prioritizing high-regret states as search-control openings for self-play in board games. It guides self-play to begin from states with higher regret, i.e., positions that the current agent has not yet mastered. The key components of RGSC are as follows:
We first investigate search control in a toy environment, an \(n\)-level sparse-reward binary tree, where each leaf node is assigned an expected reward value \(p \in [0, 1]\). Experimental results demonstrate the importance of prioritizing states with high learning potential and show the effectiveness of regret-guided search control.
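The toy environment above can be sketched as a complete binary tree in which only the leaves yield reward. The details below (uniform assignment of each leaf's expected reward \(p\), Bernoulli leaf rewards, index-based state encoding) are assumptions of this sketch, not necessarily the paper's exact construction.

```python
import random

class SparseBinaryTreeEnv:
    """n-level binary tree with stochastic rewards only at the leaves.

    Each leaf is assigned an expected reward p in [0, 1]; reaching a leaf
    yields a Bernoulli(p) reward. How p is assigned is an assumption of
    this sketch (drawn uniformly at random here).
    """

    def __init__(self, n_levels, rng=None):
        self.n_levels = n_levels
        self.rng = rng or random.Random()
        # One expected-reward value per leaf.
        self.leaf_p = [self.rng.random() for _ in range(2 ** n_levels)]
        self.reset()

    def reset(self, start_state=0):
        # A state is an index into the implicit complete binary tree:
        # 0 is the root, and the children of node i are 2i+1 and 2i+2.
        # Search control corresponds to resetting to an interior start_state
        # instead of always restarting at the root.
        self.state = start_state
        self.depth = 0
        return self.state

    def step(self, action):  # action in {0, 1}: go to left or right child
        self.state = 2 * self.state + 1 + action
        self.depth += 1
        done = self.depth == self.n_levels
        reward = 0.0
        if done:
            leaf_index = self.state - (2 ** self.n_levels - 1)
            reward = float(self.rng.random() < self.leaf_p[leaf_index])
        return self.state, reward, done
```

Because reward is sparse and only observed at the leaves, always restarting from the root wastes many episodes on already-mastered subtrees, which is exactly the setting where restarting from high-regret interior states should help.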
We compare RGSC against two baseline methods on three board games (9x9 Go, 10x10 Othello, and 11x11 Hex).
We demonstrate that RGSC can still yield improvement when training starts from an already well-trained model. We select a large 15-block baseline model trained with the AlphaZero algorithm on 9x9 Go, which already achieves strong playing strength, and continue training it to evaluate whether RGSC remains effective in this regime.
We also compare against a KataGo model of the same block size; the baseline achieves a win rate of 69.3%. After training, RGSC further improves the win rate to 78.2%, while Go-Exploit remains at 69.2% and AlphaZero achieves 70.2%.
To visualize the evolution of regret values in the prioritized regret buffer (PRB) during training, we track how the number of high-regret states decreases over time. The consistent leftward shift of the regret distributions indicates that RGSC effectively enables the model to correct its mistakes by repeatedly revisiting valuable states.
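The quantities tracked above can be sketched with two small helpers: one computing per-state regrets along a self-play trajectory, and one summarizing how much of the buffer is still high-regret. The regret definition (absolute value-outcome gap, with the outcome expressed from each state's player-to-move perspective) and the threshold value are assumptions of this sketch.

```python
def trajectory_regrets(value_estimates, outcomes):
    """Per-state regrets along one self-play trajectory.

    value_estimates[i] is the network's value for state i, and outcomes[i]
    is the final game result from that state's player-to-move perspective;
    regret is their absolute gap (an assumed definition for this sketch).
    """
    return [abs(v - z) for v, z in zip(value_estimates, outcomes)]

def high_regret_fraction(regrets, threshold=0.5):
    """Fraction of states whose regret exceeds `threshold`.

    Tracking this over training iterations makes the leftward shift of the
    regret distribution visible as a steadily decreasing curve.
    """
    if not regrets:
        return 0.0
    return sum(r > threshold for r in regrets) / len(regrets)
```

Plotting `high_regret_fraction` (or a histogram of the regrets themselves) at successive checkpoints gives the kind of distribution-shift visualization described above.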