Reinforcement learning (RL) agents achieve remarkable performance but remain far less learning-efficient than humans. While RL agents require extensive self-play games to extract useful signals, humans often need only a few games, improving rapidly by repeatedly revisiting states where mistakes occurred. This idea, known as search control, aims to restart from valuable states rather than always from the initial state. In AlphaZero, prior work Go-Exploit applies this idea by sampling past states from self-play or search trees, but it treats all states equally, regardless of their learning potential. We propose Regret-Guided Search Control (RGSC), which extends AlphaZero with a regret network that learns to identify high-regret states, where the agent's evaluation diverges most from the actual outcome. These states are collected from both self-play trajectories and MCTS nodes, stored in a prioritized regret buffer, and reused as new starting positions. Across 9x9 Go, 10x10 Othello, and 11x11 Hex, RGSC outperforms AlphaZero and Go-Exploit by an average of 77 and 89 Elo, respectively. When training on a well-trained 9x9 Go model, RGSC further improves the win rate against KataGo from 69.3% to 78.2%, while both baselines show no improvement. These results demonstrate that RGSC provides an effective mechanism for search control, improving both efficiency and robustness of AlphaZero training. Our code is available at https://rlg.iis.sinica.edu.tw/papers/rgsc.
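The core mechanism described above, scoring states by the gap between the agent's evaluation and the actual outcome, keeping the highest-regret ones in a prioritized buffer, and sampling them as new starting positions, can be sketched as follows. This is a minimal illustration under assumed definitions (regret as the absolute value-outcome gap, regret-proportional sampling); class and method names are ours, not the paper's implementation.

```python
import heapq
import itertools

class PrioritizedRegretBuffer:
    """Fixed-capacity buffer that keeps the highest-regret states seen so far.

    Regret here is taken to be the absolute gap between the agent's value
    estimate for a state and the actual game outcome (an assumed definition
    for this sketch).
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []                   # min-heap of (regret, tiebreak, state)
        self._count = itertools.count()   # tie-breaker so states are never compared

    def add(self, state, value_estimate, outcome):
        regret = abs(value_estimate - outcome)
        entry = (regret, next(self._count), state)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif regret > self._heap[0][0]:
            # Evict the current lowest-regret entry to make room.
            heapq.heapreplace(self._heap, entry)

    def sample_start_state(self, rng):
        """Sample a stored state with probability proportional to its regret."""
        total = sum(r for r, _, _ in self._heap)
        pick = rng.uniform(0, total)
        acc = 0.0
        for regret, _, state in self._heap:
            acc += regret
            if acc >= pick:
                return state
        return self._heap[-1][2]
```

States from self-play trajectories and MCTS nodes would be pushed through `add`, and each new self-play game would call `sample_start_state` (e.g. with a `random.Random` instance) to pick its opening position.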
RGSC extends AlphaZero by identifying and prioritizing high-regret states as search-control openings for self-play in board games. It guides self-play to begin from states with higher regret, i.e., positions that the current agent has not yet mastered. The key components of RGSC are as follows:
We first investigate search control in a toy environment, an \(n\)-level sparse-reward binary tree, where each leaf node is assigned an expected reward value \(p \in [0, 1]\). Experimental results demonstrate the importance of prioritizing states with high learning potential and show the effectiveness of regret-guided search control.
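The toy environment above can be sketched as a complete binary tree in which only the leaves yield reward. The details below (uniform assignment of each leaf's expected reward \(p\), Bernoulli leaf rewards, index-based state encoding) are assumptions of this sketch, not necessarily the paper's exact construction.

```python
import random

class SparseBinaryTreeEnv:
    """n-level binary tree with stochastic rewards only at the leaves.

    Each leaf is assigned an expected reward p in [0, 1]; reaching a leaf
    yields a Bernoulli(p) reward. How p is assigned is an assumption of
    this sketch (drawn uniformly at random here).
    """

    def __init__(self, n_levels, rng=None):
        self.n_levels = n_levels
        self.rng = rng or random.Random()
        # One expected-reward value per leaf.
        self.leaf_p = [self.rng.random() for _ in range(2 ** n_levels)]
        self.reset()

    def reset(self, start_state=0):
        # A state is an index into the implicit complete binary tree:
        # 0 is the root, and the children of node i are 2i+1 and 2i+2.
        # Search control corresponds to resetting to an interior start_state
        # instead of always restarting at the root.
        self.state = start_state
        self.depth = 0
        return self.state

    def step(self, action):  # action in {0, 1}: go to left or right child
        self.state = 2 * self.state + 1 + action
        self.depth += 1
        done = self.depth == self.n_levels
        reward = 0.0
        if done:
            leaf_index = self.state - (2 ** self.n_levels - 1)
            reward = float(self.rng.random() < self.leaf_p[leaf_index])
        return self.state, reward, done
```

Because reward is sparse and only observed at the leaves, always restarting from the root wastes many episodes on already-mastered subtrees, which is exactly the setting where restarting from high-regret interior states should help.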
We compare RGSC against two baseline methods on three board games (9x9 Go, 10x10 Othello, and 11x11 Hex).
We demonstrate that RGSC can still yield improvement when training starts from an already well-trained model. We select a large 15-block baseline model trained with the AlphaZero algorithm on 9x9 Go, which already achieves strong playing strength, and continue training it to evaluate whether RGSC remains effective in this regime.
We also compare against a KataGo model of the same block size; the baseline achieves a win rate of 69.3%. After training, RGSC further improves the win rate to 78.2%, while Go-Exploit remains at 69.2% and AlphaZero achieves 70.2%.
To visualize the evolution of regret values in the prioritized regret buffer (PRB) during training, we track how the number of high-regret states decreases over time. The consistent leftward shift of the regret distributions indicates that RGSC effectively enables the model to correct its mistakes by repeatedly revisiting valuable states.
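The quantities tracked above can be sketched with two small helpers: one computing per-state regrets along a self-play trajectory, and one summarizing how much of the buffer is still high-regret. The regret definition (absolute value-outcome gap, with the outcome expressed from each state's player-to-move perspective) and the threshold value are assumptions of this sketch.

```python
def trajectory_regrets(value_estimates, outcomes):
    """Per-state regrets along one self-play trajectory.

    value_estimates[i] is the network's value for state i, and outcomes[i]
    is the final game result from that state's player-to-move perspective;
    regret is their absolute gap (an assumed definition for this sketch).
    """
    return [abs(v - z) for v, z in zip(value_estimates, outcomes)]

def high_regret_fraction(regrets, threshold=0.5):
    """Fraction of states whose regret exceeds `threshold`.

    Tracking this over training iterations makes the leftward shift of the
    regret distribution visible as a steadily decreasing curve.
    """
    if not regrets:
        return 0.0
    return sum(r > threshold for r in regrets) / len(regrets)
```

Plotting `high_regret_fraction` (or a histogram of the regrets themselves) at successive checkpoints gives the kind of distribution-shift visualization described above.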