Game Solving with Online Fine-Tuning

Ti-Rong Wu1, Hung Guei1, Ting Han Wei2, Chung-Chin Shih1,3, Jui-Te Chin3, I-Chen Wu1,3
1 Academia Sinica, 2 University of Alberta, 3 National Yang Ming Chiao Tung University

Abstract

Game solving is a similar, yet more difficult, task than mastering a game. Solving a game typically means finding the game-theoretic value (the outcome given optimal play), and optionally a full strategy to follow in order to achieve that outcome. The AlphaZero algorithm has demonstrated super-human level play, and its powerful policy and value predictions have also served as heuristics in game solving. However, to solve a game and obtain a full strategy, a winning response must be found for all possible moves by the losing player. This includes very poor lines of play from the losing side, which the AlphaZero self-play process will not encounter. AlphaZero-based heuristics can be highly inaccurate when evaluating these out-of-distribution positions, which occur throughout the entire search. To address this issue, this paper investigates applying online fine-tuning while searching and proposes two methods to learn tailor-designed heuristics for game solving. Our experiments show that using online fine-tuning can solve a series of challenging 7x7 Killall-Go problems, using only 23.54% of the computation time required by the baseline without online fine-tuning. Results suggest that the savings scale with problem size. Our method can further be extended to any tree search algorithm for problem solving. Our code is available at https://rlg.iis.sinica.edu.tw/papers/neurips2023-online-fine-tuning-solver.

Online Fine-Tuning Solver

assets/online-fine-tuning-solver-architecture.svg
The proposed online fine-tuning game solver comprises a manager, a set of workers, and an online fine-tuning trainer.
  • The manager maintains a search tree. During the proof search, the manager (1) employs a heuristic (i.e., an AlphaZero network model) to assign jobs to the workers to solve; (2) forwards solved positions or critical positions to the trainer for fine-tuning.
  • The workers use the same heuristic as the manager, attempt to solve the jobs within given constraints, and then return the solutions.
  • The online fine-tuning trainer fine-tunes the heuristic using the received positions, then updates the heuristic used by the manager and workers.

In online fine-tuning, the manager continuously forwards positions that require fine-tuning (solved and critical positions) to the trainer, so that the heuristic adapts during the search to the out-of-distribution positions the solver encounters.
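Below is a minimal, illustrative sketch of this manager-worker-trainer loop in Python. The heuristic, the proof search, and the training step are toy stand-ins; none of these names come from the actual implementation (see the released code linked above).

```python
"""Minimal sketch of the manager / workers / online fine-tuning trainer loop.
Every name and data format here is an illustrative assumption, not the actual
implementation; see the released code linked above for the real solver."""
import queue
import random
import threading

class Heuristic:
    """Stand-in for the AlphaZero network shared by the manager and workers."""
    def __init__(self):
        self.version = 0

    def update(self):
        # In the real system this would reload fine-tuned network weights.
        self.version += 1

def worker(jobs, results, heuristic):
    """Attempt to solve each assigned position within a fixed budget."""
    while (job := jobs.get()) is not None:
        # Placeholder for a proof search guided by the heuristic; we pretend
        # a fresher (fine-tuned) heuristic makes positions easier to solve.
        solved = random.random() < 0.2 + 0.1 * heuristic.version
        results.put({"position": job, "solved": solved})

def trainer(to_train, heuristic):
    """Fine-tune the heuristic on positions forwarded by the manager."""
    while to_train.get() is not None:
        heuristic.update()  # placeholder for a gradient step + weight push

def manager(openings, n_workers=4):
    heuristic = Heuristic()
    jobs, results, to_train = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=worker, args=(jobs, results, heuristic))
               for _ in range(n_workers)]
    tuner = threading.Thread(target=trainer, args=(to_train, heuristic))
    for t in workers + [tuner]:
        t.start()

    unsolved = list(openings)
    while unsolved:
        for position in unsolved:
            jobs.put(position)
        still_open = []
        for _ in range(len(unsolved)):
            result = results.get()
            # Both solved positions (SP) and hard, still-unsolved critical
            # positions (CP) are forwarded for online fine-tuning.
            to_train.put(result["position"])
            if not result["solved"]:
                still_open.append(result["position"])
        unsolved = still_open

    for _ in workers:
        jobs.put(None)      # shut down the workers
    to_train.put(None)      # shut down the trainer
    for t in workers + [tuner]:
        t.join()
    print("all openings solved with heuristic version", heuristic.version)

if __name__ == "__main__":
    manager(["JA", "KB", "DA", "SA"])
```

In the real solver, the workers run an AlphaZero-guided proof search within their given constraints, and the trainer performs gradient updates on the network before broadcasting the new weights back to the manager and workers.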

Experiment Results

assets/openings.svg
We select a set of three-move openings based on expert recommendations. These openings fall into four groups, named after the first move they share: Jump (J), Knight's move (K), Diagonal jump (D), and Stretch (S).
Each opening is named according to its group and position; for instance, JA represents the position resulting from Black playing at A in the jump group.
assets/solving-performance.svg
Our experiments compare a BASELINE solver without online fine-tuning against three online fine-tuning variants: ONLINE-SP fine-tunes on solved positions, ONLINE-CP on critical positions, and ONLINE-SP+CP on both. The best variant, ONLINE-CP, solves the full series of challenging 7x7 Killall-Go problems using only 23.54% of the baseline's computation time. Total solving times:
  • BASELINE: 1,021,692 seconds.
  • ONLINE-SP: 538,092 seconds.
  • ONLINE-CP: 240,556 seconds.
  • ONLINE-SP+CP: 255,359 seconds.
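As a quick check, the reported percentage can be reproduced from the times listed above (a minimal sketch):

```python
# Reproduce the reported ratio: each solver's total time as a fraction of
# the baseline (e.g., ONLINE-CP -> 240,556 / 1,021,692 ≈ 23.54%).
times = {
    "BASELINE": 1_021_692,
    "ONLINE-SP": 538_092,
    "ONLINE-CP": 240_556,
    "ONLINE-SP+CP": 255_359,
}
baseline = times["BASELINE"]
for name, seconds in times.items():
    print(f"{name}: {seconds / baseline:.2%} of the baseline time")
```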

Solution Trees

This section provides the solution trees (in Smart Game Format, SGF) produced by the BASELINE, ONLINE-SP, ONLINE-CP, and ONLINE-SP+CP solvers. Please refer to the detailed instructions to view these solution trees.
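As an illustration, an SGF solution tree can be inspected with any SGF-aware tool; below is a minimal sketch using the third-party sgfmill package (an assumption of ours; the repository's own viewing instructions may differ), on a toy SGF string rather than an actual solution tree from this work.

```python
# Minimal sketch: inspecting an SGF game tree with the third-party
# `sgfmill` package (pip install sgfmill). The SGF string below is a toy
# example, not an actual solution tree produced by the solvers.
from sgfmill import sgf

data = b"(;GM[1]SZ[7];B[dd](;W[cc];B[dc])(;W[ed];B[ec]))"
game = sgf.Sgf_game.from_bytes(data)

def count_leaves(node):
    # A full strategy must answer every move by the losing side, so the
    # solution tree branches on the opponent's options; each leaf is a
    # terminal (proven) position.
    children = [node[i] for i in range(len(node))]
    if not children:
        return 1
    return sum(count_leaves(child) for child in children)

print("board size:", game.get_size())                         # -> 7
print("terminal positions:", count_leaves(game.get_root()))   # -> 2
```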