1 Academia Sinica,
2 National Yang Ming Chiao Tung University,
3 Kochi University of Technology
Abstract
This paper presents MiniZero, a zero-knowledge learning framework that supports four state-of-the-art
algorithms, including AlphaZero, MuZero, Gumbel AlphaZero, and Gumbel MuZero. While these algorithms have
demonstrated super-human performance in many games, it remains unclear which among them is most suitable or
efficient for specific tasks. Through MiniZero, we systematically evaluate the performance of each
algorithm
in two board games, 9x9 Go and 8x8 Othello, as well as 57 Atari games. For the two board games, using more simulations generally results in higher performance. However, the choice between AlphaZero and MuZero may differ
based on game properties. For Atari games, both MuZero and Gumbel MuZero are worth considering. Since each
game has unique characteristics, different algorithms and simulations yield varying results. In addition, we
introduce an approach, called progressive simulation, which progressively increases the simulation budget
during training to allocate computation more efficiently. Our empirical results demonstrate that progressive
simulation achieves significantly superior performance in both board games. By making our framework and
trained models publicly available, this paper contributes a benchmark for future research on zero-knowledge
learning algorithms, assisting researchers in algorithm selection and comparison against these
zero-knowledge learning baselines. Our code and data are available at
https://rlg.iis.sinica.edu.tw/papers/minizero.
Architecture
MiniZero supports AlphaZero, MuZero, Gumbel AlphaZero, and Gumbel MuZero.
Its architecture comprises four components: a server, a set of self-play workers, an optimization worker, and data storage.
Server controls the training process and manages the workers. In each iteration, it first instructs all self-play workers to generate self-play games simultaneously. Then, it instructs the optimization worker to load the latest game records and start network updates. This process is repeated until the maximum number of training iterations is reached.
Self-play worker interacts with the environment to produce self-play games. Each worker maintains multiple MCTS instances to play multiple games simultaneously. Finished self-play games are sent to the server, which forwards them to the data storage.
Optimization worker updates the network using the collected self-play games. It loads self-play games from the data storage, updates the network, and stores the updated network back into the data storage.
Data storage stores network files and self-play games. It uses the Network File System (NFS) for
sharing data across different machines.
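The interaction among these components can be summarized by the minimal sketch below. The class and method names (e.g., DataStorage, SelfPlayWorker, run_server) are illustrative assumptions rather than MiniZero's actual interfaces.

```python
# Minimal sketch of MiniZero's iterative training loop. All class and method
# names below are illustrative stand-ins, not the actual MiniZero interfaces.

class DataStorage:
    """Stores network files and self-play games (shared via NFS in MiniZero)."""
    def __init__(self):
        self.networks, self.games = ["network_iter_0"], []
    def load_latest_network(self): return self.networks[-1]
    def save_network(self, network): self.networks.append(network)
    def save_games(self, games): self.games.extend(games)
    def load_latest_games(self): return list(self.games)

class SelfPlayWorker:
    """Runs multiple MCTS instances to play several games simultaneously."""
    def generate_self_play_games(self, network, num_games=4):
        return [f"game_by_{network}"] * num_games

class OptimizationWorker:
    """Updates the network using the collected self-play games."""
    def update(self, network, games):
        return f"{network}_updated_on_{len(games)}_games"

def run_server(num_workers=2, max_iterations=3):
    """Server: controls the training process and manages the workers."""
    storage = DataStorage()
    workers = [SelfPlayWorker() for _ in range(num_workers)]
    optimizer = OptimizationWorker()
    for _ in range(max_iterations):
        network = storage.load_latest_network()
        # Phase 1: all self-play workers generate self-play games simultaneously;
        # finished games are sent to the server and forwarded to the data storage.
        for worker in workers:
            storage.save_games(worker.generate_self_play_games(network))
        # Phase 2: the optimization worker loads the latest game records,
        # updates the network, and stores the updated network back.
        storage.save_network(optimizer.update(network, storage.load_latest_games()))

run_server()
```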
Furthermore, MiniZero implements several improvement methods.
MCTS estimated Q value for non-visited actions is an improvement that initializes all non-visited nodes to an estimated Q value instead of 0, as illustrated in the sketch below.
Progressive simulation for Gumbel Zero is an improvement that gradually increases the number of
simulations during training.
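The sketch below illustrates the estimated-Q improvement inside PUCT child selection. The Node structure and the particular estimate used here (a visit-weighted average over visited siblings, falling back to the parent's Q) are assumptions for illustration, not the exact MiniZero formulation.

```python
from dataclasses import dataclass, field
import math

# Sketch of PUCT child selection with the estimated-Q improvement: non-visited
# children are scored with an estimated Q value instead of a default of 0.

@dataclass
class Node:
    prior: float                       # prior probability from the policy network
    visit_count: int = 0
    value_sum: float = 0.0
    children: list = field(default_factory=list)

    @property
    def q_value(self) -> float:
        return self.value_sum / self.visit_count if self.visit_count > 0 else 0.0

def estimated_q(parent: Node) -> float:
    """Estimated Q used to initialize non-visited children (illustrative choice)."""
    visited = [c for c in parent.children if c.visit_count > 0]
    if not visited:
        return parent.q_value
    total = sum(c.visit_count for c in visited)
    return sum(c.visit_count * c.q_value for c in visited) / total

def select_child(parent: Node, c_puct: float = 1.25) -> Node:
    """PUCT selection; unvisited children use estimated_q(parent) instead of 0."""
    sqrt_n = math.sqrt(max(1, parent.visit_count))
    best, best_score = None, -float("inf")
    for child in parent.children:
        q = child.q_value if child.visit_count > 0 else estimated_q(parent)
        u = c_puct * child.prior * sqrt_n / (1 + child.visit_count)
        if q + u > best_score:
            best, best_score = child, q + u
    return best
```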
Experiment Results
We evaluate the performance of four zero-knowledge learning algorithms: AlphaZero, MuZero, Gumbel AlphaZero,
and Gumbel MuZero.
The four algorithms are denoted as α0, μ0, g-α0, and
g-μ0, respectively.
The number of MCTS simulations used during training is denoted by n, e.g., n = 200 indicates training with 200 simulations.
For training with progressive simulation, we use a range, e.g., n ∈ [2, 200] indicates that the number of simulations progressively increases from 2 to 200 (see the schedule sketch below).
Through MiniZero, we compare the performance of these algorithms with different simulation budgets on 9x9 Go, 8x8 Othello, and 57 Atari games.
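To make the progressive simulation notation concrete, the sketch below shows one way a budget range such as n ∈ [2, 200] could be mapped onto training iterations. The linear schedule and the function name progressive_simulation_budget are assumptions; the exact schedule used by MiniZero may differ.

```python
# Illustrative progressive simulation schedule: the simulation budget n grows
# from n_min to n_max over the course of training. The linear growth below is
# an assumption for illustration, not MiniZero's exact schedule.

def progressive_simulation_budget(iteration, max_iterations, n_min=2, n_max=200):
    """Return the MCTS simulation count for the given training iteration."""
    fraction = iteration / max(1, max_iterations - 1)
    return round(n_min + fraction * (n_max - n_min))

# Example: with 300 training iterations, early iterations use few simulations
# (cheap self-play), while later iterations use the full budget.
if __name__ == "__main__":
    for it in (0, 75, 150, 225, 299):
        print(it, progressive_simulation_budget(it, 300))
```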
Board Games
For the two board games, using more simulations generally results in higher performance.
The choice between AlphaZero and MuZero may differ based on game properties.
Gumbel Zero with fewer simulations can achieve performance nearly on par with AlphaZero/MuZero when
trained for equivalent time.
Progressive simulation achieves significantly superior performance.
Atari Games
For the 57 Atari games, different algorithms and simulations yield varying results.
Both MuZero and Gumbel MuZero are worth considering, since each game has unique characteristics.
Training with n = 50 generally yields better results; however, notably, n = 50 does not consistently outperform n = 18 and n = 2.