MuZero has achieved superhuman performance in various games by using a dynamics network to predict the environment dynamics for planning, without relying on simulators. However, the latent states learned by the dynamics network make its planning process opaque. This paper aims to demystify MuZero’s model by interpreting the learned latent states. We incorporate observation reconstruction and state consistency into MuZero training and conduct an in-depth analysis to evaluate latent states across two board games: 9x9 Go and Gomoku, and three Atari games: Breakout, Ms. Pacman, and Pong. Our findings reveal that while the dynamics network becomes less accurate over longer simulations, MuZero still performs effectively by using planning to correct errors. Our experiments also show that the dynamics network learns better latent states in board games than in Atari games. These insights contribute to a better understanding of MuZero and offer directions for future research to improve the performance, robustness, and interpretability of the MuZero algorithm. The code and data are available at https://rlg.iis.sinica.edu.tw/papers/demystifying-muzero-planning.
MuZero uses latent states for planning: it transforms observations \(o_t\) into hidden states \(s_t\) at time step \(t\) and applies the learned dynamics network to simulate action transitions. To interpret the information captured in MuZero’s latent states, we extend MuZero with a decoder network that decodes hidden states \(s_t\) into reconstructed observations \(\hat{o}_t\). Our experiments show that MuZero with the decoder learns well: it not only performs comparably to the original MuZero, but also produces precise reconstructed observations.
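As a concrete illustration, the sketch below shows one way a decoder can be attached to MuZero’s hidden states and trained with an auxiliary reconstruction loss. This is a minimal PyTorch-style sketch under our own assumptions (layer sizes, the MSE loss, and the spatial latent shape are illustrative, not the exact architecture used in the paper).

```python
import torch
import torch.nn as nn

class LatentDecoder(nn.Module):
    """Maps a hidden state s_t back to an observation-shaped tensor (o_hat_t)."""
    def __init__(self, latent_channels: int = 64, obs_channels: int = 3):
        super().__init__()
        # Illustrative convolutional head; the real decoder architecture may differ.
        self.net = nn.Sequential(
            nn.Conv2d(latent_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, obs_channels, kernel_size=3, padding=1),
        )

    def forward(self, s_t: torch.Tensor) -> torch.Tensor:
        return self.net(s_t)

def reconstruction_loss(decoder: LatentDecoder,
                        s_t: torch.Tensor,
                        o_t: torch.Tensor) -> torch.Tensor:
    """Auxiliary reconstruction term (MSE here as an assumption) added to the training objective."""
    return nn.functional.mse_loss(decoder(s_t), o_t)
```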
The table below compares MuZero trained with and without the decoder. For the board games, results are ratings relative to the w/o-decoder baseline (normalized to 1000); for the Atari games, results are game scores.

| Game | w/ Decoder | w/o Decoder |
|---|---|---|
| Go | 1088.74 | 1000.00 |
| Gomoku | 1048.96 | 1000.00 |
| Breakout | 358.90 | 383.17 |
| Ms. Pacman | 4528.70 | 3732.80 |
| Pong | 19.65 | 20.07 |
*Figure: original observations \(o_t\) (top row) and their decoder reconstructions \(\hat{o}_t\) (bottom row) for Go, Gomoku, Breakout, Ms. Pacman, and Pong.*
We investigate how the dynamics network degrades over multi-step action unrolling. Specifically, in each game, we use the dynamics network to unroll hidden states, then use the decoder network to reconstruct their observations, and finally apply PCA to project these reconstructions for visualization. Four kinds of unrolled trajectories are visualized in this projection.
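A minimal sketch of this unroll-decode-project pipeline is shown below; `dynamics` and `decode` are hypothetical stand-ins for the trained dynamics and decoder networks, and the toy usage at the end exists only to make the snippet self-contained.

```python
import numpy as np
from sklearn.decomposition import PCA

def unroll_and_project(initial_state, actions, dynamics, decode, n_components=2):
    """Unroll hidden states with the dynamics network, decode each state,
    and project the flattened reconstructions with PCA."""
    states = [initial_state]
    for a in actions:                                        # multi-step action unrolling
        states.append(dynamics(states[-1], a))
    recons = np.stack([decode(s).ravel() for s in states])   # one row per unroll step
    return PCA(n_components=n_components).fit_transform(recons)

if __name__ == "__main__":
    # Toy stand-ins; the actual analysis uses the trained MuZero networks.
    rng = np.random.default_rng(0)
    dyn = lambda s, a: s + 0.1 * rng.standard_normal(s.shape)
    dec = lambda s: s                                        # identity "decoder" for the toy run
    coords = unroll_and_project(rng.standard_normal(16), range(8), dyn, dec)
    print(coords.shape)                                      # (9, 2): initial state + 8 unrolled steps
```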
To see how these inaccuracies affect planning, we decode the hidden states of nodes within the MuZero search tree. In the following Gomoku example, the left board of each node shows the true observation and the action, while the right board shows the observation reconstructed from the node’s hidden state. Valid states yield clear reconstructions, whereas invalid states (Nodes C, F, G, H, and I) produce blurred ones, indicating the model’s unfamiliarity with these states. Despite this unfamiliarity, the value estimates remain consistent even for nodes beyond terminal states (Nodes F, G, H, and I).
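The sketch below shows one way such an in-tree inspection can be implemented: walk the search tree and decode every node’s hidden state, keeping the action path and value estimate alongside each reconstruction. `Node` and `decode` are illustrative stand-ins, not MuZero’s actual data structures.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class Node:
    hidden_state: Any                                           # latent state at this node
    value: float = 0.0                                          # value estimate from the prediction network
    children: Dict[int, "Node"] = field(default_factory=dict)   # action -> child node

def decode_tree(root: Node, decode: Callable[[Any], Any]) -> List[dict]:
    """Return one record per node: its action path, decoded observation, and value."""
    records, stack = [], [(root, [])]
    while stack:
        node, path = stack.pop()
        records.append({
            "path": path,                                       # actions leading to this node
            "reconstruction": decode(node.hidden_state),        # decoded (possibly blurred) board
            "value": node.value,
        })
        for action, child in node.children.items():
            stack.append((child, path + [action]))
    return records
```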
Moreover, when averaging over multiple \(N\)-step predictions (the \(N\)-step mean value in the paper), value errors shrink or remain bounded. In Go and Gomoku, averaging over a larger \(N\) mitigates single-step fluctuations; in Pong, value errors are small and unaffected by \(N\), reflecting the simplicity of the task. Overall, these findings show that unrolling errors can be mitigated or bounded, so MuZero can still plan effectively even when individual unrolled values are inaccurate.
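One way to read this averaging is sketched below: simply average the value predictions obtained after unrolling 1 through \(N\) steps. The names `dynamics` and `value` are hypothetical stand-ins, and this is an illustration of the averaging idea rather than the paper’s exact metric.

```python
def n_step_mean_value(state, actions, dynamics, value, N):
    """Average the value predictions from hidden states unrolled 1..N steps (assumes N >= 1)."""
    values = []
    for a in actions[:N]:                 # unroll up to N steps along the recorded actions
        state = dynamics(state, a)        # hidden state after one more dynamics step
        values.append(value(state))       # value predicted from the unrolled state
    return sum(values) / len(values)      # mean over the N predictions
```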
Finally, we evaluate the impact of the number of MCTS simulations on playing strength. Each game’s baseline performance (400 simulations per move) is normalized to 100%, and the relative performance at other simulation counts is plotted.
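The normalization itself is straightforward; a small sketch (with made-up inputs, not results from the paper) is shown below.

```python
def relative_performance(perf_by_sims: dict, baseline_sims: int = 400) -> dict:
    """Scale each result so the baseline simulation count maps to 100%."""
    baseline = perf_by_sims[baseline_sims]
    return {sims: 100.0 * perf / baseline for sims, perf in perf_by_sims.items()}

# Example with made-up numbers: prints {100: 95.0, 400: 100.0, 1600: 104.0}
print(relative_performance({100: 1045.0, 400: 1100.0, 1600: 1144.0}))
```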
In summary, MuZero corrects prediction inaccuracies up to a certain depth, but beyond that, additional simulations can reduce performance. For board games, MuZero learns more accurate dynamics networks, allowing its performance to scale effectively with increased simulations.
The following are links to the MuZero models with decoders for each game. Please check out our code repository for instructions on how to use them.