Planning with options (sequences of primitive actions) has been shown to be effective in reinforcement learning within complex environments. Previous studies have focused on planning with predefined options or with options learned from expert demonstration data. Inspired by MuZero, which learns superhuman heuristics without any human knowledge, we propose a novel approach named OptionZero. OptionZero incorporates an option network into MuZero, enabling autonomous discovery of options through self-play games. Furthermore, we modify the dynamics network to provide environment transitions when using options, allowing deeper search under the same simulation constraints. Empirical experiments conducted on 26 Atari games demonstrate that OptionZero outperforms MuZero, achieving a 131.58% improvement in mean human-normalized score. Our behavior analysis shows that OptionZero not only learns options but also acquires strategic skills tailored to different game characteristics. Our findings point to promising directions for discovering and using options in planning. Our code is available at https://rlg.iis.sinica.edu.tw/papers/optionzero.
Building on the MuZero algorithm, OptionZero incorporates an option network into the prediction network to predict an option and its probability, modifies the dynamics network to predict the next environment state after executing an option, and modifies Monte Carlo Tree Search (MCTS) to utilize options during planning. A rough sketch of these network modifications is given below, followed by a description of each phase of MCTS in OptionZero:
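The following PyTorch-style sketch is only an illustration of the ideas above, not the authors' implementation; the class names, the per-step option head with a stop token, and parameters such as the state dimension and option length are assumptions.

```python
# Minimal sketch (assumptions, not the paper's implementation): the prediction
# network gains an "option head" next to the usual policy and value heads, and
# the dynamics network consumes an encoded option (a sequence of up to L
# primitive actions) so that one recurrent step can jump several moves.
import torch
import torch.nn as nn


class PredictionNet(nn.Module):
    def __init__(self, state_dim=256, num_actions=18, max_option_len=3):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU())
        self.policy_head = nn.Linear(256, num_actions)   # as in MuZero
        self.value_head = nn.Linear(256, 1)              # as in MuZero
        # Hypothetical option head: per step, a distribution over the
        # primitive actions plus a "stop" token that ends the option early.
        self.option_head = nn.Linear(256, max_option_len * (num_actions + 1))
        self.max_option_len = max_option_len
        self.num_actions = num_actions

    def forward(self, hidden_state):
        h = self.trunk(hidden_state)
        policy_logits = self.policy_head(h)
        value = self.value_head(h)
        option_logits = self.option_head(h).view(
            -1, self.max_option_len, self.num_actions + 1)
        return policy_logits, value, option_logits


class DynamicsNet(nn.Module):
    """Hypothetical dynamics network that accepts an option (action sequence)."""

    def __init__(self, state_dim=256, num_actions=18, max_option_len=3):
        super().__init__()
        self.num_actions = num_actions
        self.max_option_len = max_option_len
        self.net = nn.Sequential(
            nn.Linear(state_dim + max_option_len * num_actions, 256),
            nn.ReLU(),
            nn.Linear(256, state_dim))
        self.reward_head = nn.Linear(state_dim, 1)

    def forward(self, hidden_state, option_actions):
        # option_actions: LongTensor of shape (batch, k) with k <= max_option_len;
        # a single primitive action is simply an option of length one.
        one_hot = nn.functional.one_hot(option_actions, self.num_actions).float()
        pad = self.max_option_len - option_actions.shape[1]
        one_hot = nn.functional.pad(one_hot, (0, 0, 0, pad))  # pad unused steps
        x = torch.cat([hidden_state, one_hot.flatten(1)], dim=-1)
        next_state = self.net(x)
        reward = self.reward_head(next_state)  # reward accumulated over the option
        return next_state, reward
```

Under this framing, a primitive action is an option of length one, which is why the \(\ell_1\) model in the experiments below reduces to MuZero.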
We train OptionZero in GridWorld with the maximum option length set to nine. The figure shows the options learned by OptionZero at different stages of training. In the final stage (100%), the model has learned the optimal shortest path using options.
We evaluate OptionZero on 26 Atari games with the maximum option length set to 1, 3, and 6, denoted as \(\ell_1\) (blue), \(\ell_3\) (orange), and \(\ell_6\) (green), respectively. The model \(\ell_1\) serves as a baseline and is identical to MuZero. Our experiments show that both \(\ell_3\) and \(\ell_6\) outperform the baseline \(\ell_1\), with \(\ell_3\) achieving the best overall performance.
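For reference, the mean human-normalized score cited in the abstract is conventionally computed per game from random-play and human reference scores and then averaged over games; the snippet below assumes this standard definition and uses placeholder numbers, not results from the paper.

```python
# Standard Atari human-normalized score (assumed definition); the game names
# and all scores below are placeholders for illustration only.
def human_normalized(agent: float, random: float, human: float) -> float:
    """Return the human-normalized score as a percentage."""
    return 100.0 * (agent - random) / (human - random)

# Hypothetical per-game scores: (agent, random, human).
games = {"game_a": (5000.0, 100.0, 10000.0), "game_b": (30.0, 2.0, 35.0)}
scores = [human_normalized(a, r, h) for a, r, h in games.values()]
print(f"mean human-normalized score: {sum(scores) / len(scores):.2f}%")
```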