Learning Human-Like RL Agents Through Trajectory Optimization With Action Quantization

Jian-Ting Guo1*, Yu-Cheng Chen1*, Ping-Chun Hsieh1†, Kuo-Hao Ho1, Po-Wei Huang1, Ti-Rong Wu2†, I-Chen Wu1,2
1 National Yang Ming Chiao Tung University
2 Academia Sinica
* Equal contribution † Corresponding author

Abstract

Human-like agents have long been one of the goals in pursuing artificial intelligence. Although reinforcement learning (RL) has achieved superhuman performance in many domains, relatively little attention has been focused on designing human-like RL agents. As a result, many reward-driven RL agents often exhibit unnatural behaviors compared to humans, raising concerns for both interpretability and trustworthiness. To achieve human-like behavior in RL, this paper first formulates human-likeness as trajectory optimization, where the objective is to find an action sequence that closely aligns with human behavior while also maximizing rewards, and adapts the classic receding-horizon control to human-like learning as a tractable and efficient implementation. To achieve this, we introduce Macro Action Quantization (MAQ), a human-like RL framework that distills human demonstrations into macro actions via Vector-Quantized VAE. Experiments on D4RL Adroit benchmarks show that MAQ significantly improves human-likeness, increasing trajectory similarity scores, and achieving the highest human-likeness rankings among all RL agents in the human evaluation study. Our results also demonstrate that MAQ can be easily integrated into various off-the-shelf RL algorithms, opening a promising direction for learning human-like RL agents. Our code is available at https://rlg.iis.sinica.edu.tw/papers/MAQ.

Human-Like Reinforcement Learning

Human-like reinforcement learning remains underexplored in the RL community. Most research focuses on designing reward-driven agents; only a few studies investigate human-like RL, which seeks both human-like behavior and optimal performance. Moreover, most of these methods rely on pre-defined behavior constraints or rule-based penalties, requiring substantial effort for handcrafted design.

Video comparisons: RLPD vs. MAQ+RLPD, SAC vs. MAQ+SAC, and IQL vs. MAQ+IQL, each pairing the baseline agent with its MAQ counterpart.

Macro Action Quantization

We propose a human-likeness-aware framework called Macro Action Quantization (MAQ) that consists of two components: (a) human behavior distillation and (b) reinforcement learning with macro actions.

MAQ Architecture A
Human behavior distillation: We first train a Conditional VQ-VAE to distill macro actions \(m_t=(a_t,a_{t+1},\ldots,a_{t+H-1})\) from human demonstrations into a discrete codebook; the macro actions are extracted by sliding a window of length \(H\) over the demonstration action trajectories.
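A minimal PyTorch sketch of this distillation step is given below. The module name, layer sizes, codebook size, and loss weighting are illustrative assumptions, not the paper's exact architecture; it only shows the standard conditional VQ-VAE recipe applied to H-step action windows.

```python
# Illustrative Conditional VQ-VAE for distilling macro actions into a discrete codebook.
# Layer sizes, codebook size K, and loss weights are assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MacroActionVQVAE(nn.Module):
    def __init__(self, state_dim, action_dim, horizon_h, num_codes=64, latent_dim=32):
        super().__init__()
        self.horizon_h = horizon_h
        macro_dim = action_dim * horizon_h  # flattened (a_t, ..., a_{t+H-1})
        # Encoder: macro action + conditioning state -> continuous latent
        self.encoder = nn.Sequential(
            nn.Linear(macro_dim + state_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Discrete codebook of K latent codes
        self.codebook = nn.Embedding(num_codes, latent_dim)
        # Decoder: quantized code + conditioning state -> reconstructed macro action
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + state_dim, 256), nn.ReLU(),
            nn.Linear(256, macro_dim),
        )

    def quantize(self, z):
        # Nearest codebook entry, with straight-through gradient to the encoder
        dists = torch.cdist(z, self.codebook.weight)   # (B, K)
        idx = dists.argmin(dim=-1)                      # (B,)
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()                    # straight-through estimator
        return z_q, idx

    def forward(self, state, macro_action):
        # macro_action: H-step window sliced from a human demonstration, flattened to (B, H*action_dim)
        z = self.encoder(torch.cat([macro_action, state], dim=-1))
        z_q, idx = self.quantize(z)
        recon = self.decoder(torch.cat([z_q, state], dim=-1))
        # Standard VQ-VAE objective: reconstruction + codebook + commitment terms
        codebook_loss = F.mse_loss(self.codebook(idx), z.detach())
        commit_loss = F.mse_loss(z, self.codebook(idx).detach())
        recon_loss = F.mse_loss(recon, macro_action)
        return recon, idx, recon_loss + codebook_loss + 0.25 * commit_loss
```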
MAQ Architecture B
Reinforcement learning with macro actions: We then train an online policy \(\pi_\theta\) that acts in the learned discrete code space by selecting codebook indices; the decoded macro actions then interact with the environment.
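The sketch below illustrates how a discrete policy could act in the learned code space: the frozen decoder maps a selected codebook index (conditioned on the current state) to an H-step macro action that is executed primitive step by primitive step. The wrapper class and its interface are hypothetical names for illustration.

```python
# Illustrative sketch: the agent's discrete action is a codebook index; the frozen
# VQ-VAE decoder maps (state, code) to an H-step macro action executed in the environment.
import torch

class MacroActionEnvWrapper:
    """Hypothetical wrapper: one wrapped step runs up to H primitive environment steps."""
    def __init__(self, env, vqvae, horizon_h, action_dim):
        self.env, self.vqvae = env, vqvae
        self.horizon_h, self.action_dim = horizon_h, action_dim

    def step(self, state, code_index):
        # Decode the selected codebook entry into a macro action (a_t, ..., a_{t+H-1})
        with torch.no_grad():
            z_q = self.vqvae.codebook(torch.as_tensor([code_index]))
            s = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
            macro = self.vqvae.decoder(torch.cat([z_q, s], dim=-1))
            macro = macro.view(self.horizon_h, self.action_dim).numpy()
        total_reward, done, info = 0.0, False, {}
        for a in macro:  # execute the decoded primitive actions in sequence
            state, reward, done, info = self.env.step(a)
            total_reward += reward
            if done:
                break
        return state, total_reward, done, info
```

On this wrapped MDP, \(\pi_\theta\) outputs a distribution over the K codebook indices, so any off-the-shelf RL algorithm (IQL, SAC, or RLPD in our experiments) can be plugged in with a discrete action head.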

Experiment Results

Trajectory Similarity Evaluation

Two trajectory similarity metrics are used to evaluate how closely agent behaviors align with human demonstrations: Dynamic Time Warping (DTW) and Wasserstein Distance (WD). We incorporate MAQ into three different RL algorithms: IQL, SAC, and RLPD. The results show that MAQ significantly improves the similarity scores of all three base algorithms on every task.
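For concreteness, the sketch below shows one way to compute these two metrics between an agent trajectory and a human trajectory. The exact distance functions and any normalization used in the paper may differ; in particular, averaging per-dimension 1-D Wasserstein distances is a simplifying assumption.

```python
# Illustrative trajectory-similarity metrics (not necessarily the paper's exact variants).
import numpy as np
from scipy.stats import wasserstein_distance

def dtw_distance(traj_a, traj_b):
    """Dynamic Time Warping with Euclidean frame distance; traj_* are (T, D) arrays."""
    n, m = len(traj_a), len(traj_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(traj_a[i - 1] - traj_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def avg_wasserstein(traj_a, traj_b):
    """Average 1-D Wasserstein distance over dimensions (a simplifying assumption)."""
    return np.mean([wasserstein_distance(traj_a[:, d], traj_b[:, d])
                    for d in range(traj_a.shape[1])])
```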

Trajectory Similarity Score with Different Macro Action Lengths

We evaluate trajectory similarity with different macro action lengths in MAQ+RLPD. The results show that both the trajectory similarity scores and the success rates improve as the macro action length increases.
Trajectory similarity scores and success rates with different lengths in MAQ+RLPD

Human Evaluation Study

We further evaluate the human-likeness of the agents through a human evaluation study, conducted as a two-stage questionnaire: a Turing Test and a human-likeness ranking test.

Turing Test

Evaluators answer a series of two-alternative forced-choice (2AFC) questions: each question shows two videos, one from a human demonstration and the other from a trained agent, and evaluators choose which one they believe was performed by a human.

The results of the Turing Test (technique comparison heatmap with error bars)

Human-likeness ranking test

Similar to the Turing Test, this stage also uses 2AFC questions, but each question shows two agent videos rather than one human video and one agent video. Each evaluator completes enough pairwise comparisons to cover all agent pairs, which allows us to rank the agents by human-likeness.
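As a simple illustration of how such pairwise votes can be turned into a ranking, the sketch below aggregates 2AFC outcomes by win rate; this is an assumed aggregation scheme, not necessarily the one used in the study.

```python
# Illustrative aggregation of pairwise 2AFC votes into a human-likeness ranking by win rate.
from collections import defaultdict

def rank_by_win_rate(votes):
    """votes: list of (winner_agent, loser_agent) pairs from 2AFC questions."""
    wins, totals = defaultdict(int), defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        totals[winner] += 1
        totals[loser] += 1
    return sorted(totals, key=lambda agent: wins[agent] / totals[agent], reverse=True)

# Usage with placeholder agent labels:
print(rank_by_win_rate([("A", "B"), ("A", "C"), ("B", "C")]))  # ['A', 'B', 'C']
```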

The results of the human-likeness ranking test (bar chart)

Model Download