Human-like agents have long been one of the goals in the pursuit of artificial intelligence. Although reinforcement learning (RL) has achieved superhuman performance in many domains, relatively little attention has been paid to designing human-like RL agents. As a result, many reward-driven RL agents exhibit behaviors that appear unnatural compared to humans, raising concerns about both interpretability and trustworthiness. To achieve human-like behavior in RL, this paper first formulates human-likeness as a trajectory optimization problem, where the objective is to find an action sequence that closely aligns with human behavior while also maximizing rewards, and adapts classic receding-horizon control to human-like learning as a tractable and efficient implementation. We then introduce Macro Action Quantization (MAQ), a human-like RL framework that distills human demonstrations into macro actions via a Vector-Quantized VAE. Experiments on the D4RL Adroit benchmarks show that MAQ significantly improves human-likeness, increases trajectory similarity scores, and achieves the highest human-likeness rankings among all RL agents in the human evaluation study. Our results also demonstrate that MAQ can be easily integrated into various off-the-shelf RL algorithms, opening a promising direction for learning human-like RL agents. Our code is available at https://rlg.iis.sinica.edu.tw/papers/MAQ.
Human-like reinforcement learning remains underexplored in the RL community. Most research focuses on designing reward-driven agents; only a few studies investigate human-like RL, which seeks both human-like behavior and optimal performance. However, most of these methods rely on pre-defined behavior constraints or rule-based penalties, requiring substantial effort for handcrafted design.
[Figure: results comparing MAQ+RLPD, MAQ+SAC, and MAQ+IQL (Ours) against their base algorithms RLPD, SAC, and IQL across tasks.]
We propose a human-likeness-aware framework called Macro Action Quantization (MAQ) that consists of two components: (a) human behavior distillation and (b) reinforcement learning with macro actions.
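As a minimal sketch of the first component, the snippet below quantizes fixed-length action segments from human demonstrations into a discrete codebook via a VQ-VAE; the layer sizes, segment length, and loss weighting are illustrative assumptions rather than the exact architecture used in MAQ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MacroActionVQVAE(nn.Module):
    """Illustrative VQ-VAE that distills human action segments into a
    discrete codebook of macro actions (dimensions are placeholders)."""

    def __init__(self, action_dim, segment_len, codebook_size=64, latent_dim=32):
        super().__init__()
        in_dim = action_dim * segment_len
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))

    def forward(self, segments):
        # segments: (batch, segment_len, action_dim) from human demonstrations
        z = self.encoder(segments.flatten(1))
        # Nearest codebook entry gives the discrete macro-action index.
        dists = torch.cdist(z, self.codebook.weight)
        idx = dists.argmin(dim=1)
        z_q = self.codebook(idx)
        # Straight-through estimator so gradients flow back to the encoder.
        z_st = z + (z_q - z).detach()
        recon = self.decoder(z_st).view_as(segments)
        # Standard VQ-VAE losses: reconstruction + codebook + commitment.
        loss = (F.mse_loss(recon, segments)
                + F.mse_loss(z_q, z.detach())
                + 0.25 * F.mse_loss(z, z_q.detach()))
        return recon, idx, loss
```

Once trained, the codebook indices could serve as a discrete macro-action space for the second component: the RL policy selects a code and the decoder expands it into a short action sequence.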
Two trajectory similarity metrics are used to evaluate how closely agent behaviors align with human demonstrations: Dynamic Time Warping (DTW) and Wasserstein Distance (WD). We then incorporate MAQ into three different RL algorithms: IQL, SAC, and RLPD. The results show that MAQ significantly improves the similarity scores of all the original RL algorithms in each task.
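To illustrate the two metrics, the sketch below computes a basic DTW distance and a per-dimension Wasserstein distance between an agent trajectory and a human demonstration; the per-dimension averaging and any normalization are assumptions, not necessarily the exact evaluation protocol used in the experiments.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def dtw_distance(a, b):
    """Basic dynamic time warping between trajectories a: (T1, d), b: (T2, d),
    using Euclidean distances between frames."""
    T1, T2 = len(a), len(b)
    cost = np.full((T1 + 1, T2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[T1, T2]

def trajectory_wasserstein(a, b):
    """1-D Wasserstein distance per dimension, averaged (illustrative choice)."""
    return np.mean([wasserstein_distance(a[:, k], b[:, k]) for k in range(a.shape[1])])

# Lower scores mean the agent trajectory is closer to the human demonstration:
# dtw_distance(agent_traj, human_traj), trajectory_wasserstein(agent_traj, human_traj)
```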
We further evaluate the human-likeness of the agents through a human evaluation study, conducted as a two-stage questionnaire: a Turing Test and a human-likeness ranking test.
In the Turing Test, evaluators answer several two-alternative forced-choice (2AFC) questions: they are shown two videos, one from the human demonstrations and the other from a trained agent, and are asked to choose which one was performed by a human.
Similar to the Turing Test, the ranking test also uses 2AFC questions, but both videos are generated by trained agents rather than pairing an agent video with a human video. Each evaluator completes multiple questions covering all agent pairs, and the resulting pairwise comparisons are used to rank the human-likeness of the agents.
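One simple way to turn such pairwise 2AFC answers into a ranking is to count, for each agent, how often its video is judged more human-like; the win-rate aggregation below is an illustrative sketch, not necessarily the scoring rule used in the study.

```python
from collections import defaultdict

def rank_agents(pairwise_answers):
    """pairwise_answers: list of (agent_a, agent_b, winner) tuples, where
    winner is the agent whose video was judged more human-like.
    Returns agents sorted by win rate (higher = more human-like)."""
    wins, totals = defaultdict(int), defaultdict(int)
    for a, b, winner in pairwise_answers:
        totals[a] += 1
        totals[b] += 1
        wins[winner] += 1
    win_rate = {agent: wins[agent] / totals[agent] for agent in totals}
    return sorted(win_rate, key=win_rate.get, reverse=True)

# Hypothetical usage:
# answers = [("MAQ+RLPD", "RLPD", "MAQ+RLPD"), ("MAQ+SAC", "SAC", "MAQ+SAC")]
# print(rank_agents(answers))
```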