MaIL: Improving Imitation Learning with Selective State Space Models

Xiaogang Jia¹˒², Qian Wang¹, Atalay Donat¹, Bowen Xing¹, Ge Li¹, Hongyi Zhou², Onur Celik¹, Denis Blessing¹, Rudolf Lioutikov², Gerhard Neumann¹
¹Autonomous Learning Robots (ALR), ²Intuitive Robots Lab (IRL)
Karlsruhe Institute of Technology

Abstract

This work presents Mamba Imitation Learning (MaIL), a novel imitation learning (IL) architecture that offers an alternative to state-of-the-art (SoTA) Transformer-based policies. MaIL leverages Mamba, a state-space model designed to selectively focus on key features of the data. While Transformers are highly effective in data-rich settings thanks to their dense attention mechanisms, they can struggle with smaller datasets, often overfitting or learning suboptimal representations. Mamba's architecture, in contrast, improves representation learning efficiency by focusing on key features and reducing model complexity, which mitigates overfitting and improves generalization even with limited data. Extensive evaluations on the LIBERO benchmark demonstrate that MaIL consistently outperforms Transformers on all LIBERO tasks when data is limited, and matches their performance when the full dataset is available. MaIL's effectiveness is further validated by its superior performance in three real-robot experiments.

Decoder-Only Mamba
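To make the architecture concrete, below is a minimal sketch of a decoder-only Mamba policy, assuming the reference mamba_ssm package (https://github.com/state-spaces/mamba, CUDA required). The layer count, hidden size, and module names are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn
from mamba_ssm import Mamba


class DecoderOnlyMambaPolicy(nn.Module):
    """Illustrative decoder-only Mamba policy: a stack of residual Mamba
    blocks over a sequence of observation tokens."""

    def __init__(self, obs_dim: int, act_dim: int, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        self.obs_proj = nn.Linear(obs_dim, d_model)  # embed observation tokens
        self.layers = nn.ModuleList(
            [Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2) for _ in range(n_layers)]
        )
        self.norm = nn.LayerNorm(d_model)
        self.action_head = nn.Linear(d_model, act_dim)  # final linear action layer

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, history_len, obs_dim), e.g. an H1 or H5 state history
        x = self.obs_proj(obs_seq)
        for layer in self.layers:
            x = x + layer(x)  # residual Mamba block over the token sequence
        x = self.norm(x)
        return self.action_head(x[:, -1])  # predict the action from the last token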

Encoder-Decoder Mamba
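Since Mamba has no cross-attention, the sketch below conditions the decoder by prefixing the encoder outputs to the decoder's input sequence; this prefix-conditioning is an assumption made for illustration, not necessarily the paper's exact coupling mechanism. It reuses the mamba_ssm API from the previous sketch.

import torch
import torch.nn as nn
from mamba_ssm import Mamba


def mamba_stack(d_model: int, n_layers: int) -> nn.ModuleList:
    return nn.ModuleList(
        [Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2) for _ in range(n_layers)]
    )


class EncoderDecoderMamba(nn.Module):
    """Illustrative encoder-decoder Mamba: the encoder summarizes observation
    tokens, and the decoder consumes that context as a prefix before its own
    action/query tokens."""

    def __init__(self, obs_dim: int, act_dim: int, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        self.obs_proj = nn.Linear(obs_dim, d_model)
        self.act_proj = nn.Linear(act_dim, d_model)  # embeds action/query tokens
        self.encoder = mamba_stack(d_model, n_layers)
        self.decoder = mamba_stack(d_model, n_layers)
        self.action_head = nn.Linear(d_model, act_dim)

    def forward(self, obs_seq: torch.Tensor, act_queries: torch.Tensor) -> torch.Tensor:
        # obs_seq: (B, T_obs, obs_dim); act_queries: (B, T_act, act_dim)
        ctx = self.obs_proj(obs_seq)
        for layer in self.encoder:
            ctx = ctx + layer(ctx)
        # prepend the encoder context so the decoder's SSM state absorbs it
        x = torch.cat([ctx, self.act_proj(act_queries)], dim=1)
        for layer in self.decoder:
            x = x + layer(x)
        return self.action_head(x[:, -act_queries.shape[1]:])  # actions for the query slots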

Results on LIBERO Benchmark

Performance on the LIBERO benchmark with 20% of the data. "w/o language" means no language instructions are used; "w/ language" means language tokens from a pre-trained CLIP model are used. H1 and H5 denote conditioning on the current state and on a 5-step state history, respectively. "D" and "ED" denote Decoder-Only and Encoder-Decoder variants, "Tr" and "Ma" denote Transformer and MaIL, and "BC" and "DDP" denote Behavior Cloning and DDPM.
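As a hedged sketch of the "w/ language" setting, the snippet below encodes a LIBERO-style task instruction with OpenAI's pre-trained CLIP (https://github.com/openai/CLIP). The checkpoint choice, the example instruction, and the suggested fusion with the policy's tokens are assumptions for illustration.

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)  # checkpoint choice is illustrative

instruction = "pick up the ketchup and place it in the basket"  # hypothetical task string
tokens = clip.tokenize([instruction]).to(device)
with torch.no_grad():
    lang_emb = model.encode_text(tokens)  # (1, 512) text embedding

# One plausible use (an assumption, not the paper's stated design): project
# lang_emb to d_model and prepend it as an extra token alongside the
# state/image tokens fed to the policy.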

Quantifying State Representations

To better understand the advantages of MaIL over Transformers, we conducted a detailed analysis of the latent representations produced by both methods. Specifically, we compared the Encoder-Decoder variants of MaIL and the Transformer by visualizing their high-dimensional representations with t-SNE. We used BC-based models trained on the full set of LIBERO-Object demonstrations and rolled them out over entire trajectories. The visualizations show the latent representations immediately before the final (linear) action prediction layer.
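A minimal sketch of this analysis is shown below, assuming the pre-action-head activations were dumped to disk during rollouts; the file names, coloring variable, and t-SNE hyperparameters are illustrative assumptions.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# latents: (num_steps, d_model) activations collected just before the final
# linear action layer; file names here are hypothetical
latents = np.load("mail_latents.npy")
labels = np.load("mail_timesteps.npy")  # e.g., per-step trajectory progress

embedded = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(latents)

plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=5, cmap="viridis")
plt.colorbar(label="trajectory progress")
plt.title("t-SNE of pre-action latent states")
plt.show()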

MaIL Evaluation

LIBERO

Real Robot Experiments

Pick-Place

Inserting

CupStacking

BibTeX

@inproceedings{jia2024mail,
  title     = {Ma{IL}: Improving Imitation Learning with Selective State Space Models},
  author    = {Xiaogang Jia and Qian Wang and Atalay Donat and Bowen Xing and Ge Li and Hongyi Zhou and Onur Celik and Denis Blessing and Rudolf Lioutikov and Gerhard Neumann},
  booktitle = {8th Annual Conference on Robot Learning},
  year      = {2024},
  url       = {https://openreview.net/forum?id=IssXUYvVTg}
}