MMaDA-VLA

Large Diffusion Vision-Language-Action Model with
Unified Multi-Modal Instruction and Generation

Westlake University · Zhejiang University · East China University of Science and Technology · Huawei Celia Team · The Hong Kong University of Science and Technology (Guangzhou) · OpenHelix Robotics
98.0% LIBERO Avg.
4.78 CALVIN Avg. Len.
8B Parameters

Abstract

MMaDA-VLA Teaser

Vision-Language-Action (VLA) models aim to control robots for manipulation from visual observations and natural-language instructions. However, existing hierarchical and autoregressive paradigms often introduce architectural overhead, suffer from temporal inconsistency and long-horizon error accumulation, and lack a mechanism to capture environment dynamics without extra modules. To this end, we present MMaDA-VLA, a fully native pre-trained large diffusion VLA model that unifies multi-modal understanding and generation in a single framework. Our key idea is a native discrete diffusion formulation that embeds language, images, and continuous robot controls into one discrete token space and trains a single backbone with masked token denoising to jointly generate a future goal observation and an action chunk in parallel. Iterative denoising enables global, order-free refinement, improving long-horizon consistency while grounding actions in predicted future visual outcomes without auxiliary world models. Experiments across simulation benchmarks and real-world tasks show state-of-the-art performance, achieving 98.0% average success on LIBERO and 4.78 average length on CALVIN.

Unified Framework

Single backbone for multi-modal understanding and generation

Discrete Diffusion

Native token space for language, vision, and robot actions

Goal Prediction

Joint generation of goal observations and action chunks

Long-Horizon Consistency

Improved temporal consistency and reduced error accumulation over long horizons

Method

A fully native pre-trained large diffusion VLA model

MMaDA-VLA Architecture
01

Unified Tokenization

Language, images, and robot actions are embedded into a common discrete token space, enabling unified multi-modal modeling.
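
As a concrete illustration, here is a minimal sketch of how such a shared token space could be laid out, assuming a text vocabulary, a VQ image codebook, and uniform per-dimension binning of continuous controls. All sizes, offsets, and helper names are illustrative assumptions, not the paper's actual tokenizer.

```python
import numpy as np

# Hypothetical vocabulary layout: text tokens, image codebook tokens (e.g.
# from a VQ tokenizer), and discretized action tokens share one index space.
TEXT_VOCAB = 32_000          # assumed text vocabulary size
IMAGE_CODES = 8_192          # assumed VQ codebook size
ACTION_BINS = 256            # assumed bins per action dimension

IMAGE_OFFSET = TEXT_VOCAB
ACTION_OFFSET = TEXT_VOCAB + IMAGE_CODES
MASK_TOKEN = ACTION_OFFSET + ACTION_BINS  # special [MASK] token for diffusion

def actions_to_tokens(actions, low=-1.0, high=1.0):
    """Uniformly bin continuous controls in [low, high] into discrete tokens."""
    normed = (np.clip(actions, low, high) - low) / (high - low)
    bins = np.minimum((normed * ACTION_BINS).astype(int), ACTION_BINS - 1)
    return bins + ACTION_OFFSET

def tokens_to_actions(tokens, low=-1.0, high=1.0):
    """Invert the binning: map action tokens back to bin-center controls."""
    bins = tokens - ACTION_OFFSET
    return low + (bins + 0.5) / ACTION_BINS * (high - low)
```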

02

Masked Token Denoising

A single backbone is trained with masked token prediction to jointly model all modalities through iterative denoising.
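
A hedged sketch of what this objective might look like in PyTorch, assuming a per-example uniform mask ratio and cross-entropy supervision on masked positions only; the `model` interface and masking schedule are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def masked_denoising_loss(model, tokens, mask_token_id):
    """One illustrative training step: `tokens` is a (B, L) batch of discrete
    tokens covering language, image, and action spans in one sequence."""
    batch, length = tokens.shape
    # Per-example mask ratio t ~ U(0, 1), as in masked discrete diffusion.
    t = torch.rand(batch, 1, device=tokens.device)
    mask = torch.rand(batch, length, device=tokens.device) < t
    corrupted = torch.where(
        mask, torch.full_like(tokens, mask_token_id), tokens)
    logits = model(corrupted)                 # (B, L, vocab)
    # Supervise only the masked positions to recover the originals.
    return F.cross_entropy(logits[mask], tokens[mask])
```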

03

Parallel Generation

Goal observation and action chunk are generated in parallel, enabling global refinement and better long-horizon planning.
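
One plausible realization of such parallel decoding is confidence-based iterative unmasking: start from an all-[MASK] span covering the goal image and action chunk, then commit the most confident predictions at each step. The step count and unmasking rule below are assumptions, not the paper's exact sampler.

```python
import torch

@torch.no_grad()
def parallel_decode(model, prompt_tokens, gen_len, mask_token_id, steps=8):
    """Fill a masked span (goal image + action chunk) by iterative unmasking."""
    device = prompt_tokens.device
    seq = torch.cat([
        prompt_tokens,
        torch.full((gen_len,), mask_token_id,
                   dtype=prompt_tokens.dtype, device=device),
    ]).unsqueeze(0)                                 # (1, prompt + gen_len)
    start = prompt_tokens.numel()
    for step in range(steps):
        logits = model(seq)[0, start:]              # (gen_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)
        still_masked = seq[0, start:] == mask_token_id
        if not still_masked.any():
            break
        # Commit the most confident fraction of still-masked positions.
        k = max(1, int(still_masked.sum().item() / (steps - step)))
        conf = conf.masked_fill(~still_masked, -1.0)
        idx = conf.topk(k).indices
        seq[0, start + idx] = pred[idx]
    return seq[0, start:]
```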

04

Closed-Loop Control

Iterative denoising provides order-free refinement, grounding actions in predicted visual outcomes without auxiliary modules.
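
Putting the pieces together, a minimal closed-loop rollout might replan a fresh goal observation and action chunk from the latest camera frames, execute the chunk, and repeat, reusing the helpers sketched above. The `env` and `tokenizer` interfaces and the span constants are hypothetical.

```python
import numpy as np

GOAL_LEN = 256   # assumed length of the goal-image token span
ACT_DIM = 7      # assumed action dimensionality (e.g. 6-DoF pose + gripper)

def run_closed_loop(env, model, tokenizer, instruction, max_steps=200, chunk=8):
    """Illustrative closed-loop rollout: each outer iteration re-denoises a
    goal observation + action chunk conditioned on the latest observation."""
    obs = env.reset()                                # hypothetical env API
    for _ in range(0, max_steps, chunk):
        prompt = tokenizer.encode(instruction, obs)  # hypothetical tokenizer
        out = parallel_decode(model, prompt,
                              gen_len=GOAL_LEN + chunk * ACT_DIM,
                              mask_token_id=MASK_TOKEN)
        action_tokens = out[GOAL_LEN:]               # discard goal-image span
        actions = tokens_to_actions(
            action_tokens.cpu().numpy()).reshape(chunk, ACT_DIM)
        for a in actions:                            # execute the chunk
            obs, done = env.step(a)
            if done:
                return True
    return False
```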

Results

State-of-the-art performance on simulation benchmarks

Benchmark Performance

Real-World Experiments

Experiment Setup

  • AgileX PiPER Robot Arm
  • Realsense D435 (Third-View)
  • DX200-2.8mm (Wrist-View)
  • 300 demos & 30 trials per task

Performance Comparison

Success rate per task (30 trials each):

Task             GR00T N1.6    MMaDA-VLA (Ours)
Pick-and-Place   63.3%         86.7%
Stacking         70.0%         93.3%
Storage          66.7%         93.3%
Organizing       56.7%         83.3%

Demos

Demo videos for different tasks may be played back at different speeds

Pick-and-Place Task

put the banana into the big/small blue bowl

Stacking Task

stack the white block on the blue/red/yellow block

Storage Task

1. open the drawer; 2. place the pink/red plush keychain/white block; 3. close the drawer

Organizing Task

organize the bowls and cups on the table

Citation

@article{liu2026mmadavla,
  author    = {Yang Liu and Pengxiang Ding and Tengyue Jiang and Xudong Wang and Minghui Lin and Wenxuan Song and Hongyin Zhang and Zifeng Zhuang and Han Zhao and Wei Zhao and Siteng Huang and Jinkui Shi and Donglin Wang},
  title     = {{MMaDA-VLA}: Large Diffusion Vision-Language-Action Model with Unified Multi-Modal Instruction and Generation},
  journal   = {CoRR},
  volume    = {abs/2603.25406},
  year      = {2026},
  url       = {https://arxiv.org/abs/2603.25406}
}