InternVLA-A1: Unifying Understanding, Generation
and Action for Robotic Manipulation

Video Presentation

Unified Understanding-Generation-Action Framework

InternVLA-A1 unifies scene understanding, visual foresight, and action execution into a single framework of three experts (a minimal code sketch follows the list):
(1) an understanding expert: parses image and text inputs to encode scene context;
(2) a generation expert: predicts future visual states and task dynamics;
(3) an action expert: utilizes these predictions to generate control commands via Flow Matching.
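To make the data flow concrete, here is a minimal PyTorch sketch of how the three experts could fit together. Everything in it is an illustrative assumption, not the released implementation: the class name `InternVLAA1Sketch`, the toy transformer encoders standing in for the MLLM backbone, the mean-pooled conditioning, and the 10-step Euler sampler are all ours.

```python
import torch
import torch.nn as nn


def make_encoder(d_model: int, num_layers: int = 2) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)


class InternVLAA1Sketch(nn.Module):
    """Toy stand-in for the three-expert framework (not the real model)."""

    def __init__(self, d_model: int = 512, action_dim: int = 7, horizon: int = 16):
        super().__init__()
        # (1) Understanding expert: encodes fused image/text tokens into scene context.
        self.understanding = make_encoder(d_model)
        # (2) Generation expert: predicts a latent summary of future visual states.
        self.generation = make_encoder(d_model)
        # (3) Action expert: a velocity network for flow matching, conditioned on
        #     scene context, the predicted future, and the flow time t.
        self.velocity = nn.Sequential(
            nn.Linear(action_dim * horizon + 2 * d_model + 1, d_model),
            nn.GELU(),
            nn.Linear(d_model, action_dim * horizon),
        )
        self.action_dim, self.horizon = action_dim, horizon

    @torch.no_grad()
    def act(self, tokens: torch.Tensor, num_steps: int = 10) -> torch.Tensor:
        """tokens: (B, T, d_model) fused vision/language features."""
        context = self.understanding(tokens).mean(dim=1)  # scene understanding
        future = self.generation(tokens).mean(dim=1)      # visual foresight
        cond = torch.cat([context, future], dim=-1)
        # Flow-matching inference: integrate the learned velocity field from
        # Gaussian noise (t = 0) to an action chunk (t = 1) with Euler steps.
        a = torch.randn(cond.size(0), self.action_dim * self.horizon,
                        device=cond.device)
        dt = 1.0 / num_steps
        for i in range(num_steps):
            t = torch.full((cond.size(0), 1), i * dt, device=cond.device)
            a = a + dt * self.velocity(torch.cat([a, cond, t], dim=-1))
        return a.view(-1, self.horizon, self.action_dim)


model = InternVLAA1Sketch()
actions = model.act(torch.randn(2, 32, 512))  # -> (2, 16, 7) action chunk
```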

🧠 The Core: Synergizes an MLLM's semantic understanding with world-model-style dynamics prediction to "imagine" the future and guide adaptive actions.
🚀 The Fuel: Empowered by high-fidelity synthetic data (InternData-A1).
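For readers unfamiliar with Flow Matching, the sketch below shows one standard training objective (a rectified-flow-style linear interpolation path) that an action expert of this kind could be trained with. It reuses the hypothetical `model` from the sketch above; InternVLA-A1's exact loss, time schedule, and conditioning may differ.

```python
import torch
import torch.nn.functional as F


def flow_matching_loss(model, tokens, expert_actions):
    """tokens: (B, T, d_model); expert_actions: (B, horizon, action_dim)."""
    B = tokens.size(0)
    context = model.understanding(tokens).mean(dim=1)
    future = model.generation(tokens).mean(dim=1)
    cond = torch.cat([context, future], dim=-1)

    a1 = expert_actions.flatten(1)          # target action chunk (t = 1)
    a0 = torch.randn_like(a1)               # noise sample (t = 0)
    t = torch.rand(B, 1, device=a1.device)  # flow time ~ U(0, 1)
    a_t = (1.0 - t) * a0 + t * a1           # point on the straight path
    target_v = a1 - a0                      # constant velocity of that path

    pred_v = model.velocity(torch.cat([a_t, cond, t], dim=-1))
    return F.mse_loss(pred_v, target_v)
```

At inference time, the `act` method in the first sketch plays the complementary role, integrating the learned velocity field from noise back to an action chunk.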

InternVLA-A1 Architecture

Dynamic Manipulation

We designed two tasks involving manipulation in dynamic scenes: Express Sorting and In-motion Ingredient Picking. Experiments demonstrate that InternVLA-A1 exhibits exceptional robustness in such highly dynamic scenarios.

Daily Tasks

InternVLA-A1 demonstrates leading performance across diverse daily tasks, ranging from dexterous manipulation (e.g., Unscrew Cap, Zip Bag, Sort Parts) to standard manipulation (e.g., Make Sandwich, Operate Oven, Sort Rubbish).

Task 01: Sort Parts
Task 02: Zip Bag
Task 03: Unscrew Cap
Task 04: Place Flower
Task 05: Wipe Stain
Task 06: Sort Rubbish
Task 07: Sweep Trash
Task 08: Place Marker Pen

Real-world Performance

InternVLA-A1 outperforms the prior state-of-the-art models \(\pi_0\) (3.3B) and GR00T N1.5 (3B): across 10 diverse tasks, InternVLA-A1 (3B) reaches an average success rate of 75.1%, an absolute improvement of 14.5 points over \(\pi_0\).

Ablation Study

Without large-scale robot pretraining or the generation expert, InternVLA-A1's performance degrades dramatically, indicating that both components are essential.
