InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation

Video Presentation

Unified Understanding-Generation-Action Framework

InternVLA-A1 Teaser


🔮 The Core: Combines an MLLM's semantic understanding with world-model-style dynamics prediction, so the model can "imagine" the future and adapt its actions accordingly.
🚀 The Fuel: Trains jointly on heterogeneous data sources spanning real-world robot data, synthetic simulation data, and egocentric human videos.
⚡ The Output: Targets highly dynamic manipulation scenarios.

InternVLA-A1 Method

InternVLA-A1 unifies scene understanding, visual foresight, and action execution via a Mixture-of-Transformers (MoT) framework:
(1) an understanding expert: parses image and text inputs to encode scene context;
(2) a generation expert: predicts future visual states and task dynamics;
(3) an action expert: utilizes these predictions to generate control commands via Flow Matching.
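
The data flow between the three experts can be sketched as below. This is a minimal illustration under stated assumptions, not the InternVLA-A1 implementation: the function names, the toy feature arithmetic, and `toy_velocity` are all hypothetical stand-ins, and the real experts are transformer branches inside the MoT rather than plain functions. The sketch only shows the general Flow Matching sampling pattern, where an action is obtained by integrating a velocity field from random noise (t = 0) to an action (t = 1) with Euler steps. That time convention is also an assumption.

```python
import random
from typing import Callable, List

Vector = List[float]
# Hypothetical signature: velocity(x, t, context, foresight) -> dx/dt
VelocityField = Callable[[Vector, float, Vector, Vector], Vector]


def understanding_expert(image: Vector, instruction: str) -> Vector:
    """Stand-in: fuse image features and text into a context vector."""
    text_feature = (sum(ord(c) for c in instruction) % 100) / 100.0
    return [pixel + text_feature for pixel in image]


def generation_expert(context: Vector) -> Vector:
    """Stand-in: predict a future visual state from the context."""
    # Toy dynamics (assumption): every feature drifts by a fixed amount.
    return [c + 0.1 for c in context]


def toy_velocity(x: Vector, t: float, context: Vector, foresight: Vector) -> Vector:
    """Stand-in for a learned velocity network.

    Uses the closed-form velocity of a straight path from the current
    sample to a made-up target action (predicted change in the scene).
    """
    target = [f - c for c, f in zip(context, foresight)]
    return [(a - xi) / (1.0 - t) for a, xi in zip(target, x)]


def action_expert(
    context: Vector,
    foresight: Vector,
    velocity: VelocityField,
    steps: int = 10,
    seed: int = 0,
) -> Vector:
    """Sample an action by Euler-integrating the velocity field from noise."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in context]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k / steps  # t stays strictly below 1, so toy_velocity never divides by zero
        v = velocity(x, t, context, foresight)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    ctx = understanding_expert([0.2, 0.4, 0.6], "sort the parcel")
    future = generation_expert(ctx)
    print(action_expert(ctx, future, toy_velocity))
```

In this toy setup the velocity field is known in closed form, so the Euler integration lands on the target action regardless of the initial noise. In a trained model the velocity would instead come from a network conditioned on the understanding and generation outputs.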

Dynamic Manipulation Tasks


InternVLA-A1 is particularly strong in highly dynamic scenarios, such as Express Sorting and In-motion Ingredient Picking, where it outperforms pi0.5 by 26.7%.

Real-world Performance Results

Static Manipulation Tasks

InternVLA-A1 demonstrates superior proficiency in dexterous and fine-grained manipulation (e.g., Unscrew Cap, Zip Bag, Sort Parts).

Task 01: Sort Parts
Task 02: Zip Bag
Task 03: Unscrew Cap
Task 04: Place Flower
Task 05: Wipe Stain
Task 06: Sort Rubbish
Task 07: Sweep Trash
Task 08: Place Markpen


Ablation Studies

Impact of Pre-training: without large-scale pre-training, performance degrades dramatically. This highlights pre-training as a critical inductive prior.

Ablation Study on Pre-training

Impact of Pre-training Dataset: jointly pre-training on heterogeneous data sources (human videos, synthetic data, and real-world demonstrations) achieves the best overall performance, demonstrating the effectiveness of the joint training strategy.

Ablation Study on Pre-training Dataset

Impact of Generation Expert: removing the generation expert reduces the average success rate from 77.0% to 57.6%, a drop of 19.4 percentage points. This supports the value of both the generation expert and the unified architecture integrating understanding, generation, and action.

Ablation Study on Generation Expert
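
To make the size of the generation-expert ablation concrete, the short calculation below works out the absolute and relative drop from the two success rates quoted above. The helper name `success_rate_drop` is hypothetical and used only for illustration.

```python
from typing import Tuple


def success_rate_drop(full: float, ablated: float) -> Tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = full - ablated
    relative = absolute / full * 100.0
    return absolute, relative


if __name__ == "__main__":
    # Average success rates reported for the full model and the ablated model.
    absolute, relative = success_rate_drop(77.0, 57.6)
    print(f"{absolute:.1f} percentage points, {relative:.1f}% relative")
```

Under these numbers, removing the generation expert costs 19.4 percentage points, which is roughly a quarter of the full model's average success rate.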