InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation
Video Presentation
Unified Understanding-Generation-Action Framework
InternVLA-A1 unifies scene understanding, visual foresight, and action execution into a single framework:
(1) an understanding expert: parses image and text inputs to encode scene context;
(2) a generation expert: predicts future visual states and task dynamics;
(3) an action expert: uses these predictions to generate control commands via Flow Matching (see the sketch below).
🧠 The Core: Couples the MLLM's semantic understanding with world-model-style dynamics prediction, letting the model "imagine" the future and guide adaptive actions.
🚀 The Fuel: Empowered by high-fidelity synthetic data (InternData-A1).
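The sketch below illustrates how the three experts could be wired together in PyTorch: the understanding expert encodes image and text tokens into a scene context, the generation expert produces latent tokens for predicted future visual states, and the action expert integrates a learned Flow Matching velocity field conditioned on both to produce an action chunk. All module structures, dimensions, and step counts are illustrative assumptions, not the released InternVLA-A1 implementation.

```python
# Minimal sketch of the understanding-generation-action pipeline (assumed design).
import torch
import torch.nn as nn


class UnderstandingExpert(nn.Module):
    """Encodes image + text tokens into a shared scene context (MLLM-style)."""
    def __init__(self, dim=1024):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, vision_tokens, text_tokens):
        return self.backbone(torch.cat([vision_tokens, text_tokens], dim=1))


class GenerationExpert(nn.Module):
    """Predicts latent tokens for future visual states (visual foresight)."""
    def __init__(self, dim=1024, num_future_tokens=64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_future_tokens, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)

    def forward(self, context):
        q = self.queries.expand(context.size(0), -1, -1)
        return self.decoder(q, context)


class ActionExpert(nn.Module):
    """Flow Matching head: learns a velocity field over action chunks."""
    def __init__(self, dim=1024, action_dim=14, horizon=16):
        super().__init__()
        self.action_dim, self.horizon = action_dim, horizon
        self.proj_in = nn.Linear(action_dim + 1, dim)   # noisy actions + flow time t
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.proj_out = nn.Linear(dim, action_dim)

    def velocity(self, noisy_actions, t, context):
        # noisy_actions: (B, horizon, action_dim), t: (B, 1, 1)
        x = torch.cat([noisy_actions, t.expand(-1, self.horizon, 1)], dim=-1)
        return self.proj_out(self.decoder(self.proj_in(x), context))

    @torch.no_grad()
    def sample(self, context, steps=10):
        """Integrate the velocity field from noise to an action chunk (Euler steps)."""
        b = context.size(0)
        a = torch.randn(b, self.horizon, self.action_dim, device=context.device)
        dt = 1.0 / steps
        for i in range(steps):
            t = torch.full((b, 1, 1), i * dt, device=context.device)
            a = a + dt * self.velocity(a, t, context)
        return a


def internvla_a1_step(vision_tokens, text_tokens, experts):
    """End-to-end inference: understand -> imagine the future -> act."""
    understanding, generation, action = experts
    context = understanding(vision_tokens, text_tokens)       # scene context
    foresight = generation(context)                            # predicted future latents
    full_context = torch.cat([context, foresight], dim=1)      # condition actions on both
    return action.sample(full_context)                         # action chunk via Flow Matching
```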
Dynamic Manipulation
We designed two tasks involving manipulation in dynamic scenes: Express Sorting and In-motion Ingredient Picking. Experiments demonstrate that InternVLA-A1 exhibits exceptional robustness in such highly dynamic scenarios.
Daily Tasks
InternVLA-A1 demonstrates leading performance across diverse daily tasks, ranging from dexterous manipulation (e.g., Unscrew Cap, Zip Bag, Sort Parts) to regular manipulation (e.g., Make Sandwich, Operate Oven, Sort Rubbish).
Real-world Performance
InternVLA-A1 outperforms the prior state-of-the-art models \(\pi_0\) (3.3B) and GR00T N1.5 (3B): InternVLA-A1 (3B) reaches an average success rate of 75.1% across 10 diverse tasks, a 14.5% absolute improvement over \(\pi_0\).
Ablation Study
Removing either large-scale robot pretraining or the generation expert degrades InternVLA-A1's performance dramatically.