InternVLA-M1

Latent Spatial Grounding for Instruction-Following Robotic Manipulation

Abstract

We introduce InternVLA-M1, a unified framework for spatial grounding and robot control that advances instruction-following robots toward general-purpose intelligence. Its core idea is spatially guided vision-language-action training, where spatial grounding serves as the critical link between instructions and robot actions. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning data to determine “where to act” by aligning instructions with visual, embodiment-agnostic positions, and (ii) spatially guided action post-training to decide “how to act” by generating embodiment-aware actions through plug-and-play spatial prompting. This spatially guided training recepit yields consistent gains: InternVLA-M1 outperforms its variant without spatial guidance by +13.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka. To further scale instruction following, we built a simulation engine to collect 244K pick-and-place episodes, enabling a 6.2% average improvement across 200 tasks and 3K+ objects. In real-world clustered pick-and-place, InternVLA-M1 improved by 7.3%, and with synthetic co-training, achieved +20.6% on unseen objects and novel configurations. Moreover, in long-horizon reasoning-intensive scenarios, it surpassed existing works by over 10 points. These results highlight spatially guided training as a unifying principle for scalable and resilient generalist robots.

Model Overview

InternVLA-M1 integrates spatial grounding into the vision–language–action training pipeline. Given a task instruction, the VLM planner produces latent plans through explicit spatial prompting, which then effectively guides the action expert to generate control signals.

Simulation Data Generation

The pipeline automatically generates diverse instruction-following robotic manipulation data from a large asset library, incorporating intermediate representations such as Box, Point, and Trajectory, which can be further converted into VLM spatial grounding data.

Results

Watch InternVLA-M1 perform instruction-following manipulation tasks in both large-scale simulated environments and real-world tasks.

Instruction-Following Manipulation

Cluttered-scene Pick-and-Place

Long-horizon and Reasoning Manipulation

Make Sandwich

Purchase Goods

Sort Objects

Sort to Drawer

Math Calculation

Experimental Results

InternVLA-M1 demonstrates superior performance across various challenging scenarios

Performance Comparison on Simpler Env and Libero Benchmark

GR00T

π0

InternVLA-M1

Effects of Spatially Guided VLA Training

Vanilla VLA

InternVLA-M1

System2 Spatial Reasoning Results

Demonstrating InternVLA's System 2 capabilities in box detection, point localization, and visual trace prediction.

📦 Box Detection

Precise bounding box detection and object localization

"Indicate points within the vacant area that lies between the blue cup and the teal bowl on the table."

"the lady on the right with the red necktie."

"What is the oval-shaped, charcoal gray object around the vintage camera."

"a bush of plant behind middle woman."

"Put the lettuce on the plate."

"put the grapes on the basket."

"What is the closest object to cabinet?"

"Put the bun on the plate."

"Put the pear in the white basket."

"a girl sitting on a bench near a boy."

"could you please give me the glue stick."

"Indicate points within the vacant area that lies between the blue cup and the teal bowl on the table."

"the lady on the right with the red necktie."

"What is the oval-shaped, charcoal gray object around the vintage camera."

"a bush of plant behind middle woman."

"Put the lettuce on the plate."

"put the grapes on the basket."

"What is the closest object to cabinet?"

"Put the bun on the plate."

"Put the pear in the white basket."

"a girl sitting on a bench near a boy."

"could you please give me the glue stick."

📍 Point Localization

Precise keypoint localization and spatial analysis

"What on the table is used for computing tasks?"

"a toddler in gray pants and a striped shirt."

"press the blue botton."

"put the grapes on the basket."

"soup with chicken and carrots and yellow broth."

"a small bird hanging on the net."

"could you find the light brown lion."

"I hope to get the small blue oblong box."

"I want to get the large toothpaste box."

"put the eggplant on the purple plate."

"Locate points within the unoccupied space that lies before the leftmost fruit on the table."

"Put the pear in the white basket."

"Put the lettuce on the plate."

"What on the table is used for computing tasks?"

"a toddler in gray pants and a striped shirt."

"press the blue botton."

"put the grapes on the basket."

"soup with chicken and carrots and yellow broth."

"a small bird hanging on the net."

"could you find the light brown lion."

"I hope to get the small blue oblong box."

"I want to get the large toothpaste box."

"put the eggplant on the purple plate."

"Locate points within the unoccupied space that lies before the leftmost fruit on the table."

"Put the pear in the white basket."

"Put the lettuce on the plate."

📈 Trajectory Prediction

Intelligent trajectory prediction and motion path planning

"put the coke on the front side of the table."

"Put the pear in the white basket."

"Put the lettuce on the plate."

"put eggplant into yellow basket."

"open the middle drawer."

"close the drawer."

"open the middle drawer."

"stack the green block on the yellow one."

"put the grapes on the basket."

"put the spoon on the towel."

"put coke."

"place apple into top drawer."

"put carrot on plate."

"close the drawer."

"Put the bun on the plate."

"close the bottom drawer."

"put the coke on the front side of the table."

"Put the pear in the white basket."

"Put the lettuce on the plate."

"put eggplant into yellow basket."

"open the middle drawer."

"close the drawer."

"open the middle drawer."

"stack the green block on the yellow one."

"put the grapes on the basket."

"put the spoon on the towel."

"put coke."

"place apple into top drawer."

"put carrot on plate."

"close the drawer."

"Put the bun on the plate."

"close the bottom drawer."

VLM Pre-training Data Distribution

Comprehensive dataset composition for spatial grounding pre-training

VLM Training Data

General VQA 21.0%

Spatial Grounding QA 79.0%

Detailed Components

VQA 21.0%

Trajectory-QA 22.6%

Point-QA 27.4%

BOX-QA 29.0%

Reference

@article{2025internvlam1,
  title = {InternVLA-M1: Latent Spatial Grounding for Instruction-Following Robotic Manipulation},
  author = {Intern Robotics},
  booktitle = {Arxiv},
  year = {2025},
}