Dataset#
This section introduces how to prepare, organize, and convert datasets into the unified LeRobotDataset format used by InternNav. You'll learn how the data is laid out on disk, what each metadata file contains, and how to convert existing datasets such as VLN-CE into this format.
These steps ensure compatibility with our training and evaluation framework across all supported benchmarks.
Dataset Structure & Format Specification#
InternNav adopts the LeRobotDataset format, which standardizes how videos, instructions, actions, and metadata are organized. Each episode is stored both in .parquet format for structured access and in .mp4 for visualization.
The general directory structure looks like:
<datasets_root>
│
├── <sub_dataset_1>                          # Dataset for Env 1 (e.g., 3dfront_zed)
│   │
│   ├── <scene_dataset_1>                    # Dataset for Scene 1
│   │   │
│   │   ├── <traj_dataset_1>                 # Dataset for Trajectory 1
│   │   │   ├── data                         # Structured episode data in .parquet format
│   │   │   │   └── chunk-000
│   │   │   │       └── episode_000000.parquet
│   │   │   │
│   │   │   ├── meta                         # Metadata and statistical information
│   │   │   │   ├── episodes_stats.jsonl     # Per-episode stats
│   │   │   │   ├── episodes.jsonl           # Per-episode metadata (subtask, instruction, etc.)
│   │   │   │   ├── info.json                # Dataset-level information
│   │   │   │   └── tasks.jsonl              # Task definitions
│   │   │   │
│   │   │   └── videos                       # Observation videos for each episode
│   │   │       │
│   │   │       └── chunk-000                # Videos for episodes 000000 - 000999
│   │   │           ├── observation.images.depth      # Depth images for each trajectory
│   │   │           │   ├── 0.png                     # Depth image for each frame
│   │   │           │   ├── 1.png
│   │   │           │   └── ...
│   │   │           ├── observation.images.rgb        # RGB images for each trajectory
│   │   │           │   ├── 0.jpg                     # RGB image for each frame
│   │   │           │   ├── 1.jpg
│   │   │           │   └── ...
│   │   │           ├── observation.video.depth       # Depth video for each trajectory
│   │   │           │   └── episode_000000.mp4
│   │   │           └── observation.video.trajectory  # RGB video for each trajectory
│   │   │               └── episode_000000.mp4
│   │   │
│   │   ├── <traj_dataset_2>
│   │   │
│   │   └── ...
│   │
│   ├── <scene_dataset_2>
│   │
│   └── ...
│
├── <sub_dataset_2>
│
└── ...
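For orientation, here is a minimal sketch (not part of InternNav; the environment, scene, and trajectory folder names are placeholders) of how a single trajectory laid out as above can be inspected with pandas and the standard library:

import json
from pathlib import Path

import pandas as pd  # reads the .parquet episode data

# Hypothetical trajectory dataset root following the layout above
traj_root = Path("datasets_root/3dfront_zed/scene_0001/traj_0001")

# Structured per-frame data for the first episode
frames = pd.read_parquet(traj_root / "data" / "chunk-000" / "episode_000000.parquet")
print(frames.columns.tolist(), len(frames))

# Dataset-level metadata
info = json.loads((traj_root / "meta" / "info.json").read_text())
print(info["fps"], info["total_episodes"])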
Metadata Files (Inside meta/)#
The meta/ folder contains critical metadata and statistics that power training, evaluation, and debugging.
1. episodes_stats.jsonl
This file stores per-episode, per-feature statistical metadata for a specific dataset. Each line in the JSONL file corresponds to a single episode and contains the following:
- episode_index: The index of the episode within the dataset.
- stats: A nested dictionary mapping each feature (e.g., RGB images, joint positions, robot states, actions) to its statistical summary:
  - min: Minimum value of the feature across all timesteps in the episode.
  - max: Maximum value.
  - mean: Mean value.
  - std: Standard deviation.
  - count: Number of frames in the episode in which the feature was observed.
⚠️ Note: The dimensions of min, max, mean, and std match the dimensionality of the original feature, while count is a scalar.
Example entry:
{
"episode_index": 0,
"stats": {
"observation.images.rgb": {
"min": [[[x]], [[x]], [[x]]],
"max": [[[x]], [[x]], [[x]]],
"mean": [[[x]], [[x]], [[x]]],
"std": [[[x]], [[x]], [[x]]],
"count": [300]
},
"observation.images.depth": {
"min": [x, x, ..., x],
"max": [x, x, ..., x],
"mean": [x, x, ..., x],
"std": [x, x, ..., x],
"count": [300]
},
...
}
}
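As a minimal sketch (the path and feature name are illustrative, not an InternNav API), the per-episode statistics can be read with the standard library alone:

import json
from pathlib import Path

stats_path = Path("meta/episodes_stats.jsonl")  # illustrative location

episode_stats = {}
with stats_path.open() as f:
    for line in f:  # one JSON object per line
        entry = json.loads(line)
        episode_stats[entry["episode_index"]] = entry["stats"]

# e.g., per-channel mean of the RGB observation for episode 0
rgb_stats = episode_stats[0]["observation.images.rgb"]
print(rgb_stats["mean"], rgb_stats["count"])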
2. episodes.jsonl
This file stores per-episode metadata for a specific task split (e.g., task_0, task_1, etc.). Each line represents one episode and includes basic information used by the training framework.
Fields:
- episode_index: A unique identifier for the episode.
- tasks: A list of high-level task descriptions. Each description defines the goal of the episode and should match the corresponding entry in tasks.jsonl.
- length: The total number of frames in this episode.
This file serves as the primary index of available episodes and their corresponding task goals.
Example:
{
"episode_index": 0,
"tasks": [
"Go straight down the hall and up the stairs. When you reach the door to the gym, go left into the gym and stop... "
],
"length": 57
}
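A minimal sketch, assuming only the fields described above and an illustrative path, of building such an index:

import json
from pathlib import Path

episodes = [
    json.loads(line)
    for line in Path("meta/episodes.jsonl").read_text().splitlines()
    if line.strip()
]

total_frames = sum(ep["length"] for ep in episodes)
print(f"{len(episodes)} episodes, {total_frames} frames in total")
print(episodes[0]["tasks"][0])  # instruction text of the first episode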
3. info.json
This file contains metadata shared across the entire task_n split. It summarizes key information about the dataset, including device specifications, recording configuration, and feature schemas.
Fields:
- codebase_version: The version of the data format used (e.g., "v2.1").
- robot_type: The robot hardware platform used for data collection (e.g., "a2d", "unknown").
- total_episodes: Total number of episodes available in this task split.
- total_frames: Total number of frames across all episodes.
- total_tasks: Number of distinct high-level tasks (usually 1 for each task_n).
- total_videos: Total number of video files in this task split.
- total_chunks: Number of data chunks (each chunk typically contains ≤ 1000 episodes).
- chunks_size: The size of each chunk (usually 1000).
- fps: Frames per second used for both video and robot state collection.
- splits: Dataset split definitions (e.g., "train": "0:503").
- data_path: Pattern for loading parquet files, into which episode_chunk and episode_index are formatted.
- video_path: Pattern for loading video files corresponding to each camera.
- features: The structure and semantics of all recorded data streams.
Specifically, the features field specifies the structure and semantics of all recorded data streams. It includes four categories:
- Video / Image Features. Each entry includes:
  - dtype: "video" / "image"
  - shape: Spatial resolution and channels
  - names: Axis names (e.g., ["height", "width", "rgb"])
  - info: Video-specific metadata such as codec, resolution, format, and whether it contains depth or audio.
- Observation Features. These include robot sensor readings (e.g., joint positions, effector poses) and follow this schema:
  - dtype: Data type (e.g., "float32")
  - shape: Tensor shape
  - names: Dictionary describing physical meaning, usually under "motors"
- Action Features. These specify target actuator values or commands and have the same format as observation features.
- Other Features. These include auxiliary info such as timestamps and frame/episode indices. Some of them (e.g., timestamp) are currently unused but included for completeness.
Example snippet (simplified):
{
"codebase_version": "v2.1",
"robot_type": "unknown",
"total_episodes": 1,
"total_frames": 152,
"total_tasks": 1,
"total_videos": 2,
"total_chunks": 1,
"chunks_size": 1000,
"fps": 30,
"features": {
"observation.images.rgb": {
"dtype": "image",
"shape": [
270,
480,
3
],
"names": [
"height",
"width",
"channel"
]
},
"observation.images.depth": {
"dtype": "image",
"shape": [
270,
480,
3
],
"names": [
"height",
"width",
"channel"
]
},
"observation.video.trajectory": {
"dtype": "video",
"shape": [
270,
480,
3
],
"names": [
"height",
"width",
"channel"
],
"info": {
"video.height": 272,
"video.width": 480,
"video.codec": "h264",
"video.pix_fmt": "yuv420p",
"video.is_depth_map": false,
"video.fps": 10,
"video.channels": 3,
"has_audio": false
}
},
"observation.video.depth": {
"dtype": "video",
"shape": [
270,
480,
3
],
"names": [
"height",
"width",
"channel"
],
"info": {
"video.height": 272,
"video.width": 480,
"video.codec": "h264",
"video.pix_fmt": "yuv420p",
"video.is_depth_map": false,
"video.fps": 10,
"video.channels": 3,
"has_audio": false
}
},
"observation.camera_intrinsic": {
"dtype": "float32",
"shape": [
3,
3
],
"names": [
"intrinsic_0_0",
"intrinsic_0_1",
"intrinsic_0_2",
...
]
},
"observation.camera_extrinsic": {
"dtype": "float32",
"shape": [
4,
4
],
"names": [
"extrinsic_0_0",
"extrinsic_0_1",
"extrinsic_0_2",
...
]
},
"observation.path_points": {
"dtype": "float64",
"shape": [
36555,
3
],
"names": [
"x",
"y",
"z"
]
},
"observation.path_colors": {
"dtype": "float32",
"shape": [
36555,
3
],
"names": [
"r",
"g",
"b"
]
},
"action": {
"dtype": "float32",
"shape": [
4,
4
],
"names": [
"action_0_0",
"action_0_1",
"action_0_2",
...
]
},
...
}
}
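The snippet below is a hedged sketch of how the data_path and video_path templates could be used to locate an episode's files. The template strings are read from info.json itself; only the placeholder names (episode_chunk, episode_index, video_key) and the root path are assumptions made for illustration.

import json
from pathlib import Path

root = Path("traj_dataset_1")  # illustrative trajectory dataset root
info = json.loads((root / "meta" / "info.json").read_text())

episode_index = 0
episode_chunk = episode_index // info["chunks_size"]

parquet_file = root / info["data_path"].format(
    episode_chunk=episode_chunk, episode_index=episode_index
)
video_file = root / info["video_path"].format(
    episode_chunk=episode_chunk,
    episode_index=episode_index,
    video_key="observation.video.trajectory",
)
print(parquet_file, video_file, info["fps"])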
4. tasks.jsonl
This file defines the unified high-level task associated with the current dataset.
Fields:
- task_index: The index of the task.
- task: A natural language description of the task scenario, including the environment setup and overall objective.
Example:
{
"task_index": 0,
"task": "Go straight to the hallway and then turn left. Go past the bed. Veer to the right and go through the white door. Stop when you're in the doorway."
}
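As a small illustrative check (paths are placeholders, not an InternNav utility), one can verify that every task string referenced in episodes.jsonl has a matching entry in tasks.jsonl:

import json
from pathlib import Path

meta = Path("meta")  # illustrative location of the meta/ folder

known_tasks = {
    json.loads(line)["task"]
    for line in (meta / "tasks.jsonl").read_text().splitlines()
    if line.strip()
}

for line in (meta / "episodes.jsonl").read_text().splitlines():
    if not line.strip():
        continue
    episode = json.loads(line)
    missing = [t for t in episode["tasks"] if t not in known_tasks]
    if missing:
        print(f"episode {episode['episode_index']}: unknown task(s) {missing}")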
Convert to LeRobotDataset#
InternNav adopts the LeRobot format for all supported datasets. This section explains how to convert popular datasets, such as VLN-CE, into this format using our provided conversion scripts.
VLN-CE → LeRobot#
VLN-CE (Vision-and-Language Navigation in Continuous Environments) is a benchmark for instruction-guided navigation tasks with crowdsourced instructions, realistic environments, and unconstrained agent navigation. Download it here.
Download our source code:
# clone our repo
git clone https://github.com/InternRobotics/InternNav.git

# In trans_Lerobot env
cd /scripts/dataset_converters/vlnce2lerobot

# clone lerobot
git clone -b user/michel-aractingi/2025-05-20-hil-rebase-robots https://github.com/huggingface/lerobot.git
cd lerobot

# Create a virtual environment with Python 3.10 and activate it
conda create -y -n lerobot python=3.10
conda activate lerobot
Install ffmpeg in your environment:

conda install ffmpeg -c conda-forge

# Alternatively, you can install ffmpeg system-wide with apt:
sudo apt update
sudo apt install ffmpeg
Adapt for InternNav and execute the script:
To better accommodate the structure of navigation task datasets:
- Inherit from and extend the LeRobotDataset class by creating a new subclass called NavDataset.
- Inherit from and extend the LeRobotDatasetMetadata class by creating a new subclass called NavDatasetMetadata (see the sketch below).
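A minimal sketch of what these subclasses could look like, assuming the pinned lerobot branch exposes LeRobotDataset and LeRobotDatasetMetadata under lerobot.common.datasets.lerobot_dataset; the actual implementation ships with the conversion script and may differ:

from lerobot.common.datasets.lerobot_dataset import (
    LeRobotDataset,
    LeRobotDatasetMetadata,
)


class NavDatasetMetadata(LeRobotDatasetMetadata):
    """Metadata extended with navigation-specific bookkeeping (hypothetical)."""
    # Navigation-specific fields (e.g., instruction handling) would be added here.


class NavDataset(LeRobotDataset):
    """LeRobotDataset variant adapted to navigation trajectories (hypothetical)."""

    def __getitem__(self, idx):
        item = super().__getitem__(idx)
        # Hypothetical extension point: attach navigation-specific data,
        # such as the language instruction, to each returned sample.
        return item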
Then run the conversion script:

python vlnce2lerobot.py \
    --data_dir /your/path/vln \
    --datasets RxR \
    --start_index 0 \
    --end_index 2000 \
    --repo_name vln_ce_lerobot \
    --num_threads 10

Here --data_dir points to the root folder of the raw VLN-CE data, and --datasets selects which dataset split to process (RxR, R2R, ...).
Collect Demonstration Dataset in GRUtopia#
Support for collecting demos via GRUtopia simulation is coming soon. Stay tuned!