# Dataset

This section introduces how to **prepare, organize, and convert datasets** into the unified [LeRobotDataset](https://github.com/huggingface/lerobot) format used by InternNav. You’ll learn:

- 📁 [How to structure the dataset](#dataset-structure--format-specification)
- 🔁 [How to convert popular datasets like VLN-CE](#convert-to-lerobotdataset)
- 🎮 [How to collect your own demonstrations in GRUtopia](#collect-demonstration-dataset-in-grutopia)

These steps ensure compatibility with our training and evaluation framework across all supported benchmarks.

## Dataset Structure & Format Specification

InternNav adopts the [`LeRobotDataset`](https://github.com/huggingface/lerobot) format, which standardizes how videos, instructions, actions, and metadata are organized. Each episode is stored in both `.parquet` format for structured access and `.mp4` format for visualization. The general directory structure looks like:

```bash
.
├── <env_name_1>                                  # Dataset for Env 1 (e.g., 3dfront_zed)
│   ├── <scene_name_1>                            # Dataset for Scene 1
│   │   ├── <trajectory_name_1>                   # Dataset for Trajectory 1
│   │   │   ├── data                              # Structured episode data in .parquet format
│   │   │   │   └── chunk-000
│   │   │   │       └── episode_000000.parquet
│   │   │   ├── meta                              # Metadata and statistical information
│   │   │   │   ├── episodes_stats.jsonl          # Per-episode stats
│   │   │   │   ├── episodes.jsonl                # Per-episode metadata (subtask, instruction, etc.)
│   │   │   │   ├── info.json                     # Dataset-level information
│   │   │   │   └── tasks.jsonl                   # Task definitions
│   │   │   └── videos                            # Observation videos for each episode
│   │   │       └── chunk-000                     # Videos for episodes 000000 - 000999
│   │   │           ├── observation.images.depth        # Depth images for each trajectory
│   │   │           │   ├── 0.png                        # Depth image for each frame
│   │   │           │   ├── 1.png
│   │   │           │   └── ...
│   │   │           ├── observation.images.rgb          # RGB images for each trajectory
│   │   │           │   ├── 0.jpg                        # RGB image for each frame
│   │   │           │   ├── 1.jpg
│   │   │           │   └── ...
│   │   │           ├── observation.video.depth         # Depth video for each trajectory
│   │   │           │   └── episode_000000.mp4
│   │   │           └── observation.video.trajectory    # RGB video for each trajectory
│   │   │               └── episode_000000.mp4
│   │   ├── <trajectory_name_2>
│   │   │   └── ...
│   │   ├── <trajectory_name_n>
│   │   └── ...
│   ├── <scene_name_2>
│   └── ...
├── <env_name_2>
└── ...
```

### Metadata Files (Inside `meta/`)

The `meta/` folder contains critical metadata and statistics that power training, evaluation, and debugging.

**1. episodes_stats.jsonl**

This file stores per-episode, per-feature statistical metadata for a specific dataset. Each line in the JSONL file corresponds to a single episode and contains the following:

- `episode_index`: The index of the episode within the dataset.
- `stats`: A nested dictionary mapping each feature (e.g., RGB images, joint positions, robot states, actions) to its statistical summary:
  - `min`: Minimum value for the feature across all timesteps in the episode.
  - `max`: Maximum value.
  - `mean`: Mean value.
  - `std`: Standard deviation.
  - `count`: Number of frames in the episode in which the feature was observed.

> ⚠️ **Note**: The dimensions of `min`, `max`, `mean`, and `std` match the dimensionality of the original feature, while `count` is a scalar.

**Example entry:**

```json
{
  "episode_index": 0,
  "stats": {
    "observation.images.rgb": {
      "min": [[[x]], [[x]], [[x]]],
      "max": [[[x]], [[x]], [[x]]],
      "mean": [[[x]], [[x]], [[x]]],
      "std": [[[x]], [[x]], [[x]]],
      "count": [300]
    },
    "observation.images.depth": {
      "min": [x, x, ..., x],
      "max": [x, x, ..., x],
      "mean": [x, x, ..., x],
      "std": [x, x, ..., x],
      "count": [300]
    },
    ...
  }
}
```
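For orientation, here is a minimal sketch of how these per-episode statistics can be read back for inspection. The dataset path below is an assumption for illustration; substitute the `meta/` directory of your own converted trajectory.

```python
import json
from pathlib import Path

# Hypothetical path to one converted trajectory's metadata folder.
meta_dir = Path("datasets/vln_ce_lerobot/meta")

# episodes_stats.jsonl is a JSON Lines file: one JSON object per episode.
with open(meta_dir / "episodes_stats.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        episode = entry["episode_index"]
        for feature, stats in entry["stats"].items():
            # min/max/mean/std follow the feature's shape; count is the frame count.
            print(f"episode {episode} | {feature}: count={stats['count']}")
```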
---

**2. episodes.jsonl**

This file stores per-episode metadata for a specific task split (e.g., `task_0`, `task_1`, etc.). Each line represents one episode and includes the basic information used by the training framework.

**Fields:**

- `episode_index`: A unique identifier for the episode.
- `tasks`: A list of high-level task descriptions. Each description defines the goal of the episode and should match the corresponding entry in `tasks.jsonl`.
- `length`: The total number of frames in this episode.

This file serves as the primary index of available episodes and their corresponding task goals.

**Example:**

```json
{
  "episode_index": 0,
  "tasks": [
    "Go straight down the hall and up the stairs. When you reach the door to the gym, go left into the gym and stop... "
  ],
  "length": 57
}
```

---

**3. info.json**

This file contains metadata shared across the entire `task_n` split. It summarizes key information about the dataset, including device specifications, recording configuration, and feature schemas.

**Fields:**

- `codebase_version`: The version of the data format used (e.g., `"v2.1"`).
- `robot_type`: The robot hardware platform used for data collection (e.g., `"a2d"`, `"unknown"`).
- `total_episodes`: Total number of episodes available in this task split.
- `total_frames`: Total number of frames across all episodes.
- `total_tasks`: Number of distinct high-level tasks (usually 1 for each `task_n`).
- `total_videos`: Total number of video files in this task split.
- `total_chunks`: Number of data chunks (each chunk typically contains ≤1000 episodes).
- `chunks_size`: The size of each chunk (usually 1000).
- `fps`: Frames per second used for both video and robot state collection.
- `splits`: Dataset split definitions (e.g., `"train": "0:503"`).
- `data_path`: Path template for the parquet files, parameterized by `episode_chunk` and `episode_index`.
- `video_path`: Path template for the video files of each camera.
- `features`: The structure and semantics of all recorded data streams.

The `features` field describes every recorded data stream and covers four categories:

1. **Video / Image Features**
   Each entry includes:
   - `dtype`: `"video"` / `"image"`
   - `shape`: Spatial resolution and channels
   - `names`: Axis names (e.g., `["height", "width", "rgb"]`)
   - `info`: Video-specific metadata such as codec, resolution, format, and whether it contains depth or audio.

2. **Observation Features**
   These cover robot sensor readings (e.g., joint positions, effector poses) and follow this schema:
   - `dtype`: Data type (e.g., `"float32"`)
   - `shape`: Tensor shape
   - `names`: Dictionary describing the physical meaning, usually under `"motors"`

3. **Action Features**
   These specify target actuator values or commands and share the same format as observation features.

4. **Other Features**
   These include auxiliary information such as timestamps and frame/episode indices. Some of them (e.g., `timestamp`) are currently unused but included for completeness.
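The sketch below shows one way to inspect this schema programmatically before training; the dataset location is assumed for illustration and should be replaced with your own `meta/` directory.

```python
import json
from pathlib import Path

# Hypothetical path to one converted trajectory's metadata folder.
meta_dir = Path("datasets/vln_ce_lerobot/meta")

with open(meta_dir / "info.json") as f:
    info = json.load(f)

print(f"episodes={info['total_episodes']} frames={info['total_frames']} fps={info['fps']}")

# Each entry of "features" declares the dtype, shape, and axis names of one data stream.
for name, spec in info["features"].items():
    print(f"{name}: dtype={spec['dtype']}, shape={spec.get('shape')}")
```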
**Example Snippet (simplified):**

```json
{
  "codebase_version": "v2.1",
  "robot_type": "unknown",
  "total_episodes": 1,
  "total_frames": 152,
  "total_tasks": 1,
  "total_videos": 2,
  "total_chunks": 1,
  "chunks_size": 1000,
  "fps": 30,
  "features": {
    "observation.images.rgb": {
      "dtype": "image",
      "shape": [270, 480, 3],
      "names": ["height", "width", "channel"]
    },
    "observation.images.depth": {
      "dtype": "image",
      "shape": [270, 480, 3],
      "names": ["height", "width", "channel"]
    },
    "observation.video.trajectory": {
      "dtype": "video",
      "shape": [270, 480, 3],
      "names": ["height", "width", "channel"],
      "info": {
        "video.height": 272,
        "video.width": 480,
        "video.codec": "h264",
        "video.pix_fmt": "yuv420p",
        "video.is_depth_map": false,
        "video.fps": 10,
        "video.channels": 3,
        "has_audio": false
      }
    },
    "observation.video.depth": {
      "dtype": "video",
      "shape": [270, 480, 3],
      "names": ["height", "width", "channel"],
      "info": {
        "video.height": 272,
        "video.width": 480,
        "video.codec": "h264",
        "video.pix_fmt": "yuv420p",
        "video.is_depth_map": false,
        "video.fps": 10,
        "video.channels": 3,
        "has_audio": false
      }
    },
    "observation.camera_intrinsic": {
      "dtype": "float32",
      "shape": [3, 3],
      "names": ["intrinsic_0_0", "intrinsic_0_1", "intrinsic_0_2", ...]
    },
    "observation.camera_extrinsic": {
      "dtype": "float32",
      "shape": [4, 4],
      "names": ["extrinsic_0_0", "extrinsic_0_1", "extrinsic_0_2", ...]
    },
    "observation.path_points": {
      "dtype": "float64",
      "shape": [36555, 3],
      "names": ["x", "y", "z"]
    },
    "observation.path_colors": {
      "dtype": "float32",
      "shape": [36555, 3],
      "names": ["r", "g", "b"]
    },
    "action": {
      "dtype": "float32",
      "shape": [4, 4],
      "names": ["action_0_0", "action_0_1", "action_0_2", ...]
    },
    ...
  }
}
```

---

**4. tasks.jsonl**

This file defines the unified high-level task associated with the current dataset.

**Fields:**

- `task_index`: The index of the task.
- `task`: A natural language description of the task scenario, including the environment setup and overall objective.

**Example:**

```json
{
  "task_index": 0,
  "task": "Go straight to the hallway and then turn left. Go past the bed. Veer to the right and go through the white door. Stop when you're in the doorway."
}
```

## Convert to LeRobotDataset

InternNav adopts the [LeRobot](https://github.com/huggingface/lerobot) format for all supported datasets. This section explains how to convert popular datasets such as **VLN-CE** into this format using our provided [conversion scripts](#).

### VLN-CE → LeRobot

> VLN-CE (**V**ision and **L**anguage **N**avigation in **C**ontinuous **E**nvironments) is a benchmark for instruction-guided navigation tasks with crowdsourced instructions, realistic environments, and unconstrained agent navigation. [Download it here](https://github.com/jacobkrantz/VLN-CE).

1. **Download** our source code:

   ```shell
   # clone our repo
   git clone https://github.com/InternRobotics/InternNav.git
   # In trans_Lerobot env
   cd InternNav/scripts/dataset_converters/vlnce2lerobot

   # clone lerobot
   git clone -b user/michel-aractingi/2025-05-20-hil-rebase-robots https://github.com/huggingface/lerobot.git
   cd lerobot

   # Create a virtual environment with Python 3.10 and activate it
   conda create -y -n lerobot python=3.10
   conda activate lerobot
   ```

2. **Install** `ffmpeg` in your environment:

   ```shell
   conda install ffmpeg -c conda-forge

   # Alternatively, you can install ffmpeg system-wide with apt:
   sudo apt update
   sudo apt install ffmpeg
   ```
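   To confirm the installation is visible inside the activated environment, a quick optional check (not part of the original setup steps) is:

   ```shell
   # Should print the ffmpeg version and build configuration
   ffmpeg -version
   ```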
3. **Adapt** for InternNav and **Execute** the script:

   > To better accommodate the structure of navigation task datasets, we:
   > 1. Inherit from and extend the `LeRobotDataset` class by creating a new subclass called `NavDataset`.
   > 2. Inherit from and extend the `LeRobotDatasetMetadata` class by creating a new subclass called `NavDatasetMetadata`.

   ```shell
   # --data_dir:     root folder of the source VLN-CE data
   # --datasets:     which dataset split to process (RxR, R2R, …)
   # --start_index / --end_index: range of episode indices to convert
   # --repo_name:    name of the output LeRobot dataset
   # --num_threads:  number of worker threads
   python vlnce2lerobot.py \
       --data_dir /your/path/vln \
       --datasets RxR \
       --start_index 0 \
       --end_index 2000 \
       --repo_name vln_ce_lerobot \
       --num_threads 10
   ```

## Collect Demonstration Dataset in GRUtopia

Support for collecting demonstrations via GRUtopia simulation is coming soon. Stay tuned!