Dataset#
This section introduces how to prepare, organize, and convert datasets into the unified LeRobotDataset format used by InternManip. You'll learn:
- the data structure and format specification of the LeRobotDataset layout,
- how to convert popular datasets (e.g., CALVIN and SimplerEnv) into this format,
- how demonstration data will be collected in InternUtopia.
These steps ensure compatibility with our training and evaluation framework across all supported benchmarks.
Data Structure & Format Specification#
Dataset Layout#
InternManip adopts the LeRobotDataset format, which standardizes how videos, instructions, actions, and metadata are organized. Each episode is stored both in .parquet format for structured access and in .mp4 for visualization, and episodes are chunked in groups of 1,000 for efficient loading.
Directory Overview:
<datasets_root>
│
├── <sub_dataset_x>              # Each sub-dataset (e.g., GenManip-V1)
│   ├── data                     # Episodes in .parquet format, split into chunks (1,000 per folder)
│   │   ├── chunk-000
│   │   │   ├── episode_000000.parquet
│   │   │   └── ...
│   │   └── chunk-00n
│   │
│   ├── meta                     # Metadata and statistics
│   │   ├── episodes.jsonl
│   │   ├── info.json
│   │   ├── tasks.jsonl
│   │   ├── modality.json
│   │   ├── episodes_stats.jsonl (optional)
│   │   └── stats.json
│   │
│   └── videos                   # Multi-view videos aligned with episodes
│       ├── chunk-000
│       │   ├── observation.images.head
│       │   └── ...
│       └── chunk-00n
│
└── ...                          # More sub-datasets
Metadata Files (Inside meta/)#
The meta/ folder contains critical metadata and statistics that power training, evaluation, and debugging.
1. episodes.jsonl
This file stores per-episode metadata for a specific task split (e.g., task_0, task_1, etc.). Each line represents one episode and includes basic information used by the training framework.
Fields:
- episode_index: A unique identifier for the episode.
- tasks: A list of high-level task descriptions. Each description defines the goal of the episode and should match the corresponding entry in tasks.jsonl.
- length: The total number of frames in this episode.
This file serves as the primary index of available episodes and their corresponding task goals.
Example:
{
"episode_index": 0,
"tasks": [
"Sort personal care products | A robot stands in front of a table, on which there is a 3-grid material frame and a target frame, with the target frame on the left and the 3-grid material frame on the right."
],
"length": 1240
}
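As a reference for how this index might be consumed, here is a minimal sketch (Python standard library only) that loads episodes.jsonl and aggregates a few fields; the dataset path is a placeholder you would replace with your own.

import json
from pathlib import Path

# Placeholder path: point this at the meta/ folder of one sub-dataset.
meta_dir = Path("/path/to/datasets_root/sub_dataset_x/meta")

# episodes.jsonl is JSON Lines: one JSON object per line, one line per episode.
episodes = []
with open(meta_dir / "episodes.jsonl") as f:
    for line in f:
        if line.strip():
            episodes.append(json.loads(line))

total_frames = sum(ep["length"] for ep in episodes)
print(f"{len(episodes)} episodes, {total_frames} frames in total")
print("First task:", episodes[0]["tasks"][0])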
2. info.json
This file contains metadata shared across the entire task_n split. It summarizes key information about the dataset, including device specifications, recording configuration, and feature schemas.
Fields:
- codebase_version: The version of the data format used (e.g., "v2.1").
- robot_type: The robot hardware platform used for data collection (e.g., "a2d").
- total_episodes: Total number of episodes available in this task split.
- total_frames: Total number of frames across all episodes.
- total_tasks: Number of distinct high-level tasks (usually 1 for each task_n).
- total_videos: Total number of video files in this task split.
- total_chunks: Number of data chunks (each chunk typically contains ≤ 1,000 episodes).
- chunks_size: The size of each chunk (usually 1000).
- fps: Frames per second used for both video and robot state collection.
- splits: Dataset split definitions (e.g., "train": "0:503").
- data_path: Pattern for loading parquet files, where the episode_chunk and episode_index placeholders are filled in.
- video_path: Pattern for loading the video files corresponding to each camera.
- features: The structure and semantics of all recorded data streams.
Specifically, the features field includes four categories:
- Video / Image Features. Each entry includes:
  - dtype: "video"
  - shape: Spatial resolution and channels
  - names: Axis names (e.g., ["height", "width", "rgb"])
  - info: Video-specific metadata such as codec, resolution, format, and whether it contains depth or audio.
- Observation Features. These include robot sensor readings (e.g., joint positions, effector poses) and follow this schema:
  - dtype: Data type (e.g., "float32")
  - shape: Tensor shape
  - names: Dictionary describing the physical meaning, usually under "motors"
- Action Features. These specify target actuator values or commands and share the same format as observation features.
- Other Features. Auxiliary information such as timestamps and frame/episode indices. Some of them (e.g., timestamp) are currently unused but included for completeness.
Example Snippet (simplified):
{
"codebase_version": "v2.1",
"robot_type": "a2d",
"total_episodes": 503,
"total_frames": 564350,
"fps": 30,
"features": {
"observation.images.head": {
"dtype": "video",
"shape": [480, 640, 3],
"names": ["height", "width", "rgb"],
"info": {
"video.fps": 30.0,
"video.codec": "av1"
}
},
"observation.states.joint.position": {
"dtype": "float32",
"shape": [14],
"names": {
"motors": ["left_arm_0", ..., "right_arm_6"]
}
},
"actions.end.position": {
"dtype": "float32",
"shape": [2, 3],
"names": {
"motors": ["left_xyz", "right_xyz"]
}
},
"timestamp": {
"dtype": "float32",
"shape": [1],
"names": null
}
}
}
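To illustrate how data_path and video_path are meant to be used, the following sketch resolves the on-disk location of one episode from info.json. The placeholder names assumed here (episode_chunk, episode_index, video_key) follow the common LeRobot v2.1 patterns; check the actual pattern strings stored in your own info.json.

import json
from pathlib import Path

root = Path("/path/to/datasets_root/sub_dataset_x")  # placeholder dataset root
info = json.loads((root / "meta" / "info.json").read_text())

episode_index = 0
# Episodes are grouped into chunks of `chunks_size` (usually 1000).
episode_chunk = episode_index // info["chunks_size"]

# data_path / video_path are format strings, e.g. (illustrative)
# "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet"
parquet_path = root / info["data_path"].format(
    episode_chunk=episode_chunk, episode_index=episode_index
)
video_path = root / info["video_path"].format(
    episode_chunk=episode_chunk,
    video_key="observation.images.head",
    episode_index=episode_index,
)
print(parquet_path)
print(video_path)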
3. tasks.jsonl
This file defines the unified high-level task associated with the current dataset.
Fields:
- task_index: The index of the task.
- task: A natural language description of the task scenario, including the environment setup and overall objective.
Example:
{
"task_index": 0,
"task": "Sort personal care products | A robot stands in front of a table, on which there is a 3-grid material frame and a target frame, with the target frame on the left and the 3-grid material frame on the right."
}
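Below is a minimal sketch of how this file could be joined with episodes.jsonl, building a task_index → description lookup and checking that every episode's task description has a matching entry; the path is a placeholder.

import json
from pathlib import Path

meta_dir = Path("/path/to/datasets_root/sub_dataset_x/meta")  # placeholder

# task_index -> natural language description
tasks = {}
with open(meta_dir / "tasks.jsonl") as f:
    for line in f:
        if line.strip():
            record = json.loads(line)
            tasks[record["task_index"]] = record["task"]

# Every task description referenced by an episode should exist in tasks.jsonl.
known = set(tasks.values())
with open(meta_dir / "episodes.jsonl") as f:
    for line in f:
        if not line.strip():
            continue
        episode = json.loads(line)
        for description in episode["tasks"]:
            assert description in known, f"Unknown task in episode {episode['episode_index']}"
print(f"{len(tasks)} task(s) defined, all episode descriptions matched.")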
4. modality.json
The modality.json file defines the mapping between keys in the raw dataset and the keys used by the model or training pipeline.
This mapping allows the training and evaluation code to work with consistent and unified feature names across different datasets, even if the raw data schemas differ.
The top-level keys of modality.json are fixed and must include:
- state: Describes all observed robot states, such as joint positions, gripper status, or end-effector pose.
- action: Describes control commands or target values (e.g., delta joint values, delta end-effector positions).
- video: Maps video streams or camera views.
- annotation (optional): Contains human annotations or labels, such as task descriptions or validity flags.
Each sub-key under state or action defines:
- start (int, optional): The start index of this feature in its concatenated vector.
- end (int, optional): The end index (exclusive).
- original_key (string, optional): The corresponding key in the raw dataset. If omitted, no mapping is performed.
These indices are useful when packing multiple features (e.g., joint positions and gripper state) into a single tensor.
Example:
{
  "state": {
    "joints": {
      "start": 0,
      "end": 7
    }
  },
  "action": {
    "joints": {
      "start": 0,
      "end": 7
    }
  },
  "video": {
    "ego_view": {
      "original_key": "video.ego_view"
    }
  },
  "annotation": {
    "human.action.task_description": {}
  }
}
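As an illustration of how the start/end indices are typically consumed, the sketch below slices named features out of a concatenated state vector using the example mapping above; the dummy vector and file path are assumptions for demonstration only.

import json
from pathlib import Path

import numpy as np

modality_path = Path("/path/to/datasets_root/sub_dataset_x/meta/modality.json")  # placeholder
modality = json.loads(modality_path.read_text())

# Dummy concatenated state vector for a single frame (7 joint values, matching the example).
state_vector = np.arange(7, dtype=np.float32)

# Slice out each named state feature via its start/end indices.
state_features = {
    name: state_vector[spec["start"]:spec["end"]]
    for name, spec in modality["state"].items()
    if "start" in spec and "end" in spec
}
print(state_features["joints"])  # -> the 7 joint values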
5. episodes_stats.jsonl (Optional)
This file stores per-episode, per-feature statistical metadata for a specific dataset. Each line in the JSONL file corresponds to a single episode and contains the following:
- episode_index: The index of the episode within the dataset.
- stats: A nested dictionary mapping each feature (e.g., RGB images, joint positions, robot states, actions) to its statistical summary:
  - min: Minimum value of the feature across all timesteps in the episode.
  - max: Maximum value.
  - mean: Mean value.
  - std: Standard deviation.
  - count: Number of frames in the episode in which the feature was observed.
⚠️ Note: The dimensions of min, max, mean, and std match the dimensionality of the original feature, while count is a scalar.
Example entry:
{
"episode_index": 0,
"stats": {
"observation.images.head": {
"min": [[[x]], [[x]], [[x]]],
"max": [[[x]], [[x]], [[x]]],
"mean": [[[x]], [[x]], [[x]]],
"std": [[[x]], [[x]], [[x]]],
"count": [300]
},
"observation.states.joint.position": {
"min": [x, x, ..., x],
"max": [x, x, ..., x],
"mean": [x, x, ..., x],
"std": [x, x, ..., x],
"count": [300]
},
...
}
}
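For reference, here is a rough sketch of how one line of this file could be produced from a single episode's parquet file with pandas and numpy; the file path, and the assumption that the column stores one fixed-length vector per frame, are illustrative.

import json

import numpy as np
import pandas as pd

# Placeholder path to one episode's parquet file.
df = pd.read_parquet("/path/to/datasets_root/sub_dataset_x/data/chunk-000/episode_000000.parquet")

# Assumes this column holds one fixed-length vector per frame.
values = np.stack(list(df["observation.states.joint.position"]))  # shape: (num_frames, dim)

stats_line = {
    "episode_index": 0,
    "stats": {
        "observation.states.joint.position": {
            "min": values.min(axis=0).tolist(),
            "max": values.max(axis=0).tolist(),
            "mean": values.mean(axis=0).tolist(),
            "std": values.std(axis=0).tolist(),
            "count": [len(values)],
        }
    },
}
print(json.dumps(stats_line))  # one line of episodes_stats.jsonl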
6. stats.json
This file stores global statistical information about all features in a dataset. These statistics are primarily used for:
- Normalization: e.g., standardizing features to zero mean and unit variance.
- Outlier Detection: using min/max or percentile values (q01, q99) to identify anomalies.
Example:
{
"actions.end.position": {
"mean": [ ... ],
"std": [ ... ],
"min": [ ... ],
"max": [ ... ],
"q01": [ ... ],
"q99": [ ... ]
},
...
}
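To show how these statistics are typically applied, here is a small sketch that clips an action with the q01/q99 percentiles and then standardizes it with the global mean/std; the dummy action and file path are placeholders.

import json
from pathlib import Path

import numpy as np

stats = json.loads(Path("/path/to/datasets_root/sub_dataset_x/meta/stats.json").read_text())  # placeholder
feat = stats["actions.end.position"]

action = np.zeros((2, 3), dtype=np.float32)  # dummy action with the shape declared in info.json

mean = np.asarray(feat["mean"], dtype=np.float32)
std = np.asarray(feat["std"], dtype=np.float32)
q01 = np.asarray(feat["q01"], dtype=np.float32)
q99 = np.asarray(feat["q99"], dtype=np.float32)

clipped = np.clip(action, q01, q99)            # outlier suppression via 1st/99th percentiles
normalized = (clipped - mean) / (std + 1e-8)   # standardize to ~zero mean, unit variance
print(normalized)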
Convert to LeRobotDataset#
InternManip adopts the LeRobot format for all supported datasets. This section explains how to convert popular datasets (CALVIN and SimplerEnv) into this format using our provided conversion scripts.
You'll learn:
- How to convert CALVIN using the RLDS → LeRobot pipeline
- How to quickly adapt SimplerEnv datasets
- How to adjust the .parquet structure and metadata
CALVIN → RLDS → LeRobot#
CALVIN (Composing Actions from Language and Vision) is a benchmark for long-horizon language-conditioned tasks. Download it here. To reduce possible I/O congestion and other issues during conversion, the data must first be converted to RLDS format and then from RLDS to LeRobot format.
Step 1: Convert CALVIN to RLDS#
RLDS is an episodic data format for sequential decision-making; see the RLDS repo. This step follows this code repository.
First, create a conda environment using the provided environment_convert.yml:
cd grmanipulation/scripts/data_convert
conda env create -f environment_convert.yml
conda activate trans_Lerobot
# Clone converter
git clone https://github.com/mees/calvin_rlds_converter.git
Follow the conversion steps in that repo to produce RLDS-format data.
Step 2: RLDS → LeRobot#
This step follows this code repository.
Download our source code:
# Clone our repo
git clone https://gitlab.pjlab.org.cn/EmbodiedAI/SimPlatform/grmanipulation.git

# In the trans_Lerobot env
cd /grmanipulation/scripts/data_convert/calvin2lerobot

# Clone lerobot
git clone -b user/michel-aractingi/2025-05-20-hil-rebase-robots https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e .
Install ffmpeg in your environment:

conda install ffmpeg -c conda-forge

# Additionally, you can also install ffmpeg using apt:
sudo apt update
sudo apt install ffmpeg
Modify lerobot/lerobot/common/datasets/video_utils.py at line 245:

from collections import OrderedDict
import subprocess
from pathlib import Path

FFMPEG_BIN = "/bin/ffmpeg"  # modify this to your FFmpeg installation path


def encode_video_frames(
    imgs_dir: Path | str,
    video_path: Path | str,
    fps: int,
    vcodec: str = "libx264",  # default: libx264
    pix_fmt: str = "yuv420p",
    g: int | None = 2,
    crf: int | None = 30,
    fast_decode: int = 0,
    log_level: str | None = "error",
    overwrite: bool = False,
) -> None:
    """More info on ffmpeg arguments tuning on `benchmark/video/README.md`"""
    video_path = Path(video_path)
    imgs_dir = Path(imgs_dir)
    video_path.parent.mkdir(parents=True, exist_ok=True)

    ffmpeg_args = OrderedDict(
        [
            ("-f", "image2"),
            ("-r", str(fps)),
            ("-i", str(imgs_dir / "frame_%06d.png")),
            ("-vcodec", vcodec),
            ("-pix_fmt", pix_fmt),
        ]
    )

    if g is not None:
        ffmpeg_args["-g"] = str(g)

    if crf is not None:
        ffmpeg_args["-crf"] = str(crf)

    if fast_decode:
        key = "-tune"
        value = "fastdecode"
        ffmpeg_args[key] = value

    if log_level is not None:
        ffmpeg_args["-loglevel"] = str(log_level)

    ffmpeg_args = [item for pair in ffmpeg_args.items() for item in pair]
    if overwrite:
        ffmpeg_args.append("-y")

    # ffmpeg path
    ffmpeg_cmd = [FFMPEG_BIN] + ffmpeg_args + [str(video_path)]
    subprocess.run(ffmpeg_cmd, check=True, stdin=subprocess.DEVNULL)

    if not video_path.exists():
        raise OSError(
            f"Video encoding did not work. File not found: {video_path}. "
            f"Try running the command manually to debug: `{' '.join(ffmpeg_cmd)}`"
        )
Modify the paths in convert.sh:

cd ..
cd /rlds2lerobot/openx2lerobot

# Modify paths
python openx_rlds.py \
    --raw-dir /path/to/droid/1.0.0 \
    --local-dir /path/to/LEROBOT_DATASET \
    --repo-id your_hf_id \
    --use-videos
Execute the script:
bash convert.sh
Step 3: Adapt for InternManip#
After completing the two steps above, you will have a standard CALVIN LeRobot dataset. To use it with our training framework, however, some keys and file names must be renamed.
Rename the files in the videos folder:

cd /../../calvin2lerobot/src
python trans_lerobot_train_name.py
Modify the data structure in the .parquet files:

python trans_lerobot_train_data.py
This script can help you calculate the global dataset statistics (mean, std, min, max, quantiles):

python trans_lerobot_train_name.py
The final meta/ folder should include:

meta/
├── episodes.jsonl
├── info.json
├── tasks.jsonl
├── modality.json
└── stats.json
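As an optional sanity check, here is a small sketch (with a placeholder path for the converted dataset) that verifies the required meta files are all present:

from pathlib import Path

meta_dir = Path("/path/to/LEROBOT_DATASET/meta")  # placeholder: the converted dataset's meta folder

required = ["episodes.jsonl", "info.json", "tasks.jsonl", "modality.json", "stats.json"]
missing = [name for name in required if not (meta_dir / name).exists()]

if missing:
    raise FileNotFoundError(f"Missing meta files: {missing}")
print("meta/ folder is complete.")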
SimplerEnv → LeRobot#
Step 1: Download pre-converted datasets#
We recommend downloading the pre-converted datasets directly; only light processing is then needed to adapt them to our training framework.
# bridge_orig_lerobot
https://huggingface.co/datasets/IPEC-COMMUNITY/bridge_orig_lerobot
# fractal20220817_data_lerobot
https://huggingface.co/datasets/IPEC-COMMUNITY/fractal20220817_data_lerobot
Step 2: Adapt to InternManip#
Due to the high decoding overhead of AV1-encoded videos, which may cause issues such as NCCL communication blocking, it is recommended to convert them to H.264 using the provided scripts:
# Configure FFmpeg (as described above)
conda activate trans_lerobot
cd /grmanipulation/scripts/data_convert/simplerEnv2lerobot/utils

# For the dataset bridge_orig_lerobot, run:
python trans_widowx_video.py
# To adapt to our training framework, the video needs to be renamed. However, if you want to
# follow the default naming convention used by the LeRobot community, you can choose not to do so.
python trans_widowx_video_name.py

# For the dataset fractal20220817_data_lerobot, run:
python trans_google_video.py
# To adapt to our training framework, the video needs to be renamed. However, if you want to
# follow the default naming convention used by the LeRobot community, you can choose not to do so.
python trans_google_video_name.py
Modify the data structure in the .parquet files:

# For the dataset bridge_orig_lerobot, run:
python trans_widowx_data.py

# For the dataset fractal20220817_data_lerobot, run:
python trans_google_data.py
The final meta/ folder usage:
- stats.json, tasks.jsonl, episodes.jsonl: ready to use
- info.json, modality.json: use the provided ones in ./simplerEnv2lerobot/meta (see the sketch below)
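A minimal sketch of that last step, copying the provided info.json and modality.json into the downloaded dataset's meta/ folder; the destination path is a placeholder.

import shutil
from pathlib import Path

provided_meta = Path("./simplerEnv2lerobot/meta")         # templates shipped with this repo
dataset_meta = Path("/path/to/bridge_orig_lerobot/meta")  # placeholder: your downloaded dataset

for name in ("info.json", "modality.json"):
    shutil.copy(provided_meta / name, dataset_meta / name)
    print(f"Copied {name} -> {dataset_meta / name}")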
Collect Demonstration Dataset in InternUtopia#
Support for collecting demos via InternUtopia simulation is coming soon. Stay tuned!