πŸ“¦ Add a New Dataset#

This section explains how to add and register a custom dataset in the InternManip framework. The process involves two main steps: converting the dataset to the expected format, and registering it in code.

Dataset Structure#

All datasets must follow the LeRobotDataset Format to ensure compatibility with the data loaders and training pipelines. The expected structure is:

```
<your_dataset_root>  # Root directory of your dataset
β”‚
β”œβ”€β”€ data  # Structured episode data in .parquet format
β”‚   β”‚
β”‚   β”œβ”€β”€ chunk-000  # Episodes 000000 - 000999
β”‚   β”‚   β”œβ”€β”€ episode_000000.parquet
β”‚   β”‚   β”œβ”€β”€ episode_000001.parquet
β”‚   β”‚   └── ...
β”‚   β”‚
β”‚   β”œβ”€β”€ chunk-001  # Episodes 001000 - 001999
β”‚   β”‚   └── ...
β”‚   β”‚
β”‚   β”œβ”€β”€ ...
β”‚   β”‚
β”‚   └── chunk-00n  # Follows the same convention (1,000 episodes per chunk)
β”‚       └── ...
β”‚
β”œβ”€β”€ meta  # Metadata and statistical information
β”‚   β”œβ”€β”€ episodes.jsonl         # Per-episode metadata (length, subtask, etc.)
β”‚   β”œβ”€β”€ info.json              # Dataset-level information
β”‚   β”œβ”€β”€ tasks.jsonl            # Task definitions
β”‚   β”œβ”€β”€ modality.json          # Key dimensions and mapping information for each modality
β”‚   └── stats.json             # Global dataset statistics (mean, std, min, max, quantiles)
β”‚
└── videos  # Multi-view videos for each episode
    β”‚
    β”œβ”€β”€ chunk-000  # Videos for episodes 000000 - 000999
    β”‚   β”œβ”€β”€ observation.images.head       # Head (main front-view) camera
    β”‚   β”‚   β”œβ”€β”€ episode_000000.mp4
    β”‚   β”‚   └── ...
    β”‚   β”œβ”€β”€ observation.images.hand_left  # Left hand camera
    β”‚   └── observation.images.hand_right # Right hand camera
    β”‚
    β”œβ”€β”€ chunk-001  # Videos for episodes 001000 - 001999
    β”‚
    β”œβ”€β”€ ...
    β”‚
    └── chunk-00n  # Follows the same naming and structure
```

πŸ’‘ Note: For more detailed tutorials, please refer to the Dataset section.

This separation of raw data, video files, and metadata makes it easier to standardize transformations and modality handling across different datasets.
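The chunking convention shown above (1,000 episodes per chunk, zero-padded six-digit episode ids) can be captured in a small helper. This is an illustrative sketch, not part of the framework; the function name `episode_parquet_path` is hypothetical:

```python
def episode_parquet_path(episode_index: int, episodes_per_chunk: int = 1000) -> str:
    """Map an episode index to its relative .parquet path under the dataset root."""
    chunk = episode_index // episodes_per_chunk  # chunk-000 holds episodes 0-999, etc.
    return f"data/chunk-{chunk:03d}/episode_{episode_index:06d}.parquet"
```

For example, episode 1234 lands in `chunk-001` because chunks roll over every 1,000 episodes.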

Implementation Steps#

Register a Dataset Class#

Create a new dataset class under internmanip/datasets/, inheriting from LeRobotDataset:

```python
from internmanip.datasets import LeRobotDataset

class CustomDataset(LeRobotDataset):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def load_data(self):
        # Implement custom data loading logic here
        pass
```

This class defines how to read your dataset’s raw files and convert them into a standardized format for training.
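As a rough standalone sketch of what such loading logic might start with (assuming the directory layout described above; `discover_episodes` is a hypothetical helper, not framework API), the episode files can be enumerated chunk by chunk:

```python
from pathlib import Path

def discover_episodes(dataset_root: str) -> list[Path]:
    """Collect every episode .parquet file, sorted by chunk and episode id."""
    data_dir = Path(dataset_root) / "data"
    # chunk-*/episode_*.parquet matches the layout described above;
    # sorting the paths yields chunk order, then episode order within each chunk
    return sorted(data_dir.glob("chunk-*/episode_*.parquet"))
```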

Define a Data Configuration#

Each dataset needs a data configuration class that specifies modalities, keys, and transformations. Create a new configuration file under internmanip/configs/data_configs/. Here’s a minimal example:

```python
# Imports of BaseDataConfig, ModalityConfig, and the transform classes
# from internmanip are omitted here for brevity.
class CustomDataConfig(BaseDataConfig):
    """Data configuration for the custom dataset."""
    video_keys = ["video.rgb"]
    state_keys = ["state.pos"]
    action_keys = ["action.delta_pos"]
    language_keys = ["annotation.instruction"]

    # Temporal indices
    observation_indices = [0]         # Current timestep for observations
    action_indices = list(range(16))  # Future timesteps for actions (0-15)

    def modality_config(self) -> dict[str, ModalityConfig]:
        """Define modality configurations."""
        return {
            "video": ModalityConfig(self.observation_indices, self.video_keys),
            "state": ModalityConfig(self.observation_indices, self.state_keys),
            "action": ModalityConfig(self.action_indices, self.action_keys),
            "language": ModalityConfig(self.observation_indices, self.language_keys),
        }

    def transform(self):
        """Define preprocessing pipelines."""
        return [
            # Video preprocessing
            VideoToTensor(apply_to=self.video_keys),
            VideoResize(apply_to=self.video_keys, height=224, width=224),

            # State preprocessing
            StateActionToTensor(apply_to=self.state_keys),
            StateActionTransform(
                apply_to=self.state_keys,
                normalization_modes={"state.pos": "mean_std"},
            ),

            # Action preprocessing
            StateActionToTensor(apply_to=self.action_keys),
            StateActionTransform(
                apply_to=self.action_keys,
                normalization_modes={"action.delta_pos": "mean_std"},
            ),

            # Concatenate modalities
            ConcatTransform(
                video_concat_order=self.video_keys,
                state_concat_order=self.state_keys,
                action_concat_order=self.action_keys,
            ),
        ]
```
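The temporal indices in this config are offsets relative to the sampled timestep: observations take the current step (offset 0), while actions take a 16-step future window. A standalone sketch of this windowing (`slice_window` is illustrative, not the framework's actual sampler):

```python
def slice_window(sequence: list, t: int, indices: list[int]) -> list:
    """Gather elements of a per-step sequence at offsets `indices` relative to step t."""
    return [sequence[t + i] for i in indices]

observation_indices = [0]          # current timestep only
action_indices = list(range(16))   # timesteps t .. t+15
```

At timestep t = 5 of a 100-step trajectory, this selects element 5 for the observation and elements 5 through 20 for the 16-step action chunk.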

Register Your Config#

Finally, register your custom config by adding it to DATA_CONFIG_MAP.

```python
DATA_CONFIG_MAP = {
    ...,
    "custom": CustomDataConfig(),
}
```

πŸ’‘ Tip: Adjust the key names (video_keys, state_keys, etc.) and normalization_modes to match your dataset. For multi-view video or multi-joint actions, simply add more keys and update the transforms accordingly.
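With the "mean_std" mode, each state/action dimension is typically standardized using precomputed statistics such as those stored in meta/stats.json. A minimal standalone sketch of the idea (the actual StateActionTransform implementation may differ, e.g. in epsilon handling):

```python
def normalize_mean_std(values: list[float], mean: float, std: float,
                       eps: float = 1e-8) -> list[float]:
    """Standardize values to roughly zero mean / unit variance using precomputed stats."""
    # eps guards against division by zero for near-constant dimensions
    return [(v - mean) / (std + eps) for v in values]
```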

This config sets up how to load and process different modalities, and ensures compatibility with the training framework.

What’s Next?#

After registration, you can use your dataset by specifying --dataset_path <path> and --data_config custom in your training command, or by setting the equivalent fields in the training YAML file.