🏃🏻‍♂️ Training and Evaluation#

This document guides you through:

  • Minimal validation training

  • Large-scale finetuning (single-node and multi-node)

  • Customizing training with your own YAML config

  • Available models, datasets, and benchmarks

  • Evaluation and benchmarking

Minimal Validation Training#

We provide several built-in policies, including GR00T-N1, GR00T-N1.5, Pi-0, DP-CLIP, and ACT-CLIP. To quickly verify your setup, you can train the DP-CLIP model on the genmanip-demo dataset (300 demonstrations of the instruction “Move the milk carton to the top of the ceramic bowl”). This requires a single GPU with at least 24 GB of memory:

# Launch a single-process run on one node; the config file specifies which model
# to train on which dataset, along with the hyperparameters.
torchrun --nnodes 1 --nproc_per_node 1 \
   scripts/train/train.py \
   --config run_configs/train/dp_clip_genmanip_v1.yaml

😄 When you run the script, it will prompt you to log in to Weights & Biases (WandB). This integration allows you to monitor your training process in real time via the WandB dashboard.
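
If you prefer not to stop at an interactive prompt (e.g., inside a batch job), you can authenticate beforehand or disable syncing; both are standard WandB options:

# Log in ahead of time so training doesn't block on a prompt
wandb login

# Or keep logs local only, without syncing to the WandB servers
export WANDB_MODE=offline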

The script will also automatically download all required models and datasets from Hugging Face into the Hugging Face cache directory (by default located at ~/.cache/huggingface/). If you’re concerned about storage space or want to customize the cache location, you can set the cache directory using an environment variable:

export HF_HOME=your/custom/cache/path
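
To make the custom cache location persist across shell sessions, you can append it to your shell profile; the path below is just an example:

# Example path; choose a directory on a disk with enough free space
echo 'export HF_HOME=/data/hf_cache' >> ~/.bashrc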

💡 Note: The download may take some time depending on your network speed, so please be patient.

⚠️ Common Issues#

  1. Authentication Required: If you see an error related to missing access rights, make sure you’ve logged into Hugging Face CLI:

    huggingface-cli login
    
  2. 403 Forbidden: Gated Repository Access: If you encounter the following error:

    403 Forbidden: Please enable access to public gated repositories in your fine-grained token settings to view this repository.
    

    Then ensure that your Hugging Face access token has the correct fine-grained permissions enabled for accessing gated repositories. You can verify and adjust these in your Hugging Face account’s Access Tokens settings.
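
You can also confirm on the command line which account (and therefore which token) you are currently authenticated as:

huggingface-cli whoami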

Large-Scale Finetuning#

Single Node (Multi-GPU)#

To finetune a built-in model such as Pi-0 on the GenManip dataset using 8 GPUs, you can use the following srun command:

srun --job-name=pi0_genmanip --gres=gpu:8 --ntasks-per-node=1 \
torchrun \
   --nnodes 1 \
   --nproc_per_node 8 \
   scripts/train/train.py \
   --config run_configs/train/pi0_genmanip_v1.yaml
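
If fewer GPUs are available, scale the GPU request and the process count together. For example, the same run on 4 GPUs (a sketch with the same config; note that this halves the effective global batch size unless you adjust the hyperparameters):

srun --job-name=pi0_genmanip --gres=gpu:4 --ntasks-per-node=1 \
torchrun \
   --nnodes 1 \
   --nproc_per_node 4 \
   scripts/train/train.py \
   --config run_configs/train/pi0_genmanip_v1.yaml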

Multi-Node Multi-GPU (Slurm)#

We also provide Slurm scripts for multi-node training.

Step 1: Create train_pi0_genmanip_slurm.sh:

#!/bin/bash
set -e

export PYTHONPATH="$(pwd):$PYTHONPATH"
source .venv/pi0/bin/activate

# Use the first node in the allocation as the rendezvous master
master_addr=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)

# The NCCL socket interface and InfiniBand HCA are passed as positional arguments
export NCCL_SOCKET_IFNAME=$1
export NCCL_IB_HCA=$2

torchrun \
   --nnodes=$SLURM_NNODES \
   --nproc_per_node=8 \
   --node_rank=$SLURM_PROCID \
   --master_port=29500 --master_addr=$master_addr \
   scripts/train/train.py \
   --config run_configs/train/pi0_genmanip_v1.yaml
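
The script expects the NCCL socket interface and InfiniBand HCA name as its two positional arguments ($1 and $2 above). The device names below are placeholders; check your cluster's actual devices with ip link and ibstat:

# Placeholder device names; replace with your cluster's interface and HCA
bash train_pi0_genmanip_slurm.sh eth0 mlx5_0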

Step 2: Create multinode_submit.slurm:

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=1   # 1 task (the torchrun launcher) per node
#SBATCH --gpus-per-task=8     # 8 GPUs per task
# Forward the NCCL socket interface and IB HCA names expected by the script
srun bash train_pi0_genmanip_slurm.sh <network_interface> <ib_hca>

Step 3: Start training:

sbatch multinode_submit.slurm
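
After submitting, you can check the job state and follow its log; by default, Slurm writes output to slurm-<jobid>.out in the submission directory:

# Check the job state
squeue -u "$USER"

# Follow the training log (replace <jobid> with the ID printed by sbatch)
tail -f slurm-<jobid>.out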

Customizing Training with Your Own YAML Config#

If you would like to train with your own choice of model and dataset, you can simply create a custom YAML configuration file and pass it to the --config argument in the training script.

For example, to train the pre-registered Pi-0 model on the GenManip dataset, a minimal YAML configuration might look like this:

model_type: pi0                        # Name of a pre-registered model
dataset_path: InternRobotics/InternData-GenmanipTest  # Can be a HuggingFace Hub ID or local path
data_config: genmanip_v1              # Pre-registered dataset configuration
base_model_path: lerobot/pi0          # (Optional) Overrides the model checkpoint path; will default to HF if omitted

💡 Notes:

  • model_type: Must match the name of a model that has already been registered within InternManip.

  • dataset_path: Can be a HuggingFace ID (e.g., InternRobotics/InternData-GenmanipTest) or a local directory where the dataset is downloaded.

  • data_config: Refers to a dataset configuration preset (e.g., for preprocessing or loading behavior), also pre-registered in the codebase.

  • base_model_path: This is optional. If the selected model_type is recognized, InternManip will automatically resolve and download the correct checkpoint from HuggingFace. If you’ve already downloaded a model locally or want to use a custom one, specify its path here directly.

By editing or extending this YAML file, you can quickly try different models, datasets, or training setups — all without modifying the training script.
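
For example, using the supported values listed in the next section, switching to GR00T-N1 on the CALVIN ABC data only requires changing the same four fields (a sketch):

model_type: gr00t_n1
dataset_path: InternRobotics/InternData-Calvin_ABC
data_config: calvin_abc
base_model_path: nvidia/GR00T-N1-2B   # optional; resolved automatically if omitted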

Available Models and Datasets#

When creating your own YAML config file for training or evaluation, you can directly refer to the following officially supported values:

  • Use values from the ${model_type} and ${base_model_path} columns below to populate the corresponding fields in your YAML.

  • Similarly, values from the ${data_config} and ${dataset_path} columns can be used to specify the dataset configuration and loading path.

The following are the supported models along with their HuggingFace IDs:

| ${model_type} | ${base_model_path}   |
|---------------|----------------------|
| pi0           | lerobot/pi0          |
| pi0fast       | pi0fast_base         |
| gr00t_n1      | nvidia/GR00T-N1-2B   |
| gr00t_n1_5    | nvidia/GR00T-N1.5-3B |
| dp_clip       | None                 |
| act_clip      | None                 |

Below are the datasets officially integrated into InternManip:

| ${data_config} | ${dataset_path}                                 |
|----------------|-------------------------------------------------|
| genmanip_v1    | InternRobotics/InternData-GenmanipTest          |
| calvin_abc     | InternRobotics/InternData-Calvin_ABC            |
| google_robot   | InternRobotics/InternData-fractal20220817_data  |
| bridgedata_v2  | InternRobotics/InternData-BridgeV2              |
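
If you prefer to fetch a dataset ahead of time (e.g., for a compute node without internet access) and point dataset_path at the local copy, you can use the Hugging Face CLI. The target directory below is an example, and this assumes the data is hosted as a dataset repo:

# Download to a local directory (example path), then set
# dataset_path: ./data/genmanip_test in your YAML config
huggingface-cli download InternRobotics/InternData-GenmanipTest \
   --repo-type dataset --local-dir ./data/genmanip_test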

Evaluation and Benchmarking (WIP)#

By default, model inference runs in the main loop, in the same process as the environment. You can evaluate Pi-0 on the GenManip benchmark in a single process using the following command:

python scripts/eval/start_evaluator.py \
   --config scripts/eval/config/pi0_on_genmanip.py

The terminal prints SR (Success Rate) information for each episode and task:

{
    "success_episodes": [
        {"task_name": "tasks/...", "episode_name": "010", "episode_sr": 1.0, ...}
    ],
    "failure_episodes": [],
    "success_rate": 1.0
}
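
If you redirect this summary into a file, standard JSON tooling can extract individual fields; the file path below is hypothetical:

# Hypothetical path; print the overall success rate and the number of successes
jq '.success_rate, (.success_episodes | length)' eval_results/summary.json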

You can view the images generated during evaluation in the eval_results directory.

You can modify the evaluation script and its config according to your resource availability and requirements.

Available Benchmarks#

The following benchmarks are currently available for evaluation:

InternManip offers implementations of multiple manipulation policy models—GR00T-N1, GR00T-N1.5, Pi-0, DP-CLIP, and ACT-CLIP—as well as curated datasets including GenManip, Simpler-Env, and CALVIN, all organized in the standardized LeRobot format.

The available ${MODEL}, ${DATASET}, and ${BENCHMARK} combinations are summarized in the following tables; scores and model-weight links are still being filled in (WIP):

CALVIN (ABC-D) Benchmark#

| Model      | Dataset/Benchmark | Score (Main Metric) | Model Weights |
|------------|-------------------|---------------------|---------------|
| gr00t_n1   | calvin_abcd       | TBD                 | TBD           |
| gr00t_n1_5 | calvin_abcd       | TBD                 | TBD           |
| pi0        | calvin_abcd       | TBD                 | TBD           |
| dp_clip    | calvin_abcd       | TBD                 | TBD           |
| act_clip   | calvin_abcd       | TBD                 | TBD           |

Simpler-Env Benchmark#

| Model      | Dataset/Benchmark | Success Rate | Model Weights |
|------------|-------------------|--------------|---------------|
| gr00t_n1   | google_robot      | TBD          | TBD           |
| gr00t_n1_5 | google_robot      | TBD          | TBD           |
| pi0        | google_robot      | TBD          | TBD           |
| dp_clip    | google_robot      | TBD          | TBD           |
| act_clip   | google_robot      | TBD          | TBD           |
| gr00t_n1   | bridgedata_v2     | TBD          | TBD           |
| gr00t_n1_5 | bridgedata_v2     | TBD          | TBD           |
| pi0        | bridgedata_v2     | TBD          | TBD           |
| dp_clip    | bridgedata_v2     | TBD          | TBD           |
| act_clip   | bridgedata_v2     | TBD          | TBD           |

Genmanip Benchmark#

| Model      | Dataset/Benchmark | Success Rate | Model Weights |
|------------|-------------------|--------------|---------------|
| gr00t_n1   | genmanip_v1       | TBD          | TBD           |
| gr00t_n1_5 | genmanip_v1       | TBD          | TBD           |
| pi0        | genmanip_v1       | TBD          | TBD           |
| dp_clip    | genmanip_v1       | TBD          | TBD           |
| act_clip   | genmanip_v1       | TBD          | TBD           |

What’s Next?#

Now that you’ve completed the training and evaluation process, you may want to incorporate your own dataset, model, or benchmark. To do so, please refer to the corresponding customization guides.

Once you’ve set them up, you can follow the same command structure used above; just replace the relevant configuration entries (e.g., --config) with your custom definitions.