Evaluation#
This document provides a guide to the evaluation workflow and custom development.
Overview#
The InterManip Evaluator is a modular robotic manipulation task evaluation system that supports multiple evaluation environments and agent types. The system adopts a plugin-based architecture, making it easy to extend with new evaluation environments and agents.
Key Features#
Multi-Environment Support: Supports SimplerEnv, Calvin, GenManip, and other evaluation environments
Distributed Evaluation: Ray-based distributed evaluation framework supporting multi-GPU parallel evaluation
Client-Server Mode: Supports running agents in server mode to avoid conflicts between the software and hardware environments of agents and simulators
Flexible Configuration: Pydantic-based configuration system with type checking and validation
Extensible Architecture: Easy to add new evaluators and environments
System Architecture#
scripts/eval/start_evaluator.py (Entry Point)
↓
Evaluator (Base Class, managing the EnvWrapper and BaseAgent)
↓
Specific Evaluators (SimplerEvaluator, CalvinEvaluator, GenManipEvaluator)
↓
Specific Environments (SimplerEnv, CalvinEnv, GenmanipEnv) + Specific Agents (Pi0Agent, Gr00t_N1_Agent, DPAgent, ...)
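Conceptually, the entry point resolves a concrete evaluator class from the eval_type field of the configuration, and that evaluator then owns its environment wrapper and agent. The sketch below illustrates this dispatch with a hypothetical registration helper; it is not the actual code in scripts/eval/start_evaluator.py.
# Hypothetical sketch of the dispatch implied by the diagram above; the
# registry helper and `main` are illustrative, not the real entry-point code.
from typing import Callable, Dict, Type

from internmanip.configs import EvalCfg

_EVALUATORS: Dict[str, Type] = {}  # e.g. "calvin" -> CalvinEvaluator

def register_evaluator(name: str) -> Callable[[Type], Type]:
    """Hypothetical decorator each concrete evaluator could use to register itself."""
    def _register(cls: Type) -> Type:
        _EVALUATORS[name] = cls
        return cls
    return _register

def main(eval_cfg: EvalCfg) -> None:
    evaluator_cls = _EVALUATORS[eval_cfg.eval_type]  # select the concrete evaluator
    evaluator = evaluator_cls(eval_cfg)              # builds the EnvWrapper and agent from the config
    evaluator.eval()                                 # runs the evaluation workflow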
Supported Benchmark Source Codebases#
SimplerEnv: internmanip/benchmarks/SimplerEnv
Calvin: internmanip/benchmarks/calvin
GenManip: internmanip/benchmarks/genmanip
Evaluation Workflow#
1. Starting Evaluation#
The evaluation system is launched through start_evaluator.py:
# Basic evaluation
python scripts/eval/start_evaluator.py --config scripts/eval/configs/seer_on_calvin.py
# Distributed evaluation
python scripts/eval/start_evaluator.py --config scripts/eval/configs/seer_on_calvin.py --distributed
# Client-server mode
python scripts/eval/start_evaluator.py --config scripts/eval/configs/seer_on_calvin.py --server
# Combined modes
python scripts/eval/start_evaluator.py --config scripts/eval/configs/seer_on_calvin.py --distributed --server
Each evaluator follows this workflow (see the sketch after the list):
Initialization: Load configuration, initialize environment and agent
Data Loading: Load evaluation tasks from configuration files or data sources
Task Execution: Execute evaluation tasks sequentially or in parallel
Result Collection: Collect execution results for each task
Statistical Analysis: Calculate success rates and other metrics
Result Persistence: Save results to the specified directory with timestamps
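As a rough correspondence, these steps map onto the interface methods documented in the next section. The outline below is purely illustrative and not the actual Evaluator code; the function name and arguments are assumptions.
# Illustrative outline of one evaluation run; only the underscore-prefixed
# methods are part of the documented interface, the rest is an assumption.
def run_evaluation(evaluator, episodes_config_path):
    # 1. Initialization already happened in the evaluator's __init__ (env + agent)
    # 2. Data Loading
    episodes = evaluator._get_all_episodes_setting_data(episodes_config_path)
    # 3. Task Execution and 4. Result Collection
    for episode_data in episodes:
        result = evaluator._eval_single_episode(episode_data)
        evaluator._update_results(result)
    # 5. Statistical Analysis and 6. Result Persistence
    evaluator._print_and_save_results()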
API Specifications#
1. Evaluator Base Class Interface#
Required Methods#
class Evaluator:
    @classmethod
    def _get_all_episodes_setting_data(cls, episodes_config_path) -> List[Any]:
        """
        Load all evaluation task configurations

        Args:
            episodes_config_path: Path to evaluation task configuration file(s)

        Returns:
            List[Any]: List of evaluation task configurations
        """
        raise NotImplementedError

    def eval(self):
        """
        Main evaluation method
        """
        raise NotImplementedError
Optional Methods#
def __init__(self, config: EvalCfg):
    """
    Initialize evaluator

    Args:
        config: Evaluation configuration
    """
    super().__init__(config)
    # Custom initialization logic

@classmethod
def _update_results(cls, result):
    """
    Update evaluation results

    Args:
        result: Single task evaluation result
    """
    raise NotImplementedError

@classmethod
def _print_and_save_results(cls):
    """
    Print and save evaluation results
    """
    raise NotImplementedError

def _eval_single_episode(self, episode_data: Dict[str, Any]) -> Dict[str, Any]:
    """
    Execute a single evaluation task

    Args:
        episode_data: Configuration data for a single task

    Returns:
        Dict[str, Any]: Task execution result
    """
    raise NotImplementedError
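Putting the pieces together, a new evaluator can be sketched as follows. Only the method names and signatures come from the interface above; the import path, the episode file format, and the result bookkeeping are assumptions made for illustration.
# Sketch of a custom evaluator against the interface above. The base-class
# import path, JSON episode format, and result handling are assumptions.
import json
from typing import Any, Dict, List

from internmanip.configs import EvalCfg
from internmanip.evaluator import Evaluator  # assumed import path


class MyEvaluator(Evaluator):
    results: List[Dict[str, Any]] = []

    def __init__(self, config: EvalCfg):
        super().__init__(config)
        # Where the episode list lives is evaluator-specific; reusing the
        # environment config path here is an assumption for the sketch.
        self.episodes_config_path = config.env.config_path

    @classmethod
    def _get_all_episodes_setting_data(cls, episodes_config_path) -> List[Any]:
        with open(episodes_config_path) as f:
            return json.load(f)  # assumed JSON list of episode configurations

    def _eval_single_episode(self, episode_data: Dict[str, Any]) -> Dict[str, Any]:
        # Roll out the agent in the environment for one episode (omitted here)
        return {"episode": episode_data, "success": False}

    @classmethod
    def _update_results(cls, result):
        cls.results.append(result)

    @classmethod
    def _print_and_save_results(cls):
        # Saving to the logging directory is omitted in this sketch
        success_rate = sum(r["success"] for r in cls.results) / max(len(cls.results), 1)
        print(f"Success rate: {success_rate:.2%}")

    def eval(self):
        for episode_data in self._get_all_episodes_setting_data(self.episodes_config_path):
            self._update_results(self._eval_single_episode(episode_data))
        self._print_and_save_results()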
2. Configuration Interface Specification#
EvalCfg Configuration Structure#
class EvalCfg(BaseModel):
    eval_type: str                                    # Evaluator type
    agent: AgentCfg                                   # Agent configuration
    env: EnvCfg                                       # Environment configuration
    logging_dir: Optional[str] = None                 # Logging directory
    distributed_cfg: Optional[DistributedCfg] = None  # Distributed configuration
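Because EvalCfg is a Pydantic model, missing or wrongly typed fields are reported when the configuration object is constructed. A minimal sketch (assuming EvalCfg is importable from internmanip.configs, as in the later examples):
# Minimal validation sketch for the model above.
from pydantic import ValidationError

from internmanip.configs import EvalCfg

try:
    EvalCfg(eval_type="calvin")  # required `agent` and `env` fields are missing
except ValidationError as err:
    print(err)  # Pydantic reports the missing/invalid fields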
Configuration Guide#
1. Basic Configuration#
eval_cfg = EvalCfg(
    eval_type="calvin",              # Evaluator type
    agent=AgentCfg(...),             # Agent configuration
    env=EnvCfg(...),                 # Environment configuration
    logging_dir="logs/eval/calvin",  # Logging directory
)
2. Agent Configuration#
agent=AgentCfg(
    agent_type="seer",                   # Agent type
    model_name_or_path="path/to/model",  # Model path
    model_kwargs={                       # Model parameters
        "device_id": 0,
        "sequence_length": 10,
        "hidden_dim": 512,
        "num_layers": 6,
        "learning_rate": 1e-4,
        "batch_size": 32,
        # ...
    },
    server_cfg=ServerCfg(                # Server configuration (optional)
        server_host="localhost",
        server_port=5000,
        timeout=30,
    ),
)
3. Environment Configuration#
env=EnvCfg(
    env_type="calvin",                  # Environment type
    device_id=0,                        # Device ID
    config_path="path/to/config.yaml",  # Environment config file
    env_settings=CalvinEnvSettings(     # Environment-specific settings
        num_sequences=100,
    )
)
4. Distributed Configuration#
distributed_cfg=DistributedCfg(
    num_workers=4,               # Number of worker processes
    ray_head_ip="10.150.91.18",  # Ray cluster head node IP
    include_dashboard=True,      # Include dashboard
    dashboard_port=8265,         # Dashboard port
)
[Optional] Distributed Evaluation#
1. Ray Cluster Setup#
Distributed evaluation is based on the Ray framework, supporting the following deployment modes:
Single Machine Multi-Process:
ray_head_ip="auto"
Multi-Machine Cluster:
ray_head_ip="10.150.91.18"
2. Workflow#
[WIP]
[Optional] Use Multiple Evaluators to Speed Up#
If you have sufficient resources, we also provide multi-process parallelization to speed up evaluation. This feature is built on the Ray distributed framework, so it requires starting a Ray cluster as the distributed backend.
Start up a Ray cluster on your machine(s):
# Call `ray start --head` on your main machine. It will start a Ray cluster which includes all available CPUs/GPUs/memory by default.
ray start --head [--include-dashboard=true] [--num-gpus=?]
[Optional] Scale up your Ray cluster:
# When the Ray cluster is ready, it will print the Ray cluster head IP address on the terminal.
# If you have more than one machine and want to scale up your Ray cluster, execute the following command on the other machines:
ray start --address='{your_ray_cluster_head_ip}:6379'
Customize your DistributedCfg:
# Example configuration; `DistributedCfg` should be defined in the `EvalCfg`
from internmanip.configs import *

eval_cfg = EvalCfg(
    eval_type="calvin",
    agent=AgentCfg(
        ...
    ),
    env=EnvCfg(
        ...
    ),
    distributed_cfg=DistributedCfg(
        num_workers=4,               # Usually equal to the number of GPUs
        ray_head_ip="10.150.91.18",  # or "auto" if you are located at the Ray head node machine
        include_dashboard=True,      # By default
        dashboard_port=8265,         # By default
    )
)
Enable distributed evaluation mode before starting the evaluator pipeline:
python scripts/eval/start_evaluator.py --config scripts/eval/configs/seer_on_calvin.py --distributed
[Optional] View the task progress or resource monitor:
The Ray framework provides a dashboard for viewing task scheduling progress and resource usage. Access it at {ray_head_ip}:8265.
[Optional] Enable the Client-Server Evaluation Mode#
In some cases, we need to enable the client-server evaluation mode to isolate the conda environments of the agents from those of the simulators.
Customize the ServerCfg first:
# Example configuration, `ServerCfg` should be defined inside the `AgentCfg` within the `EvalCfg`
from internmanip.configs import *

eval_cfg = EvalCfg(
    eval_type="calvin",
    agent=AgentCfg(
        ...,
        server_cfg=ServerCfg(
            server_host="localhost",
            server_port=5000,
        ),
    ),
    env=EnvCfg(
        ...
    ),
)
Start a server process [NOTE: this step may be skipped in later versions of the codebase]:
python scripts/eval/start_policy_server.py
Enable client-server evaluation mode before starting the evaluator pipeline:
python scripts/eval/start_evaluator.py --config scripts/eval/configs/seer_on_calvin.py --server