# Evaluation

This document provides a guide to the evaluation workflow and to custom evaluator development.

## Table of Contents

1. [Overview](#overview)
2. [Evaluation Workflow](#evaluation-workflow)
3. [API Specifications](#api-specifications)
4. [Configuration Guide](#configuration-guide)
5. [Distributed Evaluation](#distributed-evaluation)
6. [Examples and Best Practices](#examples-and-best-practices)
7. [Troubleshooting](#troubleshooting)

## Overview

The InterManip Evaluator is a modular robotic manipulation task evaluation system that supports multiple evaluation environments and agent types. The system adopts a plugin-based architecture, making it easy to extend with new evaluation environments and agents.

### Key Features

- **Multi-Environment Support**: Supports SimplerEnv, Calvin, GenManip, and other evaluation environments
- **Distributed Evaluation**: Ray-based distributed evaluation framework supporting multi-GPU parallel evaluation
- **Client-Server Mode**: Supports running agents in a separate server process to avoid software and hardware environment conflicts between the agent and the simulator
- **Flexible Configuration**: Pydantic-based configuration system with type checking and validation
- **Extensible Architecture**: Easy to add new evaluators and environments

### System Architecture

```
scripts/eval/start_evaluator.py (Entry Point)
        ↓
Evaluator (Base Class, managing the EnvWrapper and BaseAgent)
        ↓
Specific Evaluators (SimplerEvaluator, CalvinEvaluator, GenManipEvaluator)
        ↓
Specific Environments (SimplerEnv, CalvinEnv, GenmanipEnv)
  + Specific Agents (Pi0Agent, Gr00t_N1_Agent, DPAgent, ...)
```

### Supported Benchmark Codebases

- **SimplerEnv**: `internmanip/benchmarks/SimplerEnv`
- **Calvin**: `internmanip/benchmarks/calvin`
- **GenManip**: `internmanip/benchmarks/genmanip`

## Evaluation Workflow

### 1. Starting Evaluation

The evaluation system is launched through `start_evaluator.py`:

```bash
# Basic evaluation
python scripts/eval/start_evaluator.py --config scripts/eval/configs/seer_on_calvin.py

# Distributed evaluation
python scripts/eval/start_evaluator.py --config scripts/eval/configs/seer_on_calvin.py --distributed

# Client-server mode
python scripts/eval/start_evaluator.py --config scripts/eval/configs/seer_on_calvin.py --server

# Combined modes
python scripts/eval/start_evaluator.py --config scripts/eval/configs/seer_on_calvin.py --distributed --server
```

Each evaluator follows this workflow:

1. **Initialization**: Load the configuration, then initialize the environment and agent
2. **Data Loading**: Load evaluation tasks from configuration files or data sources
3. **Task Execution**: Execute evaluation tasks sequentially or in parallel
4. **Result Collection**: Collect the execution result of each task
5. **Statistical Analysis**: Calculate the success rate or other metrics
6. **Result Persistence**: Save results to the specified directory with timestamps

## API Specifications

### 1. Evaluator Base Class Interface

#### Required Methods

```python
class Evaluator:
    @classmethod
    def _get_all_episodes_setting_data(cls, episodes_config_path) -> List[Any]:
        """
        Load all evaluation task configurations

        Args:
            episodes_config_path: Path to evaluation task configuration file(s)

        Returns:
            List[Any]: List of evaluation task configurations
        """
        raise NotImplementedError

    def eval(self):
        """
        Main evaluation method
        """
        raise NotImplementedError
```

#### Optional Methods

```python
def __init__(self, config: EvalCfg):
    """
    Initialize evaluator

    Args:
        config: Evaluation configuration
    """
    super().__init__(config)
    # Custom initialization logic

@classmethod
def _update_results(cls, result):
    """
    Update evaluation results

    Args:
        result: Single task evaluation result
    """
    raise NotImplementedError

@classmethod
def _print_and_save_results(cls):
    """
    Print and save evaluation results
    """
    raise NotImplementedError

def _eval_single_episode(self, episode_data: Dict[str, Any]) -> Dict[str, Any]:
    """
    Execute a single evaluation task

    Args:
        episode_data: Configuration data for a single task

    Returns:
        Dict[str, Any]: Task execution result
    """
    raise NotImplementedError
```
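#### Example: A Custom Evaluator (Sketch)

The sketch below shows how a custom evaluator might subclass `Evaluator` and implement the methods above. It is illustrative only: the class name, the JSON episode format, the `self.config`/`self.env`/`self.agent` attributes and their `reset`/`step` signatures, and the import path of `Evaluator` are assumptions rather than the actual codebase API.

```python
import json
from typing import Any, Dict, List

from internmanip.configs import EvalCfg
# Assumed import path for the base class; adjust to the actual module layout.
from internmanip.evaluator import Evaluator


class MyBenchmarkEvaluator(Evaluator):
    """Hypothetical evaluator used only to illustrate the interface."""

    def __init__(self, config: EvalCfg):
        super().__init__(config)
        self.results: List[Dict[str, Any]] = []

    @classmethod
    def _get_all_episodes_setting_data(cls, episodes_config_path) -> List[Any]:
        # Assume the episode list is stored as a single JSON array.
        with open(episodes_config_path) as f:
            return json.load(f)

    def _eval_single_episode(self, episode_data: Dict[str, Any]) -> Dict[str, Any]:
        # Roll out one episode: reset the environment, query the agent for an
        # action, step the environment, and record whether the task succeeded.
        # `self.env` / `self.agent` and their signatures are assumed here.
        obs = self.env.reset(episode_data)
        done, success = False, False
        while not done:
            action = self.agent.step(obs)
            obs, done, info = self.env.step(action)
            success = bool(info.get("success", False))
        return {"episode": episode_data.get("name"), "success": success}

    def eval(self):
        # Where the episode list lives is evaluator-specific; here we reuse the
        # environment config path from `EnvCfg` purely for illustration.
        episodes = self._get_all_episodes_setting_data(self.config.env.config_path)
        for episode_data in episodes:
            self.results.append(self._eval_single_episode(episode_data))
        success_rate = sum(r["success"] for r in self.results) / max(len(self.results), 1)
        print(f"Success rate: {success_rate:.2%}")
```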
### 2. Configuration Interface Specification

#### EvalCfg Configuration Structure

```python
class EvalCfg(BaseModel):
    eval_type: str                                    # Evaluator type
    agent: AgentCfg                                   # Agent configuration
    env: EnvCfg                                       # Environment configuration
    logging_dir: Optional[str] = None                 # Logging directory
    distributed_cfg: Optional[DistributedCfg] = None  # Distributed configuration
```

## Configuration Guide

### 1. Basic Configuration

```python
eval_cfg = EvalCfg(
    eval_type="calvin",                  # Evaluator type
    agent=AgentCfg(...),                 # Agent configuration
    env=EnvCfg(...),                     # Environment configuration
    logging_dir="logs/eval/calvin",      # Logging directory
)
```

### 2. Agent Configuration

```python
agent=AgentCfg(
    agent_type="seer",                   # Agent type
    model_name_or_path="path/to/model",  # Model path
    model_kwargs={                       # Model parameters
        "device_id": 0,
        "sequence_length": 10,
        "hidden_dim": 512,
        "num_layers": 6,
        "learning_rate": 1e-4,
        "batch_size": 32,
        # ...
    },
    server_cfg=ServerCfg(                # Server configuration (optional)
        server_host="localhost",
        server_port=5000,
        timeout=30,
    ),
)
```

### 3. Environment Configuration

```python
env=EnvCfg(
    env_type="calvin",                   # Environment type
    device_id=0,                         # Device ID
    config_path="path/to/config.yaml",   # Environment config file
    env_settings=CalvinEnvSettings(      # Environment-specific settings
        num_sequences=100,
    )
)
```

### 4. Distributed Configuration

```python
distributed_cfg=DistributedCfg(
    num_workers=4,                       # Number of worker processes
    ray_head_ip="10.150.91.18",          # Ray cluster head node IP
    include_dashboard=True,              # Include dashboard
    dashboard_port=8265,                 # Dashboard port
)
```
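### 5. Putting It Together

A config file passed to `start_evaluator.py` via `--config` combines the pieces above into a single `eval_cfg` object. The sketch below is illustrative: the file name and all field values are placeholders, and it assumes the entry point reads a module-level `eval_cfg` as in the examples in this guide.

```python
# scripts/eval/configs/my_seer_on_calvin.py (hypothetical file; all values are placeholders)
from internmanip.configs import *

eval_cfg = EvalCfg(
    eval_type="calvin",
    agent=AgentCfg(
        agent_type="seer",
        model_name_or_path="path/to/model",
        model_kwargs={
            "device_id": 0,
            "sequence_length": 10,
        },
    ),
    env=EnvCfg(
        env_type="calvin",
        device_id=0,
        config_path="path/to/config.yaml",
        env_settings=CalvinEnvSettings(
            num_sequences=100,
        ),
    ),
    logging_dir="logs/eval/calvin",
)
```

Such a file would then be used as `python scripts/eval/start_evaluator.py --config scripts/eval/configs/my_seer_on_calvin.py`.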
## [Optional] Distributed Evaluation

### 1. Ray Cluster Setup

Distributed evaluation is based on the Ray framework and supports the following deployment modes:

- **Single-Machine Multi-Process**: `ray_head_ip="auto"`
- **Multi-Machine Cluster**: `ray_head_ip="10.150.91.18"` (the head node IP of your cluster)

### 2. Workflow [WIP]

## [Optional] Use Multiple Evaluators to Speed Up

If you have sufficient resources, we also provide multi-process parallelization to speed up the evaluation. This feature is built on the [Ray](https://docs.ray.io/en/latest/ray-overview/getting-started.html) distributed framework, so it requires starting a Ray cluster as the distributed backend.

- Start up a Ray cluster on your machine(s):

  ```bash
  # Run `ray start --head` on your main machine.
  # By default it starts a Ray cluster that includes all available CPUs/GPUs/memory.
  ray start --head [--include-dashboard=true] [--num-gpus=?]
  ```

- [Optional] Scale up your Ray cluster:

  ```bash
  # When the Ray cluster is ready, it prints the cluster head IP address on the terminal.
  # If you have more than one machine and want to scale up the cluster, run the following command on the other machines:
  ray start --address='{your_ray_cluster_head_ip}:6379'
  ```

- Customize your [`DistributedCfg`](../../internmanip/configs/evaluator/distributed_cfg.py):

  ```python
  # Example configuration; `DistributedCfg` is defined inside the `EvalCfg`
  from internmanip.configs import *

  eval_cfg = EvalCfg(
      eval_type="calvin",
      agent=AgentCfg(
          ...
      ),
      env=EnvCfg(
          ...
      ),
      distributed_cfg=DistributedCfg(
          num_workers=4,                 # Usually equal to the number of GPUs
          ray_head_ip="10.150.91.18",    # or "auto" if you are running on the Ray head node machine
          include_dashboard=True,        # Default
          dashboard_port=8265,           # Default
      )
  )
  ```

- Start the evaluator pipeline with distributed evaluation enabled:

  ```bash
  python scripts/eval/start_evaluator.py --config scripts/eval/configs/seer_on_calvin.py --distributed
  ```

- [Optional] View the task progress and resource monitor:

  The Ray framework provides a dashboard for viewing task scheduling progress and resource usage. Access it at `{ray_head_ip}:8265`.

## [Optional] Enable the Client-Server Evaluation Mode

In some cases, you may want to enable the client-server evaluation mode to isolate the conda environments of the agent and the simulator.

- Customize the [`ServerCfg`](../../internmanip/configs/agent/server_cfg.py) first:

  ```python
  # Example configuration; `ServerCfg` is defined inside the `AgentCfg` within the `EvalCfg`
  from internmanip.configs import *

  eval_cfg = EvalCfg(
      eval_type="calvin",
      agent=AgentCfg(
          ...,
          server_cfg=ServerCfg(
              server_host="localhost",
              server_port=5000,
          ),
      ),
      env=EnvCfg(
          ...
      ),
  )
  ```

- Start a server process [NOTE: this step may be skipped in future versions of the code]:

  ```bash
  python scripts/eval/start_policy_server.py
  ```

- Start the evaluator pipeline with the client-server evaluation mode enabled:

  ```bash
  python scripts/eval/start_evaluator.py --config scripts/eval/configs/seer_on_calvin.py --server
  ```
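For illustration, a typical client-server run uses two terminals with separate conda environments. The environment names `agent_env` and `sim_env` below are placeholders for whatever environments you created for the agent and the simulator:

```bash
# Terminal 1: conda environment of the agent/policy (placeholder name: agent_env)
conda activate agent_env
python scripts/eval/start_policy_server.py

# Terminal 2: conda environment of the simulator (placeholder name: sim_env)
conda activate sim_env
python scripts/eval/start_evaluator.py --config scripts/eval/configs/seer_on_calvin.py --server
```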