Challenge
The EBench Challenge supports online submission of benchmark results. Follow the steps below to prepare a valid run and submit it to the leaderboard service.
Baseline and Getting Started
Before submitting online, make sure you can run the benchmark locally:
- Set up the server and client environments in Environment Setup.
- Prepare the required benchmark assets from Asset & Dataset.
- Run a local benchmark first with Run Evaluation.
- If you use your own policy, follow Integrate Your Own Model.
You should verify that your local run finishes normally and produces a complete result directory before attempting online submission.
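As a quick sanity check before submitting, something like the following can verify the local output; the `results/local_run` path is only a placeholder for wherever your Run Evaluation output actually lands:

```python
from pathlib import Path

# Hypothetical location: point this at the directory your local run wrote.
result_dir = Path("results/local_run")

# A run that finishes normally should leave a non-empty result directory behind.
if not result_dir.is_dir():
    raise SystemExit(f"{result_dir} does not exist; run the local benchmark first.")

entries = list(result_dir.iterdir())
if not entries:
    raise SystemExit(f"{result_dir} is empty; the local run did not produce results.")

print(f"Found {len(entries)} entries in {result_dir}")
```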
Online Submit Steps
The online workflow has three stages: create an online task, wait for the evaluation endpoint, and run evaluation workers against that endpoint.
1. Get your token
Open the platform landing page:
https://internrobotics.shlab.org.cn/eval/landing-page

Then:
- Sign in to the platform.
- Open the API key or secret management page from the top-right corner.
- Create a new API key and copy the token value (see the sketch below).
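The examples in this doc read the token from the `EBENCH_SUBMIT_TOKEN` environment variable. A minimal sketch to fail fast when it is missing; the variable name comes from the commands below, and you export it in your shell first (`export EBENCH_SUBMIT_TOKEN="<your token>"`):

```python
import os

# EBENCH_SUBMIT_TOKEN is the variable name used by the commands in this doc.
token = os.environ.get("EBENCH_SUBMIT_TOKEN")
if not token:
    raise SystemExit(
        "EBENCH_SUBMIT_TOKEN is not set; create an API key on the platform "
        "and export it before submitting."
    )
```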
2. Prepare the client environment
```bash
git clone https://gitee.pjlab.org.cn/L2/MultimodalVLA/GenManip-Client.git
cd GenManip-Client
conda create -n client python=3.11 -y
conda activate client
pip install -e .
```
3. Create an online evaluation task
Use `gmp online submit` to request a remote evaluation job:
```bash
# --task_id is optional: pass it to continue with a previous task
gmp online submit \
  --base_url https://internrobotics.shlab.org.cn/eval \
  --token "$EBENCH_SUBMIT_TOKEN" \
  --task_id "$PREVIOUS_TASK" \
  --benchmark_set ebench_generalist \
  --model_name internVLA \
  --model_type VLA \
  --submitter_name test \
  --submitter_homepage test \
  --is_public 0
```
Parameters
Section titled “Parameters”| Parameter | Type | Example | Description |
|---|---|---|---|
| task_id | string | T2025123100001 | Optional; pass a previous task_id to re-run that task |
| model_name | string | internVLA | Model name |
| model_type | string | VLA | Model type |
| benchmark_set | string | ebench_generalist | Benchmark set type, currently only ebench_generalist is allowed |
| submitter_name | string | SHlab | Organization/developer name |
| submitter_homepage | string | http://example.com | Submitter homepage |
| is_public | int | 0 | Whether the result is public: 0 = No, 1 = Yes |
Once the backend task is ready, the command prints output like:
```
Waiting for available server (task_id=b5dddc6de60c4aec8236500b8e3dc0e1)...
Still waiting... elapsed 0.1s. Next check in 5.0s.
Still waiting... elapsed 5.3s. Next check in 5.0s.
Ready after 10.4s. endpoint=https://internverse.shlab.org.cn/eval-server/2813aea1/api/predict/embodied_eval.genmanip_eas_1_master_prod
{
  "task_id": "b5dddc6de60c4aec8236500b8e3dc0e1",
  "endpoint": "https://internverse.shlab.org.cn/eval-server/2813aea1/api/predict/embodied_eval.genmanip_eas_1_master_prod",
  "response": {
    "code": 0,
    "msg": "success",
    "trace_id": "4a4136c66bdc80922ccc6485c44fa9e5",
    "data": {
      "ready": true,
      "endpoint": "https://internverse.shlab.org.cn/eval-server/2813aea1/api/predict/embodied_eval.genmanip_eas_1_master_prod"
    }
  }
}
```
Record both values:
- `task_id`: use this as the `run_id` when running evaluation.
- `endpoint`: use this as the remote evaluation URL.
Demo: Auto extract endpoint and task_id
The following example uses a simplified Python script to run `gmp online submit` and extract `endpoint` and `task_id` from the returned output:
```python
import os
import json
import subprocess


def submit_online_task() -> tuple[str, str]:
    cmd = [
        'gmp', 'online', 'submit',
        '--base_url', 'https://internrobotics.shlab.org.cn/eval',
        '--token', os.environ['EBENCH_SUBMIT_TOKEN'],
        '--benchmark_set', 'ebench_generalist',
        '--model_name', 'internVLA',
        '--model_type', 'VLA',
        '--submitter_name', 'test',
        '--submitter_homepage', 'test',
        '--is_public', '0',
    ]
    # Run the CLI and capture its stdout.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    output = result.stdout
    # The JSON payload begins at the first '{' after the waiting logs.
    json_start = output.find('{')
    payload = json.loads(output[json_start:])
    endpoint = payload['endpoint']
    task_id = payload['task_id']
    print('endpoint=' + endpoint)
    print('task_id=' + task_id)
    return endpoint, task_id
```

The script prints `endpoint` and `task_id` directly; use them in the evaluation worker call below.
4. Start evaluation workers
Run the evaluator against the returned endpoint. The loop below is a test evaluation; follow Integrate Your Own Model to create your own model evaluation.
```python
endpoint, task_id = submit_online_task()

client = EvalClient(
    base_url=endpoint,
    token=os.environ['EBENCH_SUBMIT_TOKEN'],
    run_id=task_id,
    worker_ids=["0"],
)
model = ModelClient(...)

try:
    obs = client.reset()
    done = False
    while not done:
        # Generate actions for the entire chunk
        action_chunk = model.get_action_chunk(obs)
        # Server executes the chunk internally; returns obs at the next re-inference point
        obs, done = client.step(action_chunk)
finally:
    client.close()
```
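If you want to smoke-test this loop before wiring in a real policy, a dummy stand-in with the same `get_action_chunk` interface can help. This stub is purely hypothetical: the chunk length and action dimension are illustrative guesses, and the real model client comes from Integrate Your Own Model.

```python
class DummyModelClient:
    """Hypothetical stand-in for ModelClient that emits fixed zero actions."""

    def __init__(self, chunk_len: int = 8, action_dim: int = 7):
        # chunk_len and action_dim are illustrative guesses, not documented values.
        self.chunk_len = chunk_len
        self.action_dim = action_dim

    def get_action_chunk(self, obs):
        # Ignore the observation and return a chunk of zero actions.
        return [[0.0] * self.action_dim for _ in range(self.chunk_len)]
```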
You can start several eval clients with different worker IDs, e.g.

```python
client = EvalClient(
    base_url=endpoint,
    token=os.environ['EBENCH_SUBMIT_TOKEN'],
    run_id=task_id,
    worker_ids=["1"],
)
...
```

The server supports up to 16 concurrent workers per run. Connections are terminated after ten minutes of inactivity.
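One way to drive several workers at once is one thread per worker ID. A sketch under the assumptions above: `run_worker` wraps the same reset/step loop, and `DummyModelClient` is the hypothetical stub from earlier.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def run_worker(worker_id: str) -> None:
    # Each worker gets its own EvalClient bound to the same run_id.
    client = EvalClient(
        base_url=endpoint,
        token=os.environ['EBENCH_SUBMIT_TOKEN'],
        run_id=task_id,
        worker_ids=[worker_id],
    )
    model = DummyModelClient()  # swap in your real model client
    try:
        obs = client.reset()
        done = False
        while not done:
            obs, done = client.step(model.get_action_chunk(obs))
    finally:
        client.close()

# Up to 16 concurrent workers per run are allowed; four threads shown here.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(run_worker, wid) for wid in ["0", "1", "2", "3"]]
    for future in futures:
        future.result()  # surface any worker exception
```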
You can restart a failed evaluation submission by reusing the same `task_id`:

```bash
# restart the task above
gmp online submit \
  --base_url https://internrobotics.shlab.org.cn/eval \
  --token "$EBENCH_SUBMIT_TOKEN" \
  --task_id 9ea5fb6ae980430da626958c4433ea18 \
  # ...
```

If you encounter connection timeouts, restart the client; progress is saved on the server.
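Because progress is saved server-side, a thin retry wrapper around the worker loop is usually enough to ride out transient timeouts. A sketch; the broad `except` and the backoff schedule are our own choices, not part of the documented client API:

```python
import time

def run_worker_with_retries(worker_id: str, max_retries: int = 3) -> None:
    # Progress is persisted server-side, so a restarted client resumes the run.
    for attempt in range(1, max_retries + 1):
        try:
            run_worker(worker_id)  # the worker loop sketched above
            return
        except Exception as exc:  # e.g. a connection timeout
            print(f"worker {worker_id} attempt {attempt} failed: {exc}")
            time.sleep(5.0 * attempt)  # simple linear backoff
    raise RuntimeError(f"worker {worker_id} failed after {max_retries} retries")
```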
5. Monitor the task
After the online task is created, the platform page will show the corresponding task. The final evaluation outputs are written to the same remote task record.
You can also check the server status and task progress from the terminal:
```bash
gmp status \
  --url "$EBENCH_ONLINE_ENDPOINT" \
  --token "$EBENCH_SUBMIT_TOKEN" \
  --run_id "$EBENCH_TASK_ID"
```
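To poll this from a script, you can shell out to `gmp status` on an interval. A minimal sketch; the 30-second interval is our own choice, and the environment variable names mirror the command above:

```python
import os
import subprocess
import time

def poll_status(interval_s: float = 30.0) -> None:
    cmd = [
        'gmp', 'status',
        '--url', os.environ['EBENCH_ONLINE_ENDPOINT'],
        '--token', os.environ['EBENCH_SUBMIT_TOKEN'],
        '--run_id', os.environ['EBENCH_TASK_ID'],
    ]
    while True:
        # Re-run `gmp status` on an interval and echo whatever it reports.
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(result.stdout.strip())
        time.sleep(interval_s)
```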
6. Stop the task
Stop an evaluation session with:
```bash
gmp online stop \
  --url "$EBENCH_ONLINE_ENDPOINT" \
  --token "$EBENCH_SUBMIT_TOKEN" \
  --run_id "$EBENCH_TASK_ID" \
  --user_id "$USER_ID"  # get from the website, Your account page
```
Scoring Rules
- Each evaluated episode produces a task score between `0.0` and `1.0`.
- A task receives full score when the required goal condition is completed within the episode; otherwise it receives `0.0`.
- The leaderboard score is the average task score across the evaluated episodes in the submitted benchmark set.
- For task-specific success semantics, see Task Showcase, where each task includes its `Location`, `Instruction`, and `Score` description.
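As a worked example of the averaging rule, with five hypothetical episode outcomes:

```python
# Hypothetical episode outcomes: 1.0 when the goal condition was met, else 0.0.
episode_scores = [1.0, 0.0, 1.0, 1.0, 0.0]

# Leaderboard score = mean task score across evaluated episodes.
leaderboard_score = sum(episode_scores) / len(episode_scores)
print(leaderboard_score)  # 0.6
```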
Example Checklist
- Baseline or custom model runs locally
- Correct benchmark track and split selected
- Submission token configured
- Online submit URL confirmed
- Result files ready for upload