Benchmark Overview

A quick orientation to what EBench evaluates and how scores are computed. Use this page as a map — follow the links into other pages for setup or implementation details.

  • Simulator. Built on NVIDIA Isaac Sim. The GenManip framework provides the simulation server, scenes, and asset packaging.
  • Architecture. Client–server: the server runs the simulation as a black box; your model talks to it through a lightweight client package. See Environment Setup.
  • Robot. All tasks use the lift2 embodiment — dual-arm with a mobile base and four 480×640 cameras. Per-frame state/action keys are listed in Asset & Dataset → Modalities per frame.
  • Tasks. 26 evaluation tasks covering long-horizon, dexterous, and mobile manipulation. Browse them in Task Showcase.
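To make the embodiment concrete, here is a hypothetical sketch of what one per-frame observation from lift2 might look like. The class, method, and key names (EBenchClient, reset, "cameras", "task") are illustrative assumptions, not the real client API — see Environment Setup and Asset & Dataset for the actual interfaces.

```python
# Hypothetical stand-in for the lightweight client that talks to the
# simulation server. All names here are assumptions for illustration.
class EBenchClient:
    NUM_CAMERAS = 4               # lift2 carries four cameras
    CAMERA_SHAPE = (480, 640, 3)  # per-camera RGB frame (H, W, C)

    def reset(self, task_id):
        # The real client would request a fresh episode from the server;
        # here we only return the expected shapes, not pixel data.
        return {
            "task": task_id,
            "cameras": [self.CAMERA_SHAPE] * self.NUM_CAMERAS,
        }

obs = EBenchClient().reset("demo_task")
print(len(obs["cameras"]))  # 4
```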

EBench organizes tasks into three submission tracks:

| Track | Focus | Backed by training subset(s) |
| --- | --- | --- |
| mobile_manip | Pick-and-place with a mobile base | long_horizon, simple_pnp |
| table_top_manip | Dexterous tabletop tasks | teleop_tasks |
| generalist | Mixed across categories (union of the two) | all of the above |

Each track is evaluated on three splits: val_train, val_unseen, test.
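For reference, the track/subset table and the split list above can be transcribed into a small config mapping. This is a sketch: the constant names are our own, not part of the client package.

```python
# Track -> training subset(s), transcribed from the table above.
TRACK_SUBSETS = {
    "mobile_manip": ["long_horizon", "simple_pnp"],
    "table_top_manip": ["teleop_tasks"],
    # generalist is the union of the other two tracks' subsets
    "generalist": ["long_horizon", "simple_pnp", "teleop_tasks"],
}

# Every track is evaluated on the same three splits.
SPLITS = ("val_train", "val_unseen", "test")
```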

Split semantics — WIP. A precise breakdown of which tasks/seeds land in each split will be documented here.

For how to submit each track, see Run Evaluation and the Challenge guide.

Scoring

  • Per-episode task score — a value in [0.0, 1.0]. An episode receives the full score (1.0) when the task’s goal condition is met within the episode, and 0.0 otherwise. Per-task success semantics live in Task Showcase under each task’s Score description.
  • Track score — the average of per-episode scores across all evaluated episodes in the submitted track/split.
  • Leaderboard — track scores are aggregated on the Challenge leaderboard.
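Putting the two definitions together, a track score is simply the mean of per-episode scores. A minimal sketch — the function name is ours, and the real evaluator computes this server-side:

```python
def track_score(episode_scores):
    """Mean of per-episode task scores (each in [0.0, 1.0]) for one track/split."""
    if not episode_scores:
        raise ValueError("no evaluated episodes")
    for s in episode_scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"episode score out of range: {s}")
    return sum(episode_scores) / len(episode_scores)

# Three successful episodes out of four:
print(track_score([1.0, 0.0, 1.0, 1.0]))  # 0.75
```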

Episode counts and time budgets — WIP. Number of episodes per track/split and per-episode step limits will be documented here.