Customizing Models and Agents in InternNav#

This tutorial explains how to register a new agent and a new model within the InternNav framework.

Development Overview#

The evaluation code follows a client-server architecture. On the client side, we specify the corresponding configuration (*.cfg), which includes settings such as the scenarios to be evaluated, robots, models, and parallelization parameters. The client sends requests to the server, which runs the model to produce a prediction and returns the response to the client.
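
The request/response flow described above can be sketched in a few lines. This is a toy, in-process illustration only: `ToyServer`, `ToyClient`, and the dictionary request format are hypothetical stand-ins, not InternNav's actual classes or wire protocol.

```python
from typing import Callable, Dict, List


class ToyServer:
    """Hypothetical server: owns the model and answers prediction requests."""

    def __init__(self, model: Callable[[List[dict]], List[int]]):
        self.model = model

    def handle(self, request: Dict) -> Dict:
        # Run the model on the observations carried by the request.
        actions = self.model(request["obs"])
        return {"actions": actions}


class ToyClient:
    """Hypothetical client: forwards observations and unpacks the response."""

    def __init__(self, server: ToyServer):
        # A real client would hold a host and port instead of an object reference.
        self.server = server

    def step(self, obs: List[dict]) -> List[int]:
        response = self.server.handle({"obs": obs})
        return response["actions"]


def always_forward(obs_batch: List[dict]) -> List[int]:
    # Placeholder "model": action 1 (move forward) for every environment.
    return [1 for _ in obs_batch]


client = ToyClient(ToyServer(always_forward))
actions = client.step([{"instruction": "go to kitchen"}])
```

The point of the split is that the client only needs the configuration and the observations, while the model and its weights live entirely on the server side.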

The InternNav project adopts a modular design, allowing developers to easily add new navigation algorithms. The main components include:

Model: Implements the specific neural network architecture and inference logic
Agent: Serves as a wrapper for the Model, handling environment interaction and data preprocessing
Config: Defines configuration parameters for the model and training

Supported Models#

InternVLA-N1
CMA (Cross-Modal Attention)
RDP (Recurrent Diffusion Policy)
Navid (RSS2023)
Seq2Seq Policy

Custom Model#

A Model is the concrete implementation of your algorithm. Implement your model under baselines/models. A model should ideally inherit from the base model and implement the following key methods:

forward(train_batch) -> dict(output, loss)
inference(obs_batch, state) -> output_for_agent
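
As a rough illustration of the two methods above, the following dependency-free sketch shows one possible shape for a model class. `BaseNavModel`, `MyNavModel`, the batch keys `"obs"` and `"actions"`, and the toy scoring rule are all assumptions made for this example; the real base class and tensor types are not shown in this tutorial.

```python
from typing import Any, Dict, List


class BaseNavModel:
    """Hypothetical stand-in for the framework's base model class."""


class MyNavModel(BaseNavModel):
    def __init__(self, num_actions: int = 4):
        self.num_actions = num_actions

    def _scores(self, obs: Dict[str, Any]) -> List[float]:
        # Placeholder "network": prefer action 1 when an instruction is present,
        # otherwise prefer action 0.
        scores = [0.0] * self.num_actions
        scores[1 if obs.get("instruction") else 0] = 1.0
        return scores

    def forward(self, train_batch: Dict[str, list]) -> Dict[str, Any]:
        """Training path: return the outputs together with a loss."""
        outputs = [self._scores(o) for o in train_batch["obs"]]
        preds = [s.index(max(s)) for s in outputs]
        labels = train_batch["actions"]
        # Toy loss: fraction of samples whose predicted action differs from the label.
        loss = sum(p != y for p, y in zip(preds, labels)) / len(labels)
        return {"output": outputs, "loss": loss}

    def inference(self, obs_batch: List[dict], state: Any) -> List[int]:
        """Evaluation path: return one action id per observation (state unused here)."""
        return [s.index(max(s)) for s in map(self._scores, obs_batch)]
```

A real implementation would replace `_scores` with the network's forward pass and would typically thread `state` (for example, recurrent hidden states) through `inference`.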

Create a Custom Config Class#

In the model file, define a Config class that inherits from PretrainedConfig. A reference implementation is CMAModelConfig in cma_model.py.
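
A minimal config class might look like the sketch below, assuming `PretrainedConfig` refers to the Hugging Face `transformers` class. The class name `MyNavModelConfig`, the `model_type` string, and the two fields are hypothetical; consult `CMAModelConfig` for the fields InternNav actually expects. The fallback class only exists so the sketch runs without `transformers` installed.

```python
try:
    from transformers import PretrainedConfig
except ImportError:
    class PretrainedConfig:  # minimal stand-in when transformers is unavailable
        def __init__(self, **kwargs):
            for key, value in kwargs.items():
                setattr(self, key, value)


class MyNavModelConfig(PretrainedConfig):
    # Hypothetical identifier for this model family.
    model_type = "my_nav_model"

    def __init__(self, hidden_size: int = 256, num_actions: int = 4, **kwargs):
        super().__init__(**kwargs)
        # Hypothetical hyperparameters, stored as plain attributes.
        self.hidden_size = hidden_size
        self.num_actions = num_actions


cfg = MyNavModelConfig(hidden_size=128)
```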

Registration and Integration#

In internnav/model/__init__.py:

Add the new model to get_policy.
Add the new model’s configuration to get_config.
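
The bodies of `get_policy` and `get_config` are not shown in this tutorial, so the following is only a guess at their general shape: a name-based lookup that returns the model class and its config class. The dictionaries, the `'MyNav_Policy'` key, and the single-argument signatures are assumptions; adapt them to what internnav/model/__init__.py actually contains.

```python
class MyNavModel:
    """Placeholder for your model class."""


class MyNavModelConfig:
    """Placeholder for your config class."""


# Hypothetical name -> class tables; add one entry per new model.
_POLICIES = {"MyNav_Policy": MyNavModel}
_CONFIGS = {"MyNav_Policy": MyNavModelConfig}


def get_policy(policy_name: str) -> type:
    """Return the model class registered under policy_name."""
    try:
        return _POLICIES[policy_name]
    except KeyError:
        raise ValueError(f"Unknown policy: {policy_name!r}") from None


def get_config(policy_name: str) -> type:
    """Return the config class registered under policy_name."""
    try:
        return _CONFIGS[policy_name]
    except KeyError:
        raise ValueError(f"Unknown policy: {policy_name!r}") from None
```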

Create a Custom Agent#

The Agent handles interaction with the environment, data preprocessing/postprocessing, and calls the Model for inference. A custom Agent usually inherits from Agent and implements the following key methods:

reset(): Resets the Agent’s internal state (e.g., RNN states, action history). Called at the start of each episode.
inference(obs): Receives environment observations obs, performs preprocessing (e.g., tokenizing instructions, padding), calls the model for inference, and returns an action.
step(obs): The external interface. It usually calls inference and can also include logging or timing.
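
The division of labour between the three methods can be sketched as follows. `BaseAgent` is a hypothetical stand-in for the real `Agent` base class, and the `policy` callable, the lower-casing preprocessing step, and the `last_step_seconds` attribute are illustrative assumptions rather than InternNav behaviour.

```python
import time
from typing import Any, Callable, List


class BaseAgent:
    """Hypothetical stand-in for internnav's Agent base class."""


class MyAgent(BaseAgent):
    def __init__(self, policy: Callable[[List[dict], Any], List[int]]):
        self.policy = policy
        self.reset()

    def reset(self) -> None:
        # Clear per-episode state at the start of each episode.
        self.state = None
        self.action_history: List[List[int]] = []
        self.last_step_seconds = 0.0

    def inference(self, obs: List[dict]) -> List[int]:
        # Preprocess: normalise the instruction text (placeholder for tokenizing).
        batch = [{**o, "instruction": o["instruction"].strip().lower()} for o in obs]
        actions = self.policy(batch, self.state)
        self.action_history.append(actions)
        return actions

    def step(self, obs: List[dict]) -> List[int]:
        # External entry point: wrap inference with simple timing.
        start = time.perf_counter()
        actions = self.inference(obs)
        self.last_step_seconds = time.perf_counter() - start
        return actions
```

Keeping timing and logging in `step` leaves `inference` free of side concerns, which makes it easier to test in isolation.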

Example: CMAAgent

At each step, the agent should expect an observation from the environment.

For the VLN benchmark under InternUtopia:

action = self.agent.step(obs)

obs has format:

obs = [{
    'globalgps': [X, Y, Z],             # robot location
    'globalrotation': [X, Y, Z, W],     # robot orientation as a quaternion
    'rgb': np.ndarray,                  # RGB camera image, shape (256, 256, 3)
    'depth': np.ndarray,                # depth image, shape (256, 256, 1)
    'instruction': str,                 # language instruction for the navigation task
}]
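
When debugging a custom agent, it can help to check that each observation carries the keys listed above before preprocessing. The helper below is a hypothetical convenience, not part of InternNav, and it assumes your agent needs all five keys.

```python
from typing import List

# Keys taken from the observation format above.
EXPECTED_KEYS = {"globalgps", "globalrotation", "rgb", "depth", "instruction"}


def missing_keys(obs: List[dict]) -> List[List[str]]:
    """Return, for each environment, the sorted expected keys that are absent."""
    return [sorted(EXPECTED_KEYS - set(entry)) for entry in obs]


report = missing_keys([{"rgb": None, "depth": None, "instruction": "go to kitchen"}])
```

An empty inner list means the corresponding observation has every expected key.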

action has format:

action = List[int]                      # one action per environment
# 0: stop
# 1: move forward
# 2: turn left
# 3: turn right
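
To avoid scattering magic numbers through agent code, the four action ids above can be given names. The `NavAction` enum and the `describe` helper are hypothetical conveniences for illustration; InternNav itself only specifies the integer values.

```python
from enum import IntEnum
from typing import List


class NavAction(IntEnum):
    STOP = 0
    MOVE_FORWARD = 1
    TURN_LEFT = 2
    TURN_RIGHT = 3


def describe(actions: List[int]) -> List[str]:
    """Map a list of action ids (one per environment) to readable names."""
    return [NavAction(a).name for a in actions]


names = describe([1, 0])
```

Because `IntEnum` members compare equal to plain integers, a list such as `[NavAction.MOVE_FORWARD]` is equal to `[1]`.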

Registration#

The agent should be registered with internnav.agent so that it can be referenced by name in configs.

from internnav.agent.base import Agent
from internnav.configs.agent import AgentCfg

@Agent.register('cma')
class NewAgent(Agent):
    def __init__(self, agent_config: AgentCfg):
        ...

Make sure you also import it inside internnav/agent/__init__.py:

# importing the module is what makes the register decorator take effect
from internnav.agent.internvla_n1_agent import InternVLAN1Agent
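
The reason the import matters is that a class decorator only runs when its module is executed. The sketch below shows a generic registry pattern to illustrate this; `Registrable`, `by_name`, and the `'my_agent'` key are hypothetical, and InternNav's actual `Agent.register` implementation may differ.

```python
from typing import Callable, Dict


class Registrable:
    # Shared name -> class table, filled in as decorated classes are defined.
    _registry: Dict[str, type] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[type], type]:
        def decorator(subclass: type) -> type:
            if name in cls._registry:
                raise ValueError(f"{name!r} is already registered")
            cls._registry[name] = subclass
            return subclass  # the class itself is returned unchanged
        return decorator

    @classmethod
    def by_name(cls, name: str) -> type:
        return cls._registry[name]


# The registration happens here, at class-definition (import) time.
@Registrable.register("my_agent")
class MyAgent(Registrable):
    pass


agent_cls = Registrable.by_name("my_agent")
```

If the module defining `MyAgent` is never imported, the decorator never runs and the name lookup fails, which is why the package `__init__.py` imports each agent module.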

Agent and Model Initialization#

Refer to existing evaluation config files for customization:

agent_cfg=AgentCfg(
    server_host='localhost',
    server_port=8023,
    model_name='internvla_n1',
    ckpt_path='',
    model_settings={
        'policy_name': 'InternVLAN1_Policy',
        'state_encoder': None,
    },
)

Typical Usage Example#

from internnav.configs.agent import AgentCfg
# AgentClient must be imported as well; its module path is not shown here.

cfg = AgentCfg(server_host="127.0.0.1", server_port=8087)
client = AgentClient(cfg)

# step once
obs = [{"rgb": ..., "depth": ..., "instruction": "go to kitchen"}]
action = client.step(obs)
print("Predicted action:", action)

# reset agent
client.reset()