Model#

This tutorial introduces the structure and implementation of the System 1 (navdp) and System 2 (rdp) policy models in the InternNav framework.


System 1: Navdp#

This section covers the structure and implementation of the navdp policy model in the InternNav framework, so that you can understand and customize each module.


Model Structure Overview#

The navdp policy model in InternNav consists mainly of the following parts:

  • RGBD Encoder (NavDP_RGBD_Backbone): Extracts multi-frame RGB+Depth features.

  • Goal Point/Image Encoder: Encodes goal point or goal image information.

  • Transformer Decoder: Temporal modeling and action generation.

  • Action Head / Value Head: Outputs action sequences and value estimation.

  • Diffusion Scheduler: For action generation via diffusion process.

The model entry point is NavDPNet (in internnav/model/basemodel/navdp/navdp_policy.py), which inherits from transformers.PreTrainedModel and therefore supports HuggingFace-style loading and fine-tuning.


Main Module Explanation#

1. RGBD Encoder#

Located in internnav/model/encoder/navdp_backbone.py:

class NavDP_RGBD_Backbone(nn.Module):
    def __init__(self, image_size=224, embed_size=512, ...):
        ...
    def forward(self, images, depths):
        # Input: [B, T, H, W, 3], [B, T, H, W, 1]
        # Output: [B, memory_size*16, token_dim]
        ...

  • Accepts multiple frames of historical RGB images and depth maps and outputs temporal features.

  • Fine-tuning the backbone is optional.
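The input and output shapes listed above can be checked with a small stand-in encoder. This is only a sketch: ToyRGBDEncoder, its convolution and pooling layers, and the sizes used (token_dim=64, 8 frames) are hypothetical and are not the real NavDP_RGBD_Backbone. It only reproduces the shape contract, with 16 tokens per frame.

```python
import torch
import torch.nn as nn


class ToyRGBDEncoder(nn.Module):
    """Hypothetical stand-in for NavDP_RGBD_Backbone (shape contract only)."""

    def __init__(self, token_dim=64, tokens_per_frame=16):
        super().__init__()
        side = int(tokens_per_frame ** 0.5)  # 16 tokens -> 4 x 4 grid
        # RGB (3 channels) and depth (1 channel) are stacked into 4 channels.
        self.conv = nn.Conv2d(4, token_dim, kernel_size=8, stride=8)
        self.pool = nn.AdaptiveAvgPool2d(side)

    def forward(self, images, depths):
        # images: [B, T, H, W, 3], depths: [B, T, H, W, 1]
        b, t = images.shape[:2]
        x = torch.cat([images, depths], dim=-1)   # [B, T, H, W, 4]
        x = x.flatten(0, 1).permute(0, 3, 1, 2)   # [B*T, 4, H, W]
        x = self.pool(self.conv(x))               # [B*T, D, 4, 4]
        x = x.flatten(2).transpose(1, 2)          # [B*T, 16, D]
        return x.reshape(b, t * x.shape[1], -1)   # [B, T*16, D]


encoder = ToyRGBDEncoder(token_dim=64)
images = torch.rand(2, 8, 224, 224, 3)   # batch of 2, 8 history frames
depths = torch.rand(2, 8, 224, 224, 1)
tokens = encoder(images, depths)
print(tokens.shape)  # torch.Size([2, 128, 64])
```

With 8 frames in memory, the encoder returns 8 * 16 = 128 tokens per sample, matching the [B, memory_size*16, token_dim] output noted in the class comment.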

2. Goal Point/Image Encoder#

  • Goal point encoding: nn.Linear(3, token_dim)

  • Goal image encoding: NavDP_ImageGoal_Backbone / NavDP_PixelGoal_Backbone
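As a minimal sketch of the goal point path, the snippet below projects a batch of 3-dimensional goal vectors into token space and adds a token axis, as the forward pass does with unsqueeze(1). The value token_dim=64 and the sample goal values are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

token_dim = 64  # assumed size; the real value comes from the model config

# Same layer type as in the text: three goal values -> one token.
point_encoder = nn.Linear(3, token_dim)

goal_point = torch.tensor([[2.0, -1.0, 0.0],
                           [0.5, 0.5, 1.0]])               # [B, 3]
pointgoal_embed = point_encoder(goal_point).unsqueeze(1)   # [B, 1, token_dim]
print(pointgoal_embed.shape)  # torch.Size([2, 1, 64])
```

The extra axis lets the goal be treated as a single token alongside the RGBD tokens.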

3. Transformer Decoder#

self.decoder_layer = nn.TransformerDecoderLayer(d_model=token_dim, nhead=heads, ...)
self.decoder = nn.TransformerDecoder(self.decoder_layer, num_layers=temporal_depth)

  • Performs temporal modeling and conditional action generation.
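The following sketch shows how such a decoder can be called on action tokens and condition tokens. Several details are assumptions, not taken from the source: batch_first=True, the sizes, and a causal (upper-triangular) tgt_mask, which may or may not match the self.tgt_mask used in NavDPNet.

```python
import torch
import torch.nn as nn

token_dim, heads, temporal_depth = 64, 4, 2   # assumed sizes
predict_size, memory_tokens = 24, 130         # assumed sequence lengths

decoder_layer = nn.TransformerDecoderLayer(
    d_model=token_dim, nhead=heads, batch_first=True
)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=temporal_depth)
decoder.eval()  # disable dropout so the outputs are deterministic

# Causal mask: position i may only attend to positions <= i.
tgt_mask = torch.triu(
    torch.full((predict_size, predict_size), float("-inf")), diagonal=1
)

action_embeddings = torch.randn(2, predict_size, token_dim)   # [B, P, D]
cond_embedding = torch.randn(2, memory_tokens, token_dim)     # [B, M, D]

with torch.no_grad():
    output = decoder(tgt=action_embeddings, memory=cond_embedding,
                     tgt_mask=tgt_mask)
print(output.shape)  # torch.Size([2, 24, 64])
```

The decoder output keeps the length of the action sequence; the condition tokens only enter through cross-attention.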

4. Action Head and Value Head#

self.action_head = nn.Linear(token_dim, 3)   # Output action
self.critic_head = nn.Linear(token_dim, 1)   # Output value
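To make the two output shapes concrete, here is a small sketch that applies both heads to a random decoder output. The sizes (token_dim=64, 24 predicted steps) are assumptions; pooling with mean(dim=1) follows the forward pass shown later in this tutorial.

```python
import torch
import torch.nn as nn

token_dim = 64  # assumed size
action_head = nn.Linear(token_dim, 3)   # per-step action
critic_head = nn.Linear(token_dim, 1)   # one value per trajectory

output = torch.randn(2, 24, token_dim)         # decoder output [B, P, D]
action_pred = action_head(output)              # [B, P, 3]
value_pred = critic_head(output.mean(dim=1))   # [B, 1]
print(action_pred.shape, value_pred.shape)
# torch.Size([2, 24, 3]) torch.Size([2, 1])
```

The action head is applied to every token, while the value head sees the average over the sequence and so returns a single score per sample.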

5. Diffusion Scheduler#

self.noise_scheduler = DDPMScheduler(num_train_timesteps=10, ...)
  • Used for diffusion-based action generation and denoising.
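The forward (noising) half of a DDPM process can be written by hand in a few lines. This sketch is not the DDPMScheduler API: the add_noise function here is a local helper, and the linear beta schedule with its endpoints is an assumption, since the scheduler arguments are elided above. Only num_train_timesteps=10 comes from the source.

```python
import torch

num_train_timesteps = 10
# Assumed linear beta schedule; the real scheduler settings are not shown.
betas = torch.linspace(1e-4, 0.02, num_train_timesteps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)


def add_noise(actions, noise, timesteps):
    """x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    a = alphas_cumprod[timesteps].sqrt().view(-1, 1, 1)
    s = (1.0 - alphas_cumprod[timesteps]).sqrt().view(-1, 1, 1)
    return a * actions + s * noise


actions = torch.randn(2, 24, 3)                 # clean actions [B, P, 3]
noise = torch.randn_like(actions)
timesteps = torch.randint(0, num_train_timesteps, (2,))  # one step per sample
noisy_actions = add_noise(actions, noise, timesteps)
print(noisy_actions.shape)  # torch.Size([2, 24, 3])
```

During training, the network is typically asked to recover the noise (or the clean actions) from noisy_actions and the timestep; at inference the process is run in reverse from pure noise.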


Example Forward Process#

The core forward process is as follows:

def forward(self, goal_point, goal_image, input_images, input_depths, output_actions, augment_actions):
    # 1. Encode historical RGBD
    rgbd_embed = self.rgbd_encoder(input_images, input_depths)
    # 2. Encode goal point/image
    pointgoal_embed = self.point_encoder(goal_point).unsqueeze(1)
    # 3. Noise sampling and diffusion
    noise, time_embeds, noisy_action_embed = self.sample_noise(output_actions)
    # 4. Conditional decoding to generate actions
    cond_embedding = ...
    action_embeddings = ...
    output = self.decoder(tgt=action_embeddings, memory=cond_embedding, tgt_mask=self.tgt_mask)
    # 5. Output action and value
    action_pred = self.action_head(output)
    value_pred = self.critic_head(output.mean(dim=1))
    return action_pred, value_pred
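The listing above leaves cond_embedding elided. One plausible way to build a decoder memory from the pieces computed in steps 1 to 3 is to concatenate them along the token axis. The snippet below illustrates that idea only; the ordering, the absence of positional embeddings, and all sizes are assumptions and may differ from navdp_policy.py.

```python
import torch

batch, token_dim = 2, 64  # assumed sizes

# Random stand-ins for the tensors produced in steps 1-3.
time_embeds = torch.randn(batch, 1, token_dim)       # diffusion step token
pointgoal_embed = torch.randn(batch, 1, token_dim)   # goal token
rgbd_embed = torch.randn(batch, 8 * 16, token_dim)   # memory_size * 16 tokens

# Assumption: conditions are joined along dim=1 to form the decoder memory.
cond_embedding = torch.cat([time_embeds, pointgoal_embed, rgbd_embed], dim=1)
print(cond_embedding.shape)  # torch.Size([2, 130, 64])
```

Because every condition is expressed as tokens of the same width, the decoder can attend over all of them through a single memory argument.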

Key Code Snippets#

Load Model#

from internnav.model.basemodel.navdp.navdp_policy import NavDPNet, NavDPModelConfig
model = NavDPNet(NavDPModelConfig(model_cfg=...))

Customization and Extension#

To customize the backbone, decoder, or heads, refer to navdp_policy.py and navdp_backbone.py, implement your own modules, and replace them in the configuration.
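As an illustration of what a custom module might look like, the sketch below defines a small MLP head with the same input and output shapes as nn.Linear(token_dim, 3). MLPActionHead and its parameters are hypothetical; how the module is registered in the configuration depends on navdp_policy.py and is not shown here.

```python
import torch
import torch.nn as nn


class MLPActionHead(nn.Module):
    """Hypothetical drop-in for nn.Linear(token_dim, 3)."""

    def __init__(self, token_dim, hidden_dim=128, action_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(token_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, tokens):
        # tokens: [B, P, token_dim] -> actions: [B, P, action_dim]
        return self.net(tokens)


head = MLPActionHead(token_dim=64)
action_pred = head(torch.randn(2, 24, 64))
print(action_pred.shape)  # torch.Size([2, 24, 3])
```

Keeping the shape contract unchanged means the rest of the forward pass does not need to be modified when the head is swapped.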


System 2: InternVLA-N1-S2#

TODO