Model#
This tutorial introduces the structure and implementation of both System 1 (navdp) and System 2 (rdp) policy models in the InterNav framework.
System 1: Navdp#
This tutorial introduces the structure and implementation of the navdp policy model in the InterNav framework, helping you understand and customize each module.
Model Structure Overview#
The navdp policy model in InterNav mainly consists of the following parts:
RGBD Encoder (NavDP_RGBD_Backbone): Extracts multi-frame RGB+Depth features.
Goal Point/Image Encoder: Encodes goal point or goal image information.
Transformer Decoder: Temporal modeling and action generation.
Action Head / Value Head: Outputs action sequences and value estimation.
Diffusion Scheduler: For action generation via diffusion process.
The model entry is NavDPNet
(in internnav/model/basemodel/navdp/navdp_policy.py
), which inherits from transformers.PreTrainedModel
and supports HuggingFace-style loading and fine-tuning.
Main Module Explanation#
1. RGBD Encoder#
Located in internnav/model/encoder/navdp_backbone.py
:
class NavDP_RGBD_Backbone(nn.Module):
def __init__(self, image_size=224, embed_size=512, ...):
...
def forward(self, images, depths):
# Input: [B, T, H, W, 3], [B, T, H, W, 1]
# Output: [B, memory_size*16, token_dim]
...
Supports multi-frame historical image and depth input, outputs temporal features.
Optional finetune.
2. Goal Point/Image Encoder#
Goal point encoding:
nn.Linear(3, token_dim)
Goal image encoding:
NavDP_ImageGoal_Backbone
/NavDP_PixelGoal_Backbone
3. Transformer Decoder#
self.decoder_layer = nn.TransformerDecoderLayer(d_model=token_dim, nhead=heads, ...)
self.decoder = nn.TransformerDecoder(self.decoder_layer, num_layers=temporal_depth)
Responsible for temporal modeling and conditional action generation.
4. Action Head and Value Head#
self.action_head = nn.Linear(token_dim, 3) # Output action
self.critic_head = nn.Linear(token_dim, 1) # Output value
5. Diffusion Scheduler#
self.noise_scheduler = DDPMScheduler(num_train_timesteps=10, ...)
Used for diffusion-based action generation and denoising.
Example Forward Process#
The core forward process is as follows:
def forward(self, goal_point, goal_image, input_images, input_depths, output_actions, augment_actions):
# 1. Encode historical RGBD
rgbd_embed = self.rgbd_encoder(input_images, input_depths)
# 2. Encode goal point/image
pointgoal_embed = self.point_encoder(goal_point).unsqueeze(1)
# 3. Noise sampling and diffusion
noise, time_embeds, noisy_action_embed = self.sample_noise(output_actions)
# 4. Conditional decoding to generate actions
cond_embedding = ...
action_embeddings = ...
output = self.decoder(tgt=action_embeddings, memory=cond_embedding, tgt_mask=self.tgt_mask)
# 5. Output action and value
action_pred = self.action_head(output)
value_pred = self.critic_head(output.mean(dim=1))
return action_pred, value_pred
Key Code Snippets#
Load Model#
from internnav.model.basemodel.navdp.navdp_policy import NavDPNet, NavDPModelConfig
model = NavDPNet(NavDPModelConfig(model_cfg=...))
Customization and Extension#
To customize the backbone, decoder, or heads, refer to navdp_policy.py
and navdp_backbone.py
, implement your own modules, and replace them in the configuration.
Reference#
System 2: InternVLA-N1-S2#
*TODO