# Model This tutorial introduces the structure and implementation of both System 1 (navdp) and System 2 (rdp) policy models in the InterNav framework. --- ## System 1: Navdp This tutorial introduces the structure and implementation of the navdp policy model in the InterNav framework, helping you understand and customize each module. --- ### Model Structure Overview The navdp policy model in InterNav mainly consists of the following parts: - **RGBD Encoder (NavDP_RGBD_Backbone)**: Extracts multi-frame RGB+Depth features. - **Goal Point/Image Encoder**: Encodes goal point or goal image information. - **Transformer Decoder**: Temporal modeling and action generation. - **Action Head / Value Head**: Outputs action sequences and value estimation. - **Diffusion Scheduler**: For action generation via diffusion process. The model entry is `NavDPNet` (in `internnav/model/basemodel/navdp/navdp_policy.py`), which inherits from `transformers.PreTrainedModel` and supports HuggingFace-style loading and fine-tuning. --- ### Main Module Explanation #### 1. RGBD Encoder Located in `internnav/model/encoder/navdp_backbone.py`: ```python class NavDP_RGBD_Backbone(nn.Module): def __init__(self, image_size=224, embed_size=512, ...): ... def forward(self, images, depths): # Input: [B, T, H, W, 3], [B, T, H, W, 1] # Output: [B, memory_size*16, token_dim] ... ``` - Supports multi-frame historical image and depth input, outputs temporal features. - Optional finetune. #### 2. Goal Point/Image Encoder - Goal point encoding: `nn.Linear(3, token_dim)` - Goal image encoding: `NavDP_ImageGoal_Backbone` / `NavDP_PixelGoal_Backbone` #### 3. Transformer Decoder ```python self.decoder_layer = nn.TransformerDecoderLayer(d_model=token_dim, nhead=heads, ...) self.decoder = nn.TransformerDecoder(self.decoder_layer, num_layers=temporal_depth) ``` - Responsible for temporal modeling and conditional action generation. #### 4. Action Head and Value Head ```python self.action_head = nn.Linear(token_dim, 3) # Output action self.critic_head = nn.Linear(token_dim, 1) # Output value ``` #### 5. Diffusion Scheduler ```python self.noise_scheduler = DDPMScheduler(num_train_timesteps=10, ...) ``` - Used for diffusion-based action generation and denoising. --- ### Example Forward Process The core forward process is as follows: ```python def forward(self, goal_point, goal_image, input_images, input_depths, output_actions, augment_actions): # 1. Encode historical RGBD rgbd_embed = self.rgbd_encoder(input_images, input_depths) # 2. Encode goal point/image pointgoal_embed = self.point_encoder(goal_point).unsqueeze(1) # 3. Noise sampling and diffusion noise, time_embeds, noisy_action_embed = self.sample_noise(output_actions) # 4. Conditional decoding to generate actions cond_embedding = ... action_embeddings = ... output = self.decoder(tgt=action_embeddings, memory=cond_embedding, tgt_mask=self.tgt_mask) # 5. Output action and value action_pred = self.action_head(output) value_pred = self.critic_head(output.mean(dim=1)) return action_pred, value_pred ``` --- ### Key Code Snippets #### Load Model ```python from internnav.model.basemodel.navdp.navdp_policy import NavDPNet, NavDPModelConfig model = NavDPNet(NavDPModelConfig(model_cfg=...)) ``` #### Customization and Extension To customize the backbone, decoder, or heads, refer to `navdp_policy.py` and `navdp_backbone.py`, implement your own modules, and replace them in the configuration. --- ### Reference - [navdp_policy.py](../../internnav/model/basemodel/navdp/navdp_policy.py) - [navdp_backbone.py](../../internnav/model/encoder/navdp_backbone.py) - [navdp.py config](../../scripts/train/configs/navdp.py) --- ## System 2: InternVLA-N1-S2 *TODO