Training#

This tutorial provides a detailed guide for training both System 1 (navdp) and System 2 (rdp) policy models within the InterNav framework.


System 1: Navdp#

This tutorial provides a detailed guide for training the navdp policy model within the InterNav framework. It covers the training workflow, configuration and parameters, command-line usage, and troubleshooting.


Overview of the Training Process#

The navdp training process in InterNav includes the following steps:

  1. Model Initialization: Load navdp configuration and initialize model structure and parameters.

  2. Dataset Loading: Configure dataset paths and preprocessing, build the DataLoader.

  3. Training Parameter Setup: Set batch size, learning rate, optimizer, and other hyperparameters.

  4. Distributed Training Environment Initialization: Multi-GPU training is supported out of the box.

  5. Training Execution: Start the main training loop, with automatic checkpointing and logging.


Quick Start#

1. Environment Preparation#

Ensure you have installed InterNav and its dependencies, and have access to a multi-GPU environment.

2. Configuration Check#

The navdp training configuration file is located at:

InternNav/scripts/train/configs/navdp.py

You can modify parameters such as batch_size, epochs, and dataset path as needed.

3. Start Training#

Use the provided shell script for one-click startup:

cd InternNav/scripts/train
bash start_train.sh --name <experiment_name> --model navdp
  • <experiment_name>: Custom name for this experiment (e.g., 20250723_navdp_train_debug).

This script will automatically allocate 8 GPUs and use torchrun to launch distributed training.

Core Command in the Script#
torchrun \
    --nproc_per_node=8 \
    --master_port=29500 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=12345 \
    scripts/train/train.py \
    --name "$NAME" \
    --model-name "$MODEL"

Training Parameters and Configuration#

The main training parameters for navdp are set in scripts/train/configs/navdp.py. Common parameters include:

Parameter Description Example
epochs Number of training epochs 1000
batch_size Batch size per GPU 16
lr Learning rate 1e-4
num_workers DataLoader workers 8
dataset_navdp Dataset json path /path/to/multiview_dataset.json
image_size Input image size 224
memory_size Number of history frames 8
predict_size Prediction steps 24
temporal_depth Transformer layers 16
token_dim Feature dimension 384
dropout Dropout probability 0.1
finetune Whether to finetune backbone True

For more parameters, see the comments in the configuration file.


Logging and Model Saving#

  • Logs, tensorboard files, and checkpoints are saved by default under data/checkpoints/<experiment_name>/.

  • Tensorboard is supported for visualizing the training process.


Troubleshooting#

  • Multi-GPU training error: Check that CUDA_VISIBLE_DEVICES matches the actual number of GPUs.

  • Dataset path error: Ensure the json file at dataset_navdp exists and is correctly formatted.

  • Out of memory: Try reducing batch_size or image_size.


For customizing the model structure or dataset format, see model.md and dataset.md.


System 2: InternVLA-N1-S2#

*TODO