Training#
This tutorial provides a detailed guide for training both the System 1 (NavDP) and System 2 (InternVLA-N1-S2) policy models within the InternNav framework.
System 1: NavDP#
This tutorial provides a detailed guide for training the NavDP policy model within the InternNav framework. It covers the training workflow, configuration parameters, command-line usage, and troubleshooting.
Overview of the Training Process#
The NavDP training process in InternNav includes the following steps (a minimal sketch of how these stages fit together follows the list):
- Model Initialization: Load NavDP configuration and initialize model structure and parameters. 
- Dataset Loading: Configure dataset paths and preprocessing, build the DataLoader. 
- Training Parameter Setup: Set batch size, learning rate, optimizer, and other hyperparameters. 
- Distributed Training Environment Initialization: Multi-GPU training is supported out of the box. 
- Training Execution: Start the main training loop, with automatic checkpointing and logging. 
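The sketch below illustrates how these stages typically fit together in a torchrun-launched script. It is a minimal, generic PyTorch DDP example, not the actual NavDP code in scripts/train/train.py; the placeholder model, dataset, and checkpoint names are illustrative only.
# Minimal illustrative DDP training sketch (placeholder model/dataset, not the real NavDP code).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # Step 4: distributed setup -- torchrun provides RANK/LOCAL_RANK/WORLD_SIZE.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # Step 1: model initialization (placeholder linear model).
    model = DDP(torch.nn.Linear(384, 24).cuda(local_rank), device_ids=[local_rank])
    # Step 2: dataset loading (placeholder random tensors).
    dataset = TensorDataset(torch.randn(1024, 384), torch.randn(1024, 24))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=16, sampler=sampler, num_workers=8)
    # Step 3: hyperparameters (batch size, lr, and workers mirror the defaults listed below).
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    # Step 5: main loop with per-epoch checkpointing on rank 0 (10 epochs for brevity).
    for epoch in range(10):
        sampler.set_epoch(epoch)
        for x, y in loader:
            loss = torch.nn.functional.mse_loss(model(x.cuda(local_rank)), y.cuda(local_rank))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if dist.get_rank() == 0:
            torch.save(model.module.state_dict(), f"checkpoint_epoch{epoch}.pt")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()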
Quick Start#
1. Environment Preparation#
Ensure you have installed InternNav and its dependencies, and have access to a multi-GPU environment.
2. Configuration Check#
The NavDP training configuration file is located at:
InternNav/scripts/train/configs/navdp.py
You can modify parameters such as batch_size, epochs, and dataset path as needed.
3. Start Training#
Use the provided shell script for one-click startup:
cd InternNav/scripts/train
bash start_train.sh --name <experiment_name> --model navdp
- <experiment_name>: Custom name for this experiment (e.g., 20250723_navdp_train_debug).
This script will automatically allocate 8 GPUs and use torchrun to launch distributed training.
Core Command in the Script#
torchrun \
    --nproc_per_node=8 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=29500 \
    scripts/train/train.py \
    --name "$NAME" \
    --model-name "$MODEL"
Training Parameters and Configuration#
The main training parameters for NavDP are set in scripts/train/configs/navdp.py. Common parameters include:
| Parameter | Description | Example | 
|---|---|---|
| epochs | Number of training epochs | 1000 | 
| batch_size | Batch size per GPU | 16 | 
| lr | Learning rate | 1e-4 | 
| num_workers | DataLoader workers | 8 | 
| dataset_navdp | Dataset json path | data/datasets/navdp_dataset.json | 
| image_size | Input image size | 224 | 
| memory_size | Number of history frames | 8 | 
| predict_size | Prediction steps | 24 | 
| temporal_depth | Transformer layers | 16 | 
| token_dim | Feature dimension | 384 | 
| dropout | Dropout probability | 0.1 | 
| finetune | Whether to finetune backbone | False | 
For more parameters, see the comments in the configuration file.
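For reference, configs of this kind are often plain Python modules with flat assignments. The snippet below is an illustrative sketch built from the parameter table above; the actual layout of scripts/train/configs/navdp.py may differ.
# Illustrative sketch only -- the real navdp.py may structure these values differently.
epochs = 1000                                        # number of training epochs
batch_size = 16                                      # batch size per GPU
lr = 1e-4                                            # learning rate
num_workers = 8                                      # DataLoader workers
dataset_navdp = "data/datasets/navdp_dataset.json"   # dataset json path
image_size = 224                                     # input image size
memory_size = 8                                      # number of history frames
predict_size = 24                                    # prediction steps
temporal_depth = 16                                  # transformer layers
token_dim = 384                                      # feature dimension
dropout = 0.1                                        # dropout probability
finetune = False                                     # whether to finetune the backbone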
Logging and Model Saving#
- Logs, TensorBoard files, and checkpoints are saved by default under data/checkpoints/<experiment_name>/.
- TensorBoard is supported for visualizing the training process.
Troubleshooting#
- Multi-GPU training error: Check that CUDA_VISIBLE_DEVICES matches the actual number of GPUs.
- Dataset path error: Ensure the json file at dataset_navdp exists and is correctly formatted (see the sanity check after this list).
- Out of memory: Try reducing batch_size or image_size.
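For the dataset path issue, a quick sanity check is to confirm that the file referenced by dataset_navdp exists and parses as valid JSON. The snippet below uses the example path from the parameter table; adjust it to match your configuration.
# Quick sanity check for the dataset json (path is the example from the table above).
import json
import os

path = "data/datasets/navdp_dataset.json"
assert os.path.isfile(path), f"dataset json not found: {path}"
with open(path) as f:
    data = json.load(f)   # raises json.JSONDecodeError if the file is malformed
print(f"Loaded dataset json; top-level type: {type(data).__name__}")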
For customizing the model structure or dataset format, see model.md and dataset.md.
System 2: InternVLA-N1-S2#
Training of the System 2 model (InternVLA-N1-S2) is not currently supported in this repository.
