Training#
This document provides instructions for training models in InternNav.
Overview#
InternNav supports training models under three system paradigms:
Dual-System VLN Models: integrated System2 + System1 architectures
Single-System VLN Models: end-to-end vision-and-language navigation models
VN System (System1) Models: low-level visual navigation and control models
Each paradigm follows a different training protocol, which is detailed below.
Dual-System VLN Models#
Dual-System VLN Models integrate System2 (high-level reasoning and planning) with
System1 (low-level action control), supporting both modular integration and joint training.
Supported Systems#
InternVLA-N1 (System2)
InternVLA-N1 (Dual System) w/ NavDP* (NavDP* indicates joint tuning with System2)
InternVLA-N1 (Dual System) DualVLN
1. Training for InternVLA-N1 (System2)#
InternVLA-N1 (System2) is trained independently to predict 2D pixel goals for navigation.
It can be used with any compatible System1 model capable of executing 2D pixel goals or point goals (given depth and pose).
Alternatively, it can be jointly trained with a System1 model for end-to-end multi-system optimization.
Training Command#
# training system2 separately
sbatch ./scripts/train/qwenvl_train/train_system2.sh
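If a Slurm scheduler is not available, the same script can usually be launched directly with bash; whether it depends on Slurm-provided environment variables is an assumption you should verify against your local copy:
# sketch: run System2 training without Slurm
# assumption: the script does not require Slurm environment variables
bash ./scripts/train/qwenvl_train/train_system2.sh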
2. Joint Training for InternVLA-N1 (Dual System)#
Once training of InternVLA-N1 (System2) is complete, it can be jointly trained with a pixel-goal navigation System1 built on either the NavDP or NextDiT architecture.
InternVLA-N1 (Dual System) w/ NavDP: preserves NavDP’s model design and uses RGB-D input.
InternVLA-N1 (Dual System) DualVLN: uses only RGB input, resulting in a smaller model footprint.
Training Command#
# training system1 based on system2
sbatch ./scripts/train/qwenvl_train/train_dual_system.sh
For the w/ NavDP model variant, set system1=navdp_async; optimal performance is typically observed after 30,000 iterations.
For the DualVLN model variant, set system1=nextdit_async; optimal performance is typically observed after 15,000 iterations.
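How the system1 option is consumed depends on the launch setup; the sketch below assumes train_dual_system.sh reads it from the environment, which is an assumption to verify (it may instead be set inside the script or a config file):
# sketch: select the System1 head before submitting the joint-training job
# assumption: the script reads system1 from the environment;
# if it is set inside the script or a config file, edit it there instead
export system1=navdp_async   # use nextdit_async for the DualVLN variant
sbatch ./scripts/train/qwenvl_train/train_dual_system.sh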
Single-System VLN Models#
Single-System VLN Models directly map visual observations and language instructions to navigation actions in an end-to-end manner.
Supported Models#
The following Single-System VLN Models are currently supported:
Seq2Seq
CMA
RDP
For our VLM-based VLN model StreamVLN, please refer to the following repository for training details:
https://github.com/InternRobotics/StreamVLN
Support for StreamVLN within InternNav is planned for future releases.
Training Command#
Training is performed through a unified training entry script.
Below are example commands for each supported model.
Seq2Seq
./scripts/train/base_train/start_train.sh --name seq2seq_train --model seq2seq
CMA
./scripts/train/base_train/start_train.sh --name cma_train --model cma
RDP
./scripts/train/base_train/start_train.sh --name rdp_train --model rdp
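Because the three baselines share the same entry script and flags, they can also be trained back-to-back with a simple shell loop (uses only the flags shown above):
# convenience loop: train all three single-system baselines sequentially
for model in seq2seq cma rdp; do
  ./scripts/train/base_train/start_train.sh --name "${model}_train" --model "$model"
done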
VN System (System1) Models#
VN System (System1) focuses on low-level visual navigation and motion control.
Supported Methods#
The following visual navigation methods are included in the System1 benchmark:
DD-PPO
iPlanner
ViPlanner
GNM
ViNT
NoMaD
NavDP (InternVLA-N1 System1)
Among them, only NavDP is currently supported for training in InternNav.
All other methods are provided for evaluation and comparison purposes only.
Training Command#
NavDP
./scripts/train/base_train/start_train.sh --name navdp_train --model-name navdp
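The --name flag appears to identify the run, so separate experiments can be kept apart by name; a minimal example reusing only the documented flags:
# example: a second NavDP run under a distinct experiment name
./scripts/train/base_train/start_train.sh --name navdp_train_run2 --model-name navdp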