Training#
This tutorial provides a detailed guide for training both the Dual-System model (InternVLA-N1) and the System 1 (NavDP) policy model within the InterNav framework.
Dual-System: InternVLA-N1#
1. Environment Preparation#
Ensure you have installed InterNav and its dependencies, and have access to a multi-GPU environment.
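For example, a quick sanity check (assuming the standard PyTorch dependency is already installed) that the node actually exposes several GPUs:

import torch
# The training scripts in this tutorial assume a multi-GPU node.
print(f"Visible CUDA devices: {torch.cuda.device_count()}")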
2. Start Training System 2#
# training system2 separately
sbatch ./scripts/train/base_train/qwenvl_train/train_system2.sh
Pretrained Model Configuration#
# Model configuration
llm=Qwen/Qwen2.5-VL-7B-Instruct
Dataset Configuration#
# Dataset configuration
vln_datasets=r2r_125cm_0_30,r2r_125cm_0_45,r2r_60cm_15_15,r2r_60cm_30_30,rxr_125cm_0_30,rxr_125cm_0_45,rxr_60cm_15_15,rxr_60cm_30_30
# Naming convention: dataset_height_pitch1_pitch2
# - **125cm / 60cm**: agent height
# - **0_30**: agent pitch starts at a 0° elevation shift and moves to 30° when ⬇️ is output
# - **15_15**: agent pitch starts at a 15° elevation shift and stays at 15° when ⬇️ is output
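# Example: r2r_125cm_0_30 = R2R data, 125 cm agent height, 0° pitch before ⬇️ is output and 30° after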
Training Hyperparameters#
# Training hyperparameters
lr=2e-5 # Global learning rate
vision_tower_lr=5e-6 # Vision encoder learning rate (lower than the LLM learning rate)
batch_size=2 # Per-GPU batch size
grad_accum_steps=1 # Gradient accumulation steps
# Virtual batch size = batch_size × grad_accum_steps × num_gpus
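# e.g. with batch_size=2, grad_accum_steps=1, and 8 GPUs: 2 × 1 × 8 = 16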
max_pixels=313600 # Maximum image pixels for processing
min_pixels=3136 # Minimum image pixels
Training Architecture Parameters#
# Model architecture tuning flags
tune_mm_vision=True # Fine-tune multimodal vision encoder
tune_mm_mlp=True # Fine-tune multimodal MLP adapter
tune_mm_llm=True # Fine-tune language model components
# Data augmentation and temporal processing
data_augmentation=True # Apply data augmentation
num_history=8 # Number of historical observations (frames)
sample_step=4 # Frame sampling rate (every 4th frame)
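# e.g. num_history=8 with sample_step=4 spans roughly the last 32 frames of observation history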
num_future_steps=4 # Number of future steps to predict
3. Start Joint Training of System 2 and System 1#
# training system1 based on system2
sbatch ./scripts/train/base_train/qwenvl_train/train_dual_system.sh
Pretrained Model Configuration#
# Model configuration
system2_ckpt=checkpoints/InternVLA-N1-System2
Dataset Configuration#
# Dataset configuration
vln_datasets=r2r_125cm_0_30%30,r2r_60cm_15_15%30,rxr_125cm_0_30%30,rxr_60cm_15_15%30,scalevln_125cm_0_30%30,scalevln_60cm_30_30%30
# %30 means using 30% of the data from each dataset
Training Architecture Parameters#
# Freeze System 2 weights during joint training
tune_mm_vision=False
tune_mm_mlp=False
tune_mm_llm=False
# Planning and action configuration
predict_step_num=32 # Number of predicted waypoints
pixel_goal_only=True # Turn and stop actions are not required at this stage
# System 1 backend selection
system1=${system1} # Supported options: nextdit_async, nextdit, navdp_async
Baselines#
Create a Trainer#
The Trainer manages the training loop, including data loading, forward pass, loss calculation, and backpropagation.
A custom trainer usually inherits from the Base Trainer and implements:
- train_epoch(): Runs one training epoch (batch iteration, forward pass, loss calculation, parameter update).
- eval_epoch(): Evaluates the model on the validation set and records metrics.
- save_checkpoint(): Saves model weights, optimizer state, and training progress.
- load_checkpoint(): Loads pretrained models or resumes training.
Example: CMATrainer shows how to handle sequence data, compute action loss, and implement imitation learning.
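A minimal sketch of that shape is shown below; the base class it would inherit from, the method signatures, and the batch keys are assumptions for illustration only, so consult CMATrainer in the InterNav source for the actual interface.

import torch

class MyPolicyTrainer:  # in InterNav this would inherit from the framework's Base Trainer
    def __init__(self, model, optimizer, train_loader, val_loader, device="cuda"):
        self.model = model.to(device)
        self.optimizer = optimizer
        self.train_loader = train_loader
        self.val_loader = val_loader
        self.device = device

    def train_epoch(self, epoch):
        """One pass over the training set: forward, loss, backprop, update."""
        self.model.train()
        for batch in self.train_loader:
            obs = batch["observations"].to(self.device)        # batch keys are placeholders
            expert_actions = batch["actions"].to(self.device)
            logits = self.model(obs)                           # forward pass
            loss = torch.nn.functional.cross_entropy(logits, expert_actions)  # imitation (action) loss
            self.optimizer.zero_grad()
            loss.backward()                                    # backpropagation
            self.optimizer.step()                              # parameter update

    @torch.no_grad()
    def eval_epoch(self, epoch):
        """Run the model on the validation set and record metrics."""
        self.model.eval()
        # iterate self.val_loader and accumulate validation metrics here

    def save_checkpoint(self, path):
        torch.save({"model": self.model.state_dict(),
                    "optimizer": self.optimizer.state_dict()}, path)

    def load_checkpoint(self, path):
        state = torch.load(path, map_location=self.device)
        self.model.load_state_dict(state["model"])
        self.optimizer.load_state_dict(state["optimizer"])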
Training Data#
The training data is under data/vln_pe/traj_data. Our dataset provides trajectory data collected from the H1 robot as it navigates through the task environment.
Each observation in the trajectory is paired with its corresponding action.
You may also incorporate external datasets to improve model generalization.
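As a purely hypothetical illustration of that observation/action pairing (the actual on-disk format under data/vln_pe/traj_data may differ):

trajectory = [  # placeholder fields, not the real traj_data schema
    {"rgb": "frame_0000.png", "depth": "depth_0000.png", "action": "move_forward"},
    {"rgb": "frame_0001.png", "depth": "depth_0001.png", "action": "turn_left"},
    {"rgb": "frame_0002.png", "depth": "depth_0002.png", "action": "stop"},
]
for step in trajectory:
    observation = {k: v for k, v in step.items() if k != "action"}
    action = step["action"]  # supervision target for imitation learning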
Set the Corresponding Configuration#
Refer to existing training configuration files for customization:
CMA Model Config: cma_exp_cfg
Configuration files should define:
- ExpCfg (experiment config)
- EvalCfg (evaluation config)
- IlCfg (imitation learning config)
Ensure your configuration is imported and registered in __init__.py.
Key parameters include:
- name: Experiment name
- model_name: Must match the name used during model registration
- batch_size: Batch size
- lr: Learning rate
- epochs: Number of training epochs
- dataset_*_root_dir: Dataset paths
- lmdb_features_dir: Feature storage path
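A purely illustrative sketch of such a config module follows; the class names come from the list above, but every field, default value, and the registration step are assumptions rather than the framework's actual definitions (see cma_exp_cfg for those).

from dataclasses import dataclass, field

@dataclass
class IlCfg:                      # imitation learning settings
    lr: float = 1e-4
    epochs: int = 50
    batch_size: int = 16
    lmdb_features_dir: str = "data/vln_pe/lmdb_features"

@dataclass
class EvalCfg:                    # evaluation settings
    split: str = "val_unseen"

@dataclass
class ExpCfg:                     # top-level experiment settings
    name: str = "my_cma_baseline"
    model_name: str = "CMA"       # must match the name used at model registration
    dataset_train_root_dir: str = "data/vln_pe/traj_data"
    il: IlCfg = field(default_factory=IlCfg)
    eval: EvalCfg = field(default_factory=EvalCfg)

# Import and register the new config in the package's __init__.py so the framework
# can look it up by name.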