# Training

This tutorial provides a detailed guide for training both the System 1 (navdp) and System 2 (rdp) policy models within the InternNav framework.

---

## System 1: Navdp

This section provides a detailed guide for training the navdp policy model within the InternNav framework. It covers the **training workflow**, **configuration and parameters**, **command-line usage**, and **troubleshooting**.

---

### Overview of the Training Process

The navdp training process in InternNav includes the following steps:

1. **Model Initialization**: Load the navdp configuration and initialize the model structure and parameters.
2. **Dataset Loading**: Configure dataset paths and preprocessing, and build the DataLoader.
3. **Training Parameter Setup**: Set the batch size, learning rate, optimizer, and other hyperparameters.
4. **Distributed Training Environment Initialization**: Multi-GPU training is supported out of the box.
5. **Training Execution**: Run the main training loop, with automatic checkpointing and logging.

---

### Quick Start

#### 1. Environment Preparation

Ensure you have installed InternNav and its dependencies, and have access to a multi-GPU environment.

#### 2. Configuration Check

The navdp training configuration file is located at:

```bash
InternNav/scripts/train/configs/navdp.py
```

You can modify parameters such as `batch_size`, `epochs`, and the dataset path as needed.

#### 3. Start Training

Use the provided shell script for one-click startup:

```bash
cd InternNav/scripts/train
bash start_train.sh --name <name> --model navdp
```

- `<name>`: Custom name for this experiment (e.g., `20250723_navdp_train_debug`).

This script automatically allocates 8 GPUs and uses torchrun to launch distributed training.

##### Core Command in the Script

```bash
torchrun \
  --nproc_per_node=8 \
  --nnodes=1 \
  --node_rank=0 \
  --master_addr=localhost \
  --master_port=29500 \
  scripts/train/train.py \
  --name "$NAME" \
  --model-name "$MODEL"
```

For a general picture of what each torchrun-launched worker does (process-group initialization, device binding, model wrapping), see the first sketch in the appendix at the end of this page.

---

### Training Parameters and Configuration

The main training parameters for navdp are set in `scripts/train/configs/navdp.py`. Common parameters include:

| Parameter | Description | Example |
|-------------------|---------------------------|---------|
| `epochs` | Number of training epochs | 1000 |
| `batch_size` | Batch size per GPU | 16 |
| `lr` | Learning rate | 1e-4 |
| `num_workers` | DataLoader workers | 8 |
| `dataset_navdp` | Dataset JSON path | /path/to/multiview_dataset.json |
| `image_size` | Input image size | 224 |
| `memory_size` | Number of history frames | 8 |
| `predict_size` | Prediction steps | 24 |
| `temporal_depth` | Transformer layers | 16 |
| `token_dim` | Feature dimension | 384 |
| `dropout` | Dropout probability | 0.1 |
| `finetune` | Whether to finetune the backbone | True |

For more parameters, see the comments in the configuration file. A quick-reference sketch of these parameters is also included in the appendix at the end of this page.

---

### Logging and Model Saving

- Logs, TensorBoard files, and checkpoints are saved by default under `data/checkpoints/<name>/`.
- TensorBoard is supported for visualizing the training process.

---

### Troubleshooting

- **Multi-GPU training error**: Check that `CUDA_VISIBLE_DEVICES` matches the actual number of GPUs.
- **Dataset path error**: Ensure the JSON file at `dataset_navdp` exists and is correctly formatted.
- **Out of memory**: Try reducing `batch_size` or `image_size`.

---

For customizing the model structure or dataset format, see [model.md](./model.md) and [dataset.md](./dataset.md).

---

## System 2: InternVLA-N1-S2

*TODO*
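---

## Appendix: Illustrative Sketches for System 1 (navdp)

The two sketches below are referenced from the System 1 sections above. They are illustrative only and are not taken from the InternNav codebase.

**Distributed launch, illustrated.** The `torchrun` command in the Quick Start section spawns one worker process per GPU. The following minimal PyTorch sketch shows what such a worker typically does before entering its training loop; the model, dimensions, and structure here are placeholders, not InternNav's actual `train.py`.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every worker it spawns.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder network standing in for the navdp policy model.
    model = torch.nn.Linear(384, 24).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # ... build the DataLoader (with a DistributedSampler), optimizer, and
    # training loop here; typically only rank 0 writes checkpoints and logs.

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```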
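**Configuration parameters as a quick-reference sketch.** The dataclass below simply restates the parameter table from the Training Parameters section. The field names and example values follow that table, but the actual structure of `scripts/train/configs/navdp.py` may differ, so treat this as a reference for what each value controls rather than a drop-in replacement.

```python
from dataclasses import dataclass


@dataclass
class NavdpTrainConfig:
    # Optimization
    epochs: int = 1000        # number of training epochs
    batch_size: int = 16      # batch size per GPU
    lr: float = 1e-4          # learning rate
    num_workers: int = 8      # DataLoader workers per process

    # Data
    dataset_navdp: str = "/path/to/multiview_dataset.json"  # dataset JSON path
    image_size: int = 224     # input image size

    # Model
    memory_size: int = 8      # number of history frames
    predict_size: int = 24    # prediction steps
    temporal_depth: int = 16  # transformer layers
    token_dim: int = 384      # feature dimension
    dropout: float = 0.1      # dropout probability
    finetune: bool = True     # whether to finetune the backbone


if __name__ == "__main__":
    # Example: lower batch_size and image_size when hitting out-of-memory errors.
    cfg = NavdpTrainConfig(batch_size=8, image_size=160)
    print(cfg)
```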