Official PyTorch Implementation of Unified Video Action Model (RSS 2025)
Unified Video Action Model
[Project page] [Paper] [Colab (PushT)]
Shuang Li, Yihuai Gao, Dorsa Sadigh, Shuran Song
Stanford University
🛝 Try UVA on Colab
We provide a Colab notebook for UVA on PushT using the pretrained checkpoint.
🛠️ Installation
Install the conda environment:
$ mamba env create -f conda_environment.yaml
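After creating the environment, activate it before running any of the commands below. A minimal example, assuming the environment is named uva in conda_environment.yaml (substitute the actual name from the yaml file if it differs):
# Assumed environment name; check the name field in conda_environment.yaml
$ mamba activate uva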
Simulation Experiments
Testing
Download the pretrained checkpoints from the following links and put them in the checkpoints/ folder.
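For reference, the layout expected by the evaluation commands below looks like this (filenames taken from those commands):
mkdir -p checkpoints
# after downloading, the folder should contain:
#   checkpoints/pusht.ckpt
#   checkpoints/pusht_multitask.ckpt
#   checkpoints/libero10.ckpt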
CUDA_VISIBLE_DEVICES=0 python eval_sim.py --checkpoint checkpoints/pusht.ckpt --output_dir checkpoints/pusht
CUDA_VISIBLE_DEVICES=0 python eval_sim.py --checkpoint checkpoints/pusht_multitask.ckpt --output_dir checkpoints/pusht_multitask
CUDA_VISIBLE_DEVICES=0 python eval_sim.py --checkpoint checkpoints/libero10.ckpt --output_dir checkpoints/libero10
Training
Download Pretrained Models
We start from a pretrained VAE model and a pretrained image generation model MAR. Run the following command to download the pretrained models.
python unified_video_action/utils/download.py
Train Video Generation Model
We found that two-stage training works better than training on both video and action tasks simultaneously. In the first stage, the model is trained on the video generation task, and in the second stage, it is fine-tuned on both video and action tasks.
To train the UVA model for the video generation task, we set predict_action=False and selected_training_mode=video_model. We did not incorporate additional video data during training. We believe that pretraining the model on large-scale web video datasets could substantially improve its generalization capabilities, and we plan to explore this approach in future work.
UVA's performance may currently be constrained by the model size. To evaluate it on larger or more complex real-world tasks, please consider using a larger UVA model.
Training the joint video and action model takes longer than training a policy-only model. We recommend using at least 4 GPUs for training. To train the UVA model on the PushT dataset, run the following command:
accelerate launch --num_processes=8 train.py \
--config-dir=. \
--config-name=uva_pusht.yaml \
model.policy.action_model_params.predict_action=False \
model.policy.selected_training_mode=video_model \
model.policy.optimizer.learning_rate=1e-4 \
logging.project=uva \
hydra.run.dir="checkpoints/uva_pusht_video_model"
Train Joint Video and Action Model
To train the UVA model on the joint video and action tasks, we set predict_action=True and remove selected_training_mode=video_model.
To train the UVA model jointly on video and action prediction for the PushT dataset, run the following command:
accelerate launch --num_processes=8 train.py \
--config-dir=. \
--config-name=uva_pusht.yaml \
model.policy.autoregressive_model_params.pretrained_model_path=checkpoints/uva_pusht_video_model/checkpoints/latest.ckpt \
model.policy.action_model_params.predict_action=True \
model.policy.optimizer.learning_rate=1e-4 \
logging.project=uva \
hydra.run.dir="uva_pusht_video_act_model"
Real Robot Experiments
Be careful when conducting real robot experiments. The robot moves quickly and can be dangerous.
Testing
Download the pretrained checkpoints from the following links and put them in the checkpoints/ folder.
- Checkpoint trained on UMI Multitask. This checkpoint is trained on 500 samples from each of the three datasets: Cup, Towel, and Mouse.
ARX X5 Robot Setup
Please follow the instructions in arx5-sdk to set up the ARX X5 robot controller. Other robot arm models can be used by modifying the arguments when running the controller.
To set up the UMI-related hardware (camera, gripper, etc.), please refer to the UMI-on-Legs codebase and check out the 3D printing and assembly instructions.
UVA Deployment
We recommend first deploying the umi-arx codebase to test the hardware setup. For UVA deployment, please check out the uva branch, which includes updates with additional safety checks.
Instead of running detached_policy_inference.py in the UMI codebase, run sh scripts/eval/eval_real.sh to serve the UVA model. You can modify the parameters in eval_real.sh to use different checkpoints and TCP ports. The rest of the deployment process is the same as in the original UMI codebase.
Training
Train Video Generation Model
To train the video generation model on the UMI multi-task dataset, run the following command:
accelerate launch --num_processes=8 train.py \
--config-dir=. \
--config-name=uva_umi_multi.yaml \
model.policy.action_model_params.predict_action=False \
model.policy.selected_training_mode=video_model \
model.policy.different_history_freq=True \
model.policy.optimizer.learning_rate=1e-4 \
task.dataset.dataset_root_dir=${dataset_path} \
logging.project=uva \
hydra.run.dir="checkpoints/uva_umi_multitask_video"
For all real-world experiments, we set different_history_freq=True to use distinct history frequencies during training. Since the control frequency on the real robot may differ from the data frequency in the collected dataset, training with varied history frequencies helps the model perform better during testing.
Train Joint Video and Action Model
To train the UVA model on the UMI multi-task dataset, run the following command:
accelerate launch --num_processes=8 train.py \
--config-dir=. \
--config-name=uva_umi_multi.yaml \
model.policy.autoregressive_model_params.pretrained_model_path=checkpoints/uva_umi_multitask_video/checkpoints/latest.ckpt \
model.policy.action_model_params.predict_action=True \
model.policy.use_proprioception=True \
model.policy.predict_proprioception=True \
model.policy.shift_action=False \
model.policy.different_history_freq=True \
model.policy.optimizer.learning_rate=1e-4 \
task.dataset.dataset_root_dir=${dataset_path} \
task.dataset.used_episode_indices_file=${indices_file} \
logging.project=uva \
hydra.run.dir="uva_umi_multitask_video_action"
Dataset
All datasets come from prior public releases except PushT-M, which we collected ourselves: we extend the PushT task by incorporating various target “T” positions and collected a new dataset containing 247 demonstrations. Download the datasets and put them in the data folder.
Simulation Datasets
- PushT from Diffusion Policy.
- PushT-M from us. Download the file, extract its contents, and place them in the data folder.
- Libero10 from LIBERO. We replayed the data to extract the absolute actions and appended language tokens from CLIP using AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32"). Download both the original hdf5 file and the converted dataset, then extract their contents and place them in the data folder.
- Toolhang from Diffusion Policy. We use the file ph/image_abs.hdf5; place it at data/tool_hang/ph/image_abs.hdf5.
Real-World Datasets
- UMI CUP Arrangement from UMI.
- UMI Towel Folding from Data Scaling Laws in Imitation Learning for Robotic Manipulation.
- UMI Mouse Arrangement from Data Scaling Laws in Imitation Learning for Robotic Manipulation.
- More UMI Datasets for large-scale training. Please run process_dataset/download_dataset.py to download and process the datasets.
UMI Multi-Task Dataset Processing
We modified the UMI dataloader to support multiple UMI datasets. We also optimized memory usage and data loading speed, especially when running on a SLURM system for large-scale training.
The pipeline for processing the dataset is as follows (see process_dataset/download_dataset.py for more details; a sketch of the steps for a single dataset follows the list):
- Download the dataset (.zarr.zip format) from the corresponding URLs. You can comment out the lines you don't need.
- Copy the dataset into shared memory (/dev/shm) and decompress it into a .zarr folder. The script processes all the selected datasets in parallel, so please make sure the server has enough available memory (at least 500GB). If not, you can run the process_dataset function (in download_dataset.py) inside a for loop.
- Compress the dataset using lz4 for faster compression and decompression speed, then copy the .zarr.tar.lz4 files back to your data_dir.
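For concreteness, here is a hedged sketch of these three steps done manually for a single dataset; the dataset name example_umi and the destination data directory are illustrative, and process_dataset/download_dataset.py automates all of this for the selected datasets:
# copy the downloaded .zarr.zip into shared memory
cp data/example_umi.zarr.zip /dev/shm/
# decompress it into a .zarr folder inside /dev/shm
unzip -q /dev/shm/example_umi.zarr.zip -d /dev/shm/example_umi.zarr
# re-compress with lz4 and copy the archive back to the data directory
tar -C /dev/shm -cf - example_umi.zarr | lz4 > data/example_umi.zarr.tar.lz4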
During training, you can run process_dataset/extract_umi_data.py to extract multiple datasets into your shared memory (/dev/shm) or a local disk on a SLURM system. When loading data batches, the dataloader unified_video_action/dataset/umi_multi_dataset.py randomly chooses a UMI dataset and fetches the data from shared memory in a "lazy" manner, i.e., it only copies the data into program memory when needed and releases it afterwards. Therefore, during training there is no duplicated data in memory even if you are training on multiple GPUs.
Note that we do not use mirrors in the deployment setup. Therefore, we mask out the mirror regions in every dataset collected with a gripper that has mirrors. You can modify the mask_mirror option in umi_multi.yaml to set this individually for each dataset.
For multi-node training, please refer to scripts/training/train_uva_umi_multi_node.sh if you are using SLURM.
🩹 Add Your Own Task
To add your own task, you need to implement a dataset, an environment runner, and a task configuration file. For guidance, please refer to the following examples from existing tasks:
- unified_video_action/config/task/umi_multi.yaml
- unified_video_action/dataset/umi_multi_dataset.py
Make sure that shape_meta corresponds to the input and output shapes of your task, and that env_runner._target_ and dataset._target_ point to the new classes you have added. When training, add task=<your_task_name> to train.py's arguments, as shown below.
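A hedged example of the resulting training command, assuming a new task named my_task with a config at unified_video_action/config/task/my_task.yaml; reusing uva_pusht.yaml as the top-level config is an assumption, and the remaining overrides follow the simulation training commands above:
accelerate launch --num_processes=8 train.py \
--config-dir=. \
--config-name=uva_pusht.yaml \
task=my_task \
model.policy.optimizer.learning_rate=1e-4 \
logging.project=uva \
hydra.run.dir="checkpoints/uva_my_task"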
🩹 Add Your Own Model
To add your own model, you need to implement a configuration file, a workspace, and a policy file. For guidance, please refer to the following examples from existing models:
- unified_video_action/config/model/uva.yaml
- unified_video_action/workspace/train_unified_video_action_workspace.py
- unified_video_action/policy/unified_video_action_policy.py
🙋 Questions & Answers
Are there any tips for training UVA?
We found that two-stage training works better than training on both video and action tasks simultaneously. In the first stage, the model is trained on video generation, and in the second stage, it is fine-tuned on both video and action tasks.
How long does it take to train UVA?
Training time depends on both the size of the dataset and the complexity of the task. For the UMI task, we sampled 500 trajectories from each of the three datasets and trained the model using 8 H100 GPUs. The video generation model was trained for 2 days, and the joint video and action model required an additional 2 days.
What's the next step for UVA?
We believe there is still significant potential in UVA that remains unexplored, and we leave this for future work.
Additional video data: UVA can leverage large amounts of actionless video data, which could provide valuable additional supervision. We plan to pretrain UVA on additional video data in the future.
Multi-modality: UVA can be naturally extended to predict modalities beyond video and action, such as sound and force, by incorporating additional diffusion heads, offering a more comprehensive and versatile framework.
Better architecture: The model architecture could be further improved, for example by replacing the diffusion heads with flow matching.
Larger model size: UVA's performance may currently be limited by the model size. We plan to explore larger models in the future.
🏷️ License
This repository is provided under the MIT license. For more details, please refer to LICENSE.
🙏 Acknowledgement
- Much of the code is inherited from Diffusion Policy and MAR.
- For real-world UMI experiments, we use the public datasets collected by UMI and Data Scaling Laws in Imitation Learning for Robotic Manipulation.