CPU memory commit ratio (%commit) keeps increasing, which causes fine-tuning to crash

Open yanan1116 opened this issue 1 month ago • 0 comments

Hi authors,

I am following the official fine-tuning script, running the command `CUDA_VISIBLE_DEVICES=1 uv run python scripts/train.py pi05_robomimic_lift --exp-name=exp_pi05_robomimic_lift --overwrite` with my own config:


    TrainConfig(
        name= "pi05_robomimic_lift", 
        # name = "pi05_genesis",
        # name = "pi05_libero",
        model=pi0_fast.Pi0FASTConfig(
            action_dim=7, action_horizon=10, max_token_len=180, paligemma_variant="gemma_2b_lora"
        ),
        data=LeRobotLiberoDataConfig(     
            repo_id="yananchen/robomimic_lift", 
            # repo_id= 'kaveh-kamali/genesis_absolute_EE_multi_start',
            # repo_id="physical-intelligence/libero", 
            base_config=DataConfig(prompt_from_task=True),
            extra_delta_transform=True,
        ),
        weight_loader=weight_loaders.CheckpointWeightLoader("gs://openpi-assets/checkpoints/pi05_base/params"),
        num_train_steps=500_000,
        # Again, make sure to match the model config above when extracting the freeze filter
        # that specifies which parameters should be frozen during LoRA finetuning.
        freeze_filter=pi0_fast.Pi0FASTConfig(
            action_dim=7, action_horizon=10, max_token_len=180, paligemma_variant="gemma_2b_lora"
        ).get_freeze_filter(),
        # Turn off EMA for LoRA finetuning.
        ema_decay=0.999,
        wandb_enabled=False,
        batch_size=8, 
        optimizer=_optimizer.AdamW(clip_gradient_norm=1.0),
        lr_schedule=_optimizer.CosineDecaySchedule(
            warmup_steps=10_000,
            peak_lr=5e-5,
            decay_steps=1_000_000,
            decay_lr=5e-5,
        ),
    ),

However, according to the sysstat log, the CPU memory %commit keeps increasing until the fine-tuning process is eventually shut down, as you can see in the screenshot below (time: 10:30:00): Image

Checkpoint saving was interrupted on Nov 2 at 10:33: Image
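
For anyone who wants to reproduce the measurement without sysstat, here is a minimal sketch. It assumes that sar's %commit corresponds to `Committed_AS` divided by RAM plus swap, as read from `/proc/meminfo` on Linux. The helper names `parse_meminfo`, `commit_percent`, and `watch` are hypothetical and are not part of openpi or sysstat.

```python
import time


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; the unit suffix is ignored here.
    """
    fields: dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def commit_percent(fields: dict[str, int]) -> float:
    """Committed_AS as a percentage of RAM + swap.

    Assumption: this is the same ratio that sar reports as %commit.
    """
    total = fields["MemTotal"] + fields.get("SwapTotal", 0)
    return 100.0 * fields["Committed_AS"] / total


def watch(samples: int, interval_s: float = 60.0) -> None:
    """Print a timestamped %commit reading `samples` times."""
    for i in range(samples):
        with open("/proc/meminfo") as f:
            fields = parse_meminfo(f.read())
        stamp = time.strftime("%H:%M:%S")
        print(f"{stamp} %commit={commit_percent(fields):.1f}", flush=True)
        if i + 1 < samples:
            time.sleep(interval_s)
```

Running something like `watch(samples=120, interval_s=60)` in a separate terminal during training would give a log that can be compared against the time of the crash.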

Related issue: https://github.com/Physical-Intelligence/openpi/issues/721

Any hints?

Thanks.

yanan1116 · Nov 03 '25 02:11