CPU memory (%commit) keeps increasing during fine-tuning and eventually crashes the run
Hi authors,
I am following the official fine-tuning script, running the command `CUDA_VISIBLE_DEVICES=1 uv run python scripts/train.py pi05_robomimic_lift --exp-name=exp_pi05_robomimic_lift --overwrite` with my own config:
```python
TrainConfig(
    name="pi05_robomimic_lift",
    # name="pi05_genesis",
    # name="pi05_libero",
    model=pi0_fast.Pi0FASTConfig(
        action_dim=7, action_horizon=10, max_token_len=180, paligemma_variant="gemma_2b_lora"
    ),
    data=LeRobotLiberoDataConfig(
        repo_id="yananchen/robomimic_lift",
        # repo_id="kaveh-kamali/genesis_absolute_EE_multi_start",
        # repo_id="physical-intelligence/libero",
        base_config=DataConfig(prompt_from_task=True),
        extra_delta_transform=True,
    ),
    weight_loader=weight_loaders.CheckpointWeightLoader("gs://openpi-assets/checkpoints/pi05_base/params"),
    num_train_steps=500_000,
    # Again, make sure to match the model config above when extracting the freeze filter
    # that specifies which parameters should be frozen during LoRA finetuning.
    freeze_filter=pi0_fast.Pi0FASTConfig(
        action_dim=7, action_horizon=10, max_token_len=180, paligemma_variant="gemma_2b_lora"
    ).get_freeze_filter(),
    # Note: the reference LoRA configs turn EMA off with ema_decay=None;
    # 0.999 keeps an EMA copy of the weights.
    ema_decay=0.999,
    wandb_enabled=False,
    batch_size=8,
    optimizer=_optimizer.AdamW(clip_gradient_norm=1.0),
    lr_schedule=_optimizer.CosineDecaySchedule(
        warmup_steps=10_000,
        peak_lr=5e-5,
        decay_steps=1_000_000,
        decay_lr=5e-5,
    ),
)
```
But according to the sysstat log, the committed CPU memory (%commit) keeps increasing, and eventually the fine-tuning process shuts down, as you can see in the screenshot below (around 10:30:00):

[screenshot: sar memory report with %commit climbing steadily]

Checkpoint saving was interrupted on Nov 2 at 10:33:

[screenshot: checkpoint saving interrupted]
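For anyone who wants to reproduce the observation without sysstat, here is a minimal standalone monitor sketch (my own, not part of openpi) that polls `/proc/meminfo` on the Linux host and computes %commit the same way sar does, i.e. `Committed_AS` over `MemTotal + SwapTotal`:

```python
import time

def read_meminfo():
    # Parse /proc/meminfo into a {field_name: value_in_kB} dict.
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.strip().split()[0])
    return info

while True:
    m = read_meminfo()
    # sar's %commit is Committed_AS relative to total RAM + swap.
    pct = 100.0 * m["Committed_AS"] / (m["MemTotal"] + m["SwapTotal"])
    print(f"Committed_AS={m['Committed_AS']} kB  %commit={pct:.1f}%")
    time.sleep(60)  # same cadence as a typical `sar -r 60` run
```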
Related issue: https://github.com/Physical-Intelligence/openpi/issues/721
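One thing I plan to try, on the assumption (unconfirmed) that the growth comes from glibc malloc arena fragmentation rather than a true leak, is relaunching the same command with the arena count capped. A sketch of the wrapper:

```python
# Hypothetical workaround sketch: relaunch training with glibc malloc arenas
# capped via the standard MALLOC_ARENA_MAX tunable. Whether this helps with
# the %commit growth here is an assumption, not a confirmed fix.
import os
import subprocess

env = dict(os.environ)
env["CUDA_VISIBLE_DEVICES"] = "1"
env["MALLOC_ARENA_MAX"] = "2"  # limit glibc to two malloc arenas

subprocess.run(
    [
        "uv", "run", "python", "scripts/train.py",
        "pi05_robomimic_lift",
        "--exp-name=exp_pi05_robomimic_lift",
        "--overwrite",
    ],
    env=env,
    check=True,
)
```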
Any hints?
Thanks.