[Question] SmolVLA LIBERO / MetaWorld evaluation
Hello, thank you for open-sourcing this wonderful repository. I read the SmolVLA paper with great interest and tried to run some evaluations.
In Section 4.5 of the paper, under Simulation Evaluation, it seems that you fine-tuned the SmolVLA baseline for the Franka Emika Panda and the Sawyer arm to evaluate on the LIBERO and Meta-World benchmarks, respectively. Could you elaborate on the details of the fine-tuning process (which parameters were trained/frozen, optimizer, number of gradient steps, etc.)? I am planning to reproduce the results.
Thank you.
Hi, I also had a similar question regarding the fine-tuning process for the LIBERO and Meta-World evaluations.
If you're able to share any pretrained weights or evaluation scripts for these benchmarks, that would be greatly appreciated. Any related visualizations or logs would also be very helpful.
Thanks again for your excellent work and for open-sourcing the project!
@tykim0507 @bigchou
Hi, I also tried to evaluate SmolVLA, but I am not sure how to evaluate it on LIBERO and Meta-World. It seems that LeRobot does not directly support evaluation on LIBERO and Meta-World.
If you could share how to evaluate a model trained in the LeRobot framework on LIBERO and Meta-World, it would be greatly appreciated.
Thanks again.
Same problem.
Same problem.
Hi team,
I'm currently working on reproducing results and training models on the LIBERO dataset for evaluation within the LIBERO simulation environment. While I've been able to train a model and evaluate it in the simulator, the observed success rate is extremely low.
Initially, I attempted to use the OpenPI LIBERO dataset. However, I've noticed that the LeRobot dataset version seems to be undergoing changes, and the action definitions appear to differ from what I'm expecting. This inconsistency makes it challenging to achieve satisfactory performance.
It would be immensely helpful if the team could provide an open-source example, similar to the OpenPI approach, specifically demonstrating how to effectively train and evaluate models using the LIBERO dataset and simulation environment. Such an example would be invaluable for the community and for users like myself who are trying to achieve higher success rates.
I plan to upload my current script for reference once I've made further progress, in case it can be helpful to others.
Thank you for considering this request!
For training, I'm using the following command:
python lerobot/scripts/train.py \
--policy.type=smolvla \
--dataset.repo_id=~/.cache/huggingface/hub/datasets--aopolin-lv--libero_spatial_no_noops_lerobot_v21/snapshots/bfe55b41cc3103d672ad7204257c9bdc547410cb \
--batch_size=64 \
--steps=200000
I'm specifically using the dataset found at: https://huggingface.co/datasets/aopolin-lv/libero_spatial_no_noops_lerobot_v21
This dataset is in LeRobot dataset format v2.1; the OpenPI LIBERO dataset (which is v2.0) is not compatible with the current setup.
For evaluation, I've written a simple script, eval_LIBERO.py, inspired by the OpenPI examples. I run it with the following command:
python lerobot/scripts/eval_LIBERO.py --policy_path=outputs/train/2025-06-30/21-19-58_smolvla/checkpoints/last/pretrained_model
Both training and evaluation are conducted on the libero_spatial task. Despite this, the success rate I'm observing is very low. I suspect there might be a mistake in my setup, but the overall pipeline seems functional, so anyone interested in replicating this can follow these steps. Note that the gripper action appears to be reversed between the dataset and the simulator.
This may be because the dataset was not converted correctly; this link may be helpful: https://github.com/Tavish9/any4lerobot/tree/main/libero2lerobot
- eval_LIBERO.py
"""
This script demonstrates how to evaluate a pretrained smolVLA policy on the LIBERO benchmark.
"""
import collections
import dataclasses
import logging
import math
import pathlib
import cv2
import draccus
import imageio
import numpy as np
import torch
from libero.libero import benchmark, get_libero_path
from libero.libero.envs import OffScreenRenderEnv
from tqdm import tqdm
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy
LIBERO_DUMMY_ACTION = [0.0] * 6 + [-1.0]
LIBERO_ENV_RESOLUTION = 256 # resolution used to render training data
@dataclasses.dataclass
class Args:
"""
Evaluation arguments for smolVLA on LIBERO.
"""
# --- Hugging Face arguments ---
policy_path: str = "lerobot/smolvla_base"
"""Path to the pretrained policy on the Hugging Face Hub or local directory."""
# --- LIBERO environment-specific parameters ---
task_suite_name: str = "libero_spatial"
"""Task suite. Options: libero_spatial, libero_object, libero_goal, libero_10, libero_90"""
num_steps_wait: int = 10
"""Number of steps to wait for objects to stabilize in sim."""
num_trials_per_task: int = 50
"""Number of rollouts per task."""
# --- Evaluation arguments ---
video_out_path: str = "data/libero/videos"
"""Path to save videos."""
device: str = "cuda"
"""Device to use for evaluation."""
seed: int = 7
"""Random Seed (for reproducibility)"""
@draccus.wrap()
def eval_libero(args: Args) -> None:
# Set random seed
torch.manual_seed(args.seed)
np.random.seed(args.seed)
# --- Load Policy ---
policy = SmolVLAPolicy.from_pretrained(args.policy_path)
policy.to(args.device)
policy.eval()
# --- Initialize LIBERO task suite ---
benchmark_dict = benchmark.get_benchmark_dict()
try:
task_suite = benchmark_dict[args.task_suite_name]()
except KeyError:
raise ValueError(
f"Unknown task suite: {args.task_suite_name}. "
f"Available options are: {list(benchmark_dict.keys())}"
)
num_tasks_in_suite = task_suite.n_tasks
logging.info(f"Task suite: {args.task_suite_name}")
pathlib.Path(args.video_out_path).mkdir(parents=True, exist_ok=True)
if args.task_suite_name == "libero_spatial":
max_steps = 220 # longest training demo has 193 steps
elif args.task_suite_name == "libero_object":
max_steps = 280 # longest training demo has 254 steps
elif args.task_suite_name == "libero_goal":
max_steps = 300 # longest training demo has 270 steps
elif args.task_suite_name == "libero_10":
max_steps = 520 # longest training demo has 505 steps
elif args.task_suite_name == "libero_90":
max_steps = 400 # longest training demo has 373 steps
else:
# Fallback for custom task suites
max_steps = 520
# --- Evaluation Loop ---
total_episodes, total_successes = 0, 0
for task_id in tqdm(range(num_tasks_in_suite), desc="Tasks"):
# Get task
task = task_suite.get_task(task_id)
# Get default LIBERO initial states
initial_states = task_suite.get_task_init_states(task_id)
# Initialize LIBERO environment and task description
env, task_description = _get_libero_env(task, LIBERO_ENV_RESOLUTION, args.seed)
# Start episodes
task_episodes, task_successes = 0, 0
for episode_idx in tqdm(
range(min(args.num_trials_per_task, len(initial_states))),
desc=f"Task {task_id}: {task.language}",
leave=False,
):
logging.info(f"\nTask: {task_description}")
# Reset environment and policy
env.reset()
policy.reset()
# Set initial states
obs = env.set_init_state(initial_states[episode_idx])
# IMPORTANT: Do nothing for the first few timesteps because the simulator drops objects
# and we need to wait for them to fall
for _ in range(args.num_steps_wait):
obs, _, _, _ = env.step(LIBERO_DUMMY_ACTION)
# Setup
t = 0
frames = []
done = False
# Add initial frame
agentview_image = np.ascontiguousarray(obs["agentview_image"][::-1, ::-1])
# frames.append(agentview_image)
# import ipdb; ipdb.set_trace()
logging.info(f"Starting episode {task_episodes+1}...")
while t < max_steps:
try:
# Get preprocessed image
# IMPORTANT: rotate 180 degrees to match train preprocessing
wrist_img = np.ascontiguousarray(obs["robot0_eye_in_hand_image"][::-1, ::-1])
agentview_image = np.ascontiguousarray(obs["agentview_image"][::-1, ::-1])
frames.append(agentview_image)
# Prepare observations dict
state = np.concatenate(
(
obs["robot0_eef_pos"],
_quat2axisangle(obs["robot0_eef_quat"]),
obs["robot0_gripper_qpos"],
)
)
observation = {
"observation.images.image": torch.from_numpy(agentview_image / 255.0)
.permute(2, 0, 1)
.to(torch.float32)
.to(args.device).unsqueeze(0),
"observation.images.wrist_image": torch.from_numpy(wrist_img / 255.0)
.permute(2, 0, 1)
.to(torch.float32)
.to(args.device).unsqueeze(0),
"observation.state": torch.from_numpy(state).to(torch.float32).to(args.device).unsqueeze(0),
"task": task_description,
}
# Query model to get action
with torch.inference_mode():
action_tensor = policy.select_action(observation)
action = action_tensor.cpu().numpy()[0]
action[-1] = 1 - action[-1]
# Execute action in environment
obs, _, done, _ = env.step(action)
if done:
task_successes += 1
total_successes += 1
break
t += 1
except Exception as e:
logging.error(f"Caught exception: {e}")
break
task_episodes += 1
total_episodes += 1
# Save a replay video of the episode
suffix = "success" if done else "failure"
task_segment = task_description.replace(" ", "_").replace("/", "_")
video_path = (
pathlib.Path(args.video_out_path) / f"rollout_task_{task_id}_episode_{episode_idx}_{task_segment}_{suffix}.mp4"
)
fps = 30
writer = imageio.get_writer(video_path, fps=fps)
for image in frames:
writer.append_data(image)
writer.close()
logging.info(f"Saved video to {video_path}")
# Log current results
logging.info(f"Success: {done}")
if total_episodes > 0:
logging.info(f"# episodes completed so far: {total_episodes}")
logging.info(f"# successes: {total_successes} ({total_successes / total_episodes * 100:.1f}%)")
# Log final results for the task
if task_episodes > 0:
logging.info(f"Task {task_id} success rate: {float(task_successes) / float(task_episodes):.2f}")
if total_episodes > 0:
logging.info(f"Cumulative success rate: {float(total_successes) / float(total_episodes):.2f}")
logging.info("--- Evaluation finished ---")
if total_episodes > 0:
logging.info(f"Total success rate: {float(total_successes) / float(total_episodes):.2f}")
logging.info(f"Total episodes: {total_episodes}")
logging.info(f"Total successes: {total_successes}")
cv2.destroyAllWindows()
def _get_libero_env(task, resolution, seed):
"""Initializes and returns the LIBERO environment, along with the task description."""
task_description = task.language
task_bddl_file = pathlib.Path(get_libero_path("bddl_files")) / task.problem_folder / task.bddl_file
env_args = {
"bddl_file_name": str(task_bddl_file),
"camera_heights": resolution,
"camera_widths": resolution,
}
env = OffScreenRenderEnv(**env_args)
env.seed(seed) # IMPORTANT: seed seems to affect object positions even when using fixed initial state
return env, task_description
def _quat2axisangle(quat):
"""
Copied from robosuite:
https://github.com/ARISE-Initiative/robosuite/blob/eafb81f54ffc104f905ee48a16bb15f059176ad3/robosuite/utils/transform_utils.py#L490C1-L512C55
"""
# clip quaternion
if quat[3] > 1.0:
quat[3] = 1.0
elif quat[3] < -1.0:
quat[3] = -1.0
den = np.sqrt(1.0 - quat[3] * quat[3])
if math.isclose(den, 0.0):
# This is (close to) a zero degree rotation, immediately return
return np.zeros(3)
return (quat[:3] * 2.0 * math.acos(quat[3])) / den
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO)
eval_libero()
- some success result
https://github.com/user-attachments/assets/9ba1c8b0-b6af-4013-8df3-94a0e9db36cb
@zlw21gxy Thanks for sharing the evaluation code. I want to try the pretrained weights zero-shot, but I got the error below. How can I obtain the stats?
INFO:root:# episodes completed so far: 10
Task: pick up the black bowl between the plate and the ramekin and place it on the plate
ERROR:root:Caught exception: `mean` is infinity. You should either initialize with `stats` as an argument, or use a pretrained model.
You may want to check whether the camera names you pass to the model align with those in the dataset.
@zijian0615 Thanks for replying. What do you mean by the camera name? I know that policy = SmolVLAPolicy.from_pretrained('lerobot/smolvla_base') loads the model; what I need is the dataset_stats, but how do I get them when using a pretrained model?
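For reference, here is a minimal sketch of one way to obtain the dataset statistics: read them from the LIBERO LeRobotDataset used for training. This is an assumption-based sketch rather than an official recipe; in particular, whether the stats can simply be passed to SmolVLAPolicy at construction (via a dataset_stats argument, as other lerobot policies accept) depends on your lerobot version and should be verified.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
# Load the LIBERO dataset only to read its precomputed normalization statistics.
dataset = LeRobotDataset("aopolin-lv/libero_spatial_no_noops_lerobot_v21")
stats = dataset.meta.stats  # dict: feature name -> {"mean": ..., "std": ..., "min": ..., "max": ...}
print(list(stats.keys()))  # e.g. observation.state, action, observation.images.*
These stats are what the `mean` is infinity error is asking for: they need to end up in the policy's normalization buffers (the buffer_observation_state / buffer_action buffers mentioned later in this thread), either at construction time or by overwriting them after loading.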
Hello, I followed your instructions, but when my training reaches around 120,000 steps, the loss stays at 0.015 and won't decrease any further. Moreover, the test success rate remains 0 throughout. Is it necessary to train for the full 200,000 steps to see results? Or could there be a version issue with my pip environment? Could you please share your pip environment, such as the versions of Python libraries like torch and torchvision?
A quick update: following OpenVLA, I modified the evaluation script and achieved approximately a 72% success rate on the libero_spatial environment. I believe this version of the evaluation script is correct. However, the model was trained only on the libero_spatial dataset, so further improvements may require training on a larger dataset.
"""
This script demonstrates how to evaluate a pretrained smolVLA policy on the LIBERO benchmark.
"""
import collections
import dataclasses
import logging
import math
import pathlib
import os
import cv2
import draccus
import imageio
import numpy as np
import torch
from libero.libero import benchmark, get_libero_path
from libero.libero.envs import OffScreenRenderEnv
from tqdm import tqdm
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy
os.environ["TOKENIZERS_PARALLELISM"] = "false"
LIBERO_DUMMY_ACTION = [0.0] * 6 + [-1.0]
LIBERO_ENV_RESOLUTION = 256 # resolution used to render training data
def normalize_gripper_action(action, binarize=True):
"""
Changes gripper action (last dimension of action vector) from [0,1] to [-1,+1].
Necessary for some environments (not Bridge) because the dataset wrapper standardizes gripper actions to [0,1].
Note that unlike the other action dimensions, the gripper action is not normalized to [-1,+1] by default by
the dataset wrapper.
Normalization formula: y = 2 * (x - orig_low) / (orig_high - orig_low) - 1
"""
# Just normalize the last action to [-1,+1].
orig_low, orig_high = 0.0, 1.0
action[..., -1] = 2 * (action[..., -1] - orig_low) / (orig_high - orig_low) - 1
if binarize:
# Binarize to -1 or +1.
action[..., -1] = np.sign(action[..., -1])
return action
def invert_gripper_action(action):
"""
Flips the sign of the gripper action (last dimension of action vector).
This is necessary for some environments where -1 = open, +1 = close, since
the RLDS dataloader aligns gripper actions such that 0 = close, 1 = open.
"""
action[..., -1] = action[..., -1] * -1.0
return action
@dataclasses.dataclass
class Args:
"""
Evaluation arguments for smolVLA on LIBERO.
"""
# --- Hugging Face arguments ---
policy_path: str = "lerobot/smolvla_base"
"""Path to the pretrained policy on the Hugging Face Hub or local directory."""
# --- LIBERO environment-specific parameters ---
task_suite_name: str = "libero_spatial"
"""Task suite. Options: libero_spatial, libero_object, libero_goal, libero_10, libero_90"""
num_steps_wait: int = 10
"""Number of steps to wait for objects to stabilize in sim."""
num_trials_per_task: int = 50
"""Number of rollouts per task."""
# --- Evaluation arguments ---
video_out_path: str = "data/libero/videos"
"""Path to save videos."""
device: str = "cuda"
"""Device to use for evaluation."""
seed: int = 7
"""Random Seed (for reproducibility)"""
@draccus.wrap()
def eval_libero(args: Args) -> None:
# Set random seed
torch.manual_seed(args.seed)
np.random.seed(args.seed)
# --- Load Policy ---
policy = SmolVLAPolicy.from_pretrained(args.policy_path)
policy.to(args.device)
policy.eval()
# --- Initialize LIBERO task suite ---
benchmark_dict = benchmark.get_benchmark_dict()
try:
task_suite = benchmark_dict[args.task_suite_name]()
except KeyError:
raise ValueError(
f"Unknown task suite: {args.task_suite_name}. "
f"Available options are: {list(benchmark_dict.keys())}"
)
num_tasks_in_suite = task_suite.n_tasks
logging.info(f"Task suite: {args.task_suite_name}")
pathlib.Path(args.video_out_path).mkdir(parents=True, exist_ok=True)
if args.task_suite_name == "libero_spatial":
max_steps = 220 # longest training demo has 193 steps
elif args.task_suite_name == "libero_object":
max_steps = 280 # longest training demo has 254 steps
elif args.task_suite_name == "libero_goal":
max_steps = 300 # longest training demo has 270 steps
elif args.task_suite_name == "libero_10":
max_steps = 520 # longest training demo has 505 steps
elif args.task_suite_name == "libero_90":
max_steps = 400 # longest training demo has 373 steps
else:
# Fallback for custom task suites
max_steps = 520
# --- Evaluation Loop ---
total_episodes, total_successes = 0, 0
for task_id in tqdm(range(num_tasks_in_suite), desc="Tasks"):
# Get task
task = task_suite.get_task(task_id)
# Get default LIBERO initial states
initial_states = task_suite.get_task_init_states(task_id)
# Initialize LIBERO environment and task description
env, task_description = _get_libero_env(task, LIBERO_ENV_RESOLUTION, args.seed)
# Start episodes
task_episodes, task_successes = 0, 0
for episode_idx in tqdm(
range(min(args.num_trials_per_task, len(initial_states))),
desc=f"Task {task_id}: {task.language}",
leave=False,
):
logging.info(f"\nTask: {task_description}")
# Reset environment and policy
env.reset()
policy.reset()
# Set initial states
obs = env.set_init_state(initial_states[episode_idx])
# IMPORTANT: Do nothing for the first few timesteps because the simulator drops objects
# and we need to wait for them to fall
for _ in range(args.num_steps_wait):
obs, _, _, _ = env.step(LIBERO_DUMMY_ACTION)
# Setup
t = 0
frames = []
done = False
# Add initial frame
agentview_image = np.ascontiguousarray(obs["agentview_image"][::-1, ::-1])
# frames.append(agentview_image)
# import ipdb; ipdb.set_trace()
logging.info(f"Starting episode {task_episodes+1}...")
while t < max_steps:
try:
# Get preprocessed image
# IMPORTANT: rotate 180 degrees to match train preprocessing
wrist_img = np.ascontiguousarray(obs["robot0_eye_in_hand_image"][::-1, ::-1])
agentview_image = np.ascontiguousarray(obs["agentview_image"][::-1, ::-1])
frames.append(agentview_image)
# Prepare observations dict
state = np.concatenate(
(
obs["robot0_eef_pos"],
_quat2axisangle(obs["robot0_eef_quat"]),
obs["robot0_gripper_qpos"],
)
)
observation = {
"observation.images.image": torch.from_numpy(agentview_image / 255.0)
.permute(2, 0, 1)
.to(torch.float32)
.to(args.device).unsqueeze(0),
"observation.images.wrist_image": torch.from_numpy(wrist_img / 255.0)
.permute(2, 0, 1)
.to(torch.float32)
.to(args.device).unsqueeze(0),
"observation.state": torch.from_numpy(state).to(torch.float32).to(args.device).unsqueeze(0),
"task": task_description,
}
# Query model to get action
with torch.inference_mode():
action_tensor = policy.select_action(observation)
action = action_tensor.cpu().numpy()[0]
# action[-1] = 1 - action[-1]
action = normalize_gripper_action(action, binarize=False)
action = invert_gripper_action(action)
# Execute action in environment
obs, _, done, _ = env.step(action)
if done:
task_successes += 1
total_successes += 1
break
t += 1
except Exception as e:
logging.error(f"Caught exception: {e}")
break
task_episodes += 1
total_episodes += 1
# Save a replay video of the episode
suffix = "success" if done else "failure"
task_segment = task_description.replace(" ", "_").replace("/", "_")
video_path = (
pathlib.Path(args.video_out_path) / f"rollout_task_{task_id}_episode_{episode_idx}_{task_segment}_{suffix}.mp4"
)
fps = 30
writer = imageio.get_writer(video_path, fps=fps)
for image in frames:
writer.append_data(image)
writer.close()
logging.info(f"Saved video to {video_path}")
# import ipdb; ipdb.set_trace()
# Log current results
logging.info(f"Success: {done}")
if total_episodes > 0:
logging.info(f"# episodes completed so far: {total_episodes}")
logging.info(f"# successes: {total_successes} ({total_successes / total_episodes * 100:.1f}%)")
# Log final results for the task
if task_episodes > 0:
logging.info(f"Task {task_id} success rate: {float(task_successes) / float(task_episodes):.2f}")
if total_episodes > 0:
logging.info(f"Cumulative success rate: {float(total_successes) / float(total_episodes):.2f}")
logging.info("--- Evaluation finished ---")
if total_episodes > 0:
logging.info(f"Total success rate: {float(total_successes) / float(total_episodes):.2f}")
logging.info(f"Total episodes: {total_episodes}")
logging.info(f"Total successes: {total_successes}")
# cv2.destroyAllWindows()
def _get_libero_env(task, resolution, seed):
"""Initializes and returns the LIBERO environment, along with the task description."""
task_description = task.language
task_bddl_file = pathlib.Path(get_libero_path("bddl_files")) / task.problem_folder / task.bddl_file
env_args = {
"bddl_file_name": str(task_bddl_file),
"camera_heights": resolution,
"camera_widths": resolution,
}
env = OffScreenRenderEnv(**env_args)
env.seed(seed) # IMPORTANT: seed seems to affect object positions even when using fixed initial state
return env, task_description
def _quat2axisangle(quat):
"""
Copied from robosuite:
https://github.com/ARISE-Initiative/robosuite/blob/eafb81f54ffc104f905ee48a16bb15f059176ad3/robosuite/utils/transform_utils.py#L490C1-L512C55
"""
# clip quaternion
if quat[3] > 1.0:
quat[3] = 1.0
elif quat[3] < -1.0:
quat[3] = -1.0
den = np.sqrt(1.0 - quat[3] * quat[3])
if math.isclose(den, 0.0):
# This is (close to) a zero degree rotation, immediately return
return np.zeros(3)
return (quat[:3] * 2.0 * math.acos(quat[3])) / den
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO)
eval_libero()
I’m using the lerobot repository at commit 483be9aac217c2d8ef16982490f22b2ad091ab46. The versions of some key packages are as follows:
torch 2.7.1
torchaudio 2.7.1
torchcodec 0.4.0
torchvision 0.22.1
I've also tried zlw21gxy's SmolVLA training and evaluation on LIBERO-spatial, thanks for sharing.
I trained with the command below, as provided, with trivial modifications:
python lerobot/scripts/train.py \
--policy.type=smolvla \
--dataset.repo_id=aopolin-lv/libero_spatial_no_noops_lerobot_v21 \
--batch_size=64 \
--steps=200000 \
--policy.device=cuda \
--wandb.enable=true \
--save_freq 10000 \
--output_dir=outputs/trainv5/libero_smolvla_scratch \
--job_name=libero_smolvla_scratch
Training loss stayed near 0.02 from step 25k.
The success rate for LIBERO-spatial was 26%; does this align with your "low success rate"?
The tasks usually fail because the gripper closes prematurely, even though most of the target locations are reached correctly.
Update: After running the updated code, the success rate seems to be in the low 70s. Thanks for the update!
I also ran training with the dataset from https://huggingface.co/datasets/openvla/modified_libero_rlds, using the LeRobot training code; the command is below:
python lerobot/scripts/train.py --policy.path=lerobot/smolvla_base --dataset.repo_id=your_hf_username/libero --dataset.root=/work/.cache/huggingface/lerobot/your_hf_username/libero --batch_size=64 --steps=30000 --policy.push_to_hub=false --wandb.enable=True
I also found that my loss goes down to 0.01, but validation gives a very low success rate due to the same gripper opening/closing issue, even though the position and rotation look right. Below are the demo videos and the loss graph:
https://github.com/user-attachments/assets/a560c699-5803-4bed-b14e-c22d55aeae60 https://github.com/user-attachments/assets/85dbb1e1-3775-457b-99bd-338a4574d21c https://github.com/user-attachments/assets/ed052a15-7515-4a34-a845-1f79adcf246c
Here's an update on my progress: I've been training on the aopolin-lv/libero_object_no_noops_lerobot_v21 dataset. When I fine-tuned from smolvla_base and ran the test script provided by zlw21gxy, the success rate was 66%. However, according to the paper, SmolVLA was trained from scratch on the LIBERO dataset. So I tried doing the same with the command:
python lerobot/scripts/train.py --policy.type=smolvla --policy.vlm_model_name=/home/hdp/smolvla/SmolVLM2-500M-Video-Instruct --dataset.repo_id=aopolin-lv/libero_object_no_noops_lerobot_v21 --dataset.root=/home/hdp/smolvla/libero_object_v21 --batch_size=128 --steps=200000
But the success rate is 0%.
Did I do something wrong? @zlw21gxy @nikriz1
@hahans You can try the following command, as @nikriz1 did, and use the latest evaluation script I updated; I believe it should work. This setup trains the model from scratch:
python lerobot/scripts/train.py \
--policy.type=smolvla \
--dataset.repo_id=aopolin-lv/libero_spatial_no_noops_lerobot_v21 \
--batch_size=64 \
--steps=200000 \
--policy.device=cuda \
--wandb.enable=true \
--save_freq 10000 \
--output_dir=outputs/trainv5/libero_smolvla_scratch \
--job_name=libero_smolvla_scratch
@zlw21gxy Thank you for sharing the scripts. I notice that both you and @nikriz1 train from scratch on the LIBERO-spatial dataset. Can we also fine-tune the smolvla_base model on this dataset and then evaluate? Thanks for your help!
Hi guys, have you tried the Meta-World benchmark? I am reproducing pi0 on Meta-World, but the performance is low...
My guess is that fine-tuning from smolvla_base model might actually hurt performance, especially if the dataset contains cross-embodiment data. It could confuse the model rather than help.
Appreciate any thoughts or clarification on this!
@zlw21gxy Can you share the SmolVLA performance you measured on LIBERO? I got 70~75% on libero_spatial using the pretrained smolvla_base.
I’m using the evaluation code mentioned earlier to directly evaluate the released model lerobot/smolvla_base on the libero_spatial dataset.
In addition, when I tried to follow the official example of finetuning the SmolVLA neural network (with pretrained VLM and the action expert initialized from scratch), I encountered a different error during evaluation on the libero_spatial dataset:
ERROR:root:Caught exception: The size of tensor a (8) must match the size of tensor b (6) at non-singleton dimension 1
May I ask how you evaluated the pretrained smolvla_base model on the libero_spatial benchmark? I’d appreciate any clarification or working configuration details. By the way, following the same setup mentioned earlier, I trained on the aopolin-lv/libero_spatial_no_noops_lerobot_v21 dataset. At 6k training steps, the model achieved a 67% success rate on the libero_spatial benchmark, and at 8k steps, the success rate was 65%. Due to machine interruptions, I was only able to complete up to 8k steps. Just for your reference.
Hi, have you found a solution to the `mean` is infinity issue? I'm facing the same problem when trying to evaluate the pretrained weights directly on the libero_spatial dataset.
@QZepHyr Maybe you should manually change the model config's ['observation.state']['shape'] to 8 and ['action']['shape'] to 7, or skip loading buffer_observation_state and buffer_action in the load_smolvla function.
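A rough illustration of the first suggestion, as a hypothetical sketch only: the exact key layout of the saved config.json can differ between lerobot versions, so the input_features / output_features keys below are assumptions to check against your checkpoint.
import json
import pathlib
# Hypothetical example: edit the saved policy config so the state/action shapes
# match LIBERO (8-dim state, 7-dim action). Verify these key names in your config.json.
cfg_path = pathlib.Path("outputs/train/my_run/checkpoints/last/pretrained_model/config.json")  # your checkpoint
cfg = json.loads(cfg_path.read_text())
cfg["input_features"]["observation.state"]["shape"] = [8]
cfg["output_features"]["action"]["shape"] = [7]
cfg_path.write_text(json.dumps(cfg, indent=2))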
@JustinKai0527 @zlw21gxy Regarding the low performance, I found an important sentence in Section 4.3, Implementation details:
In simulation, we perform inference by sampling new observations and predicting a new action after each executed action.
So the reported performance was obtained by executing a single step per inference.
My final success rate for LIBERO-spatial was 66.8% (maybe I had a bad seed) when trained on LIBERO-spatial from scratch and executing the full action chunk (50), which improved to 82% when executing a single step per inference. I think this could be even higher when applied to your models that already reach success rates in the 7x% range.
@chenkang455 maybe changing this could also improve performance on Meta-World.
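Concretely, single-step execution can be reproduced by shrinking the executed action horizon to 1 so the policy re-plans after every environment step. A minimal sketch, mirroring the snippet shared further down in this thread:
# Execute one action per inference instead of the full 50-step chunk.
policy = SmolVLAPolicy.from_pretrained(args.policy_path)
policy.config.n_action_steps = 1  # re-query the model after every executed action
policy.to(args.device)
policy.eval()
policy.reset()  # clear any previously queued action chunk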
@bairuofei @hahans To train with a dataset combining all four LIBERO task suites, I merged the datasets shared by aopolin-lv and uploaded the result (link).
I'm currently training both from scratch and from smolvla_base, for 100k steps as stated in the paper:
For fine-tuning on simulation benchmarks, we train for 100,000 steps with a batch size of 64.
Results of all trials are as follows; some are still ongoing.
| Training | Steps | Eval Trials | Act H | Spatial | Object | Goal | 10 |
|---|---|---|---|---|---|---|---|
| Paper | 100k | 100 | 1 | 90 | 96 | 92 | 71 |
| Spatial scratch | 200k | 500 | 50 | 66.8 | - | - | - |
| Spatial scratch | 200k | 100 | 1 | 82 | - | - | - |
| All scratch | 100k | 500 | 50 | 66 | 67.2 | 75 | 31 |
| All scratch | 100k | 100 | 1 | 82 | 86 | 83 | 45 |
| All finetune | 100k | 100 | 1 | 86 | 88 | 83 | 46 |
Update: Setting the execution horizon to 1 gave the biggest gains, and fine-tuning from smolvla_base did help a little, but I failed to reproduce the paper's results. Manually going through the evaluation results, there are some cases where the task is solved but is marked as a failure, like the one below.
LIBERO-goal: open the middle drawer of the cabinet
@nikriz1 You're right — it appears that lerobot/smolvla_base was not pretrained using the lerobot/svla_so100_stacking dataset. The svla_so100_stacking dataset provides observations of length 6 and actions of length 6, whereas the libero_spatial benchmark expects observations and actions of lengths 8 and 7, respectively.
By the way, have you encountered the following issue when directly evaluating with lerobot/smolvla_base? If so, how did you resolve it? It seems that the released weights might be missing some components.
INFO:root:# episodes completed so far: 10
Task: pick up the black bowl between the plate and the ramekin and place it on the plate
ERROR:root:Caught exception: `mean` is infinity. You should either initialize with `stats` as an argument, or use a pretrained model.
Sorry I never tried smolvla_base directly without training on LIBERO
@nikriz1
I trained a model from scratch on the dataset nikriz/aopoli-lv-libero_combined_no_noops_lerobot_v21, which should correspond to the following configuration from your table:
| Training | Steps | Eval Trials | Act H | Spatial | Object | Goal | 10 |
|---|---|---|---|---|---|---|---|
| All scratch (community) | 100k | 100 | 1 | 82.0 | 86.0 | 83.0 | 45.0 |
My evaluation result on the Spatial task was 85.0, slightly higher than reported.
However, on the Object task, the model failed to complete any tasks (success rate = 0). A sample video is attached at the end, along with the full training and evaluation setup below. Could you help me understand whether there's anything wrong in my code, or if any of my settings differ from yours?
Training Configuration
python lerobot/src/lerobot/scripts/train.py \
--policy.type=smolvla \
--dataset.repo_id=smol_vla/dataset_lerobot \
--batch_size=64 \
--steps=100000 \
--output_dir=model_finetune_all \
--job_name=my_smolvla_finetuning \
--policy.device=cuda \
--wandb.enable=true \
--policy.push_to_hub=false
Evaluation Setup
# --- Load Policy ---
policy = SmolVLAPolicy.from_pretrained(args.policy_path)
policy.config.n_action_steps = 1
policy.to(args.device)
policy.eval()
@dataclasses.dataclass
class Args:
"""
Evaluation arguments for smolVLA on LIBERO.
"""
# --- Hugging Face arguments ---
policy_path: str = "model_pretraining_all/checkpoints/last/pretrained_model"
"""Path to the pretrained policy on the Hugging Face Hub or local directory."""
# --- LIBERO environment-specific parameters ---
task_suite_name: str = "libero_spatial"
"""Task suite. Options: libero_spatial, libero_object, libero_goal, libero_10, libero_90"""
num_steps_wait: int = 10
"""Number of steps to wait for objects to stabilize in sim."""
num_trials_per_task: int = 100 # 50
"""Number of rollouts per task."""
# --- Evaluation arguments ---
video_out_path: str = "video"
"""Path to save videos."""
device: str = "cuda"
"""Device to use for evaluation."""
seed: int = 7
"""Random Seed (for reproducibility)"""
Video
https://github.com/user-attachments/assets/5f84214f-7b83-44cc-a93a-e2836b391048
@zlw21gxy Thank you very much for your evaluation script. I was trying to run it but encountered a problem; I am not sure if it's an issue with my LIBERO installation. I followed the steps you described so far, including training with the same setup. The error I am encountering is:
raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
(1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
WeightsUnpickler error: Unsupported global: GLOBAL numpy.core.multiarray._reconstruct was not an allowed global by default. Please use `torch.serialization.add_safe_globals([numpy.core.multiarray._reconstruct])` or the `torch.serialization.safe_globals([numpy.core.multiarray._reconstruct])` context manager to allowlist this global if you trust this class/function.
What could this error be about?
Update: I've added the following snippet to the eval script. The code now runs, but my success rate seems to have dropped to around 50% instead of the ~70% described above. Does setting weights_only to False affect performance?
_orig_torch_load = torch.load
def _unsafe_torch_load(f, *args, **kwargs):
return _orig_torch_load(f, *args, weights_only=False, **kwargs)
torch.load = _unsafe_torch_load
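As an alternative to globally patching torch.load, the allowlist route suggested by the PyTorch 2.6 error message itself should also work. And regarding performance: weights_only only controls how the checkpoint pickle is deserialized (a safety check); by itself it should not change the loaded values, so the drop from ~70% to ~50% is more likely caused by something else in the setup. A small sketch:
import numpy as np
import torch
# Allowlist only the numpy global named in the error, keeping weights_only=True.
torch.serialization.add_safe_globals([np.core.multiarray._reconstruct])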
@nikriz1 So using an action chunk of 1 gives better performance?
@JustinKai0527 Yes, that’s right. I got similar results when evaluating on Libero-Spatial — from 65% to 77%. Due to limited computational resources, I only trained for 80k steps, but still arrived at a similar conclusion.
# --- Load Policy ---
policy = SmolVLAPolicy.from_pretrained(args.policy_path)
policy.config.n_action_steps = 1
policy.to(args.device)
policy.eval()
@QZepHyr I changed my evaluation code accordingly, but the accuracy collapsed. Does the training also need to use action chunk = 1?