[Question] SmolVLA LIBERO / MetaWorld evaluation
Hello, thank you for open-sourcing this wonderful repository. I read the SmolVLA paper with great interest and tried to run some evaluations.
In Section 4.5 of the paper, under Simulation Evaluation, it seems that you fine-tuned the SmolVLA baseline for the Franka Emika Panda and the Sawyer arm to evaluate on the LIBERO and Meta-World benchmarks, respectively. Could you elaborate on the details of the fine-tuning process (which parameters were trained/frozen, optimizer, number of gradient steps, etc.)? I am planning to reproduce the results.
Thank you.
Hi, I also had a similar question regarding the fine-tuning process for the LIBERO and Meta-World evaluations.
If you're able to share any pretrained weights or evaluation scripts for these benchmarks, that would be greatly appreciated. Any related visualizations or logs would also be very helpful.
Thanks again for your excellent work and for open-sourcing the project!
@tykim0507 @bigchou
Hi, I also tried to evaluate SmolVLA, but I am not sure how to evaluate it on LIBERO and Meta-World. It seems that LeRobot does not directly support evaluation on LIBERO and Meta-World.
If you could share how to evaluate a model trained in the LeRobot framework on LIBERO and Meta-World, it would be greatly appreciated.
Thanks again.
Same problem.
Same problem.
Hi team,
I'm currently working on reproducing results and training models on the LIBERO dataset for evaluation within the LIBERO simulation environment. While I've been able to train a model and evaluate it in the simulator, the observed success rate is extremely low.
Initially, I attempted to use the OpenPI LIBERO dataset. However, I've noticed that the LeRobot dataset version seems to be undergoing changes, and the action definitions appear to differ from what I'm expecting. This inconsistency makes it challenging to achieve satisfactory performance.
It would be immensely helpful if the team could provide an open-source example, similar to the OpenPI approach, specifically demonstrating how to effectively train and evaluate models using the LIBERO dataset and simulation environment. Such an example would be invaluable for the community and for users like myself who are trying to achieve higher success rates.
I plan to upload my current script for reference once I've made further progress, in case it can be helpful to others.
Thank you for considering this request!
For training, I'm using the following command:
python lerobot/scripts/train.py \
--policy.type=smolvla \
--dataset.repo_id=~/.cache/huggingface/hub/datasets--aopolin-lv--libero_spatial_no_noops_lerobot_v21/snapshots/bfe55b41cc3103d672ad7204257c9bdc547410cb \
--batch_size=64 \
--steps=200000
I'm specifically using the dataset found at: https://huggingface.co/datasets/aopolin-lv/libero_spatial_no_noops_lerobot_v21
This dataset is in LeRobot dataset format v2.1; the OpenPI LIBERO dataset (which is v2.0) is not compatible with the current setup.
For evaluation, I've written a simple script, eval_LIBERO.py, inspired by the OpenPI examples. I run it with the following command:
python lerobot/scripts/eval_LIBERO.py --policy_path=outputs/train/2025-06-30/21-19-58_smolvla/checkpoints/last/pretrained_model
Both training and evaluation are conducted on the libero_spatial task. Despite this, the success rate I'm observing is very low. I suspect there might be a mistake in my setup, but the overall pipeline seems functional, so anyone interested in replicating this can follow these steps. Note that the gripper action appears to be reversed between the dataset and the simulator.
This may be because the dataset was not converted correctly; this link may be helpful: https://github.com/Tavish9/any4lerobot/tree/main/libero2lerobot
- eval_LIBERO.py
"""
This script demonstrates how to evaluate a pretrained smolVLA policy on the LIBERO benchmark.
"""
import collections
import dataclasses
import logging
import math
import pathlib
import cv2
import draccus
import imageio
import numpy as np
import torch
from libero.libero import benchmark, get_libero_path
from libero.libero.envs import OffScreenRenderEnv
from tqdm import tqdm
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy
LIBERO_DUMMY_ACTION = [0.0] * 6 + [-1.0]
LIBERO_ENV_RESOLUTION = 256 # resolution used to render training data
@dataclasses.dataclass
class Args:
"""
Evaluation arguments for smolVLA on LIBERO.
"""
# --- Hugging Face arguments ---
policy_path: str = "lerobot/smolvla_base"
"""Path to the pretrained policy on the Hugging Face Hub or local directory."""
# --- LIBERO environment-specific parameters ---
task_suite_name: str = "libero_spatial"
"""Task suite. Options: libero_spatial, libero_object, libero_goal, libero_10, libero_90"""
num_steps_wait: int = 10
"""Number of steps to wait for objects to stabilize in sim."""
num_trials_per_task: int = 50
"""Number of rollouts per task."""
# --- Evaluation arguments ---
video_out_path: str = "data/libero/videos"
"""Path to save videos."""
device: str = "cuda"
"""Device to use for evaluation."""
seed: int = 7
"""Random Seed (for reproducibility)"""
@draccus.wrap()
def eval_libero(args: Args) -> None:
# Set random seed
torch.manual_seed(args.seed)
np.random.seed(args.seed)
# --- Load Policy ---
policy = SmolVLAPolicy.from_pretrained(args.policy_path)
policy.to(args.device)
policy.eval()
# --- Initialize LIBERO task suite ---
benchmark_dict = benchmark.get_benchmark_dict()
try:
task_suite = benchmark_dict[args.task_suite_name]()
except KeyError:
raise ValueError(
f"Unknown task suite: {args.task_suite_name}. "
f"Available options are: {list(benchmark_dict.keys())}"
)
num_tasks_in_suite = task_suite.n_tasks
logging.info(f"Task suite: {args.task_suite_name}")
pathlib.Path(args.video_out_path).mkdir(parents=True, exist_ok=True)
if args.task_suite_name == "libero_spatial":
max_steps = 220 # longest training demo has 193 steps
elif args.task_suite_name == "libero_object":
max_steps = 280 # longest training demo has 254 steps
elif args.task_suite_name == "libero_goal":
max_steps = 300 # longest training demo has 270 steps
elif args.task_suite_name == "libero_10":
max_steps = 520 # longest training demo has 505 steps
elif args.task_suite_name == "libero_90":
max_steps = 400 # longest training demo has 373 steps
else:
# Fallback for custom task suites
max_steps = 520
# --- Evaluation Loop ---
total_episodes, total_successes = 0, 0
for task_id in tqdm(range(num_tasks_in_suite), desc="Tasks"):
# Get task
task = task_suite.get_task(task_id)
# Get default LIBERO initial states
initial_states = task_suite.get_task_init_states(task_id)
# Initialize LIBERO environment and task description
env, task_description = _get_libero_env(task, LIBERO_ENV_RESOLUTION, args.seed)
# Start episodes
task_episodes, task_successes = 0, 0
for episode_idx in tqdm(
range(min(args.num_trials_per_task, len(initial_states))),
desc=f"Task {task_id}: {task.language}",
leave=False,
):
logging.info(f"\nTask: {task_description}")
# Reset environment and policy
env.reset()
policy.reset()
# Set initial states
obs = env.set_init_state(initial_states[episode_idx])
# IMPORTANT: Do nothing for the first few timesteps because the simulator drops objects
# and we need to wait for them to fall
for _ in range(args.num_steps_wait):
obs, _, _, _ = env.step(LIBERO_DUMMY_ACTION)
# Setup
t = 0
frames = []
done = False
# Add initial frame
agentview_image = np.ascontiguousarray(obs["agentview_image"][::-1, ::-1])
# frames.append(agentview_image)
# import ipdb; ipdb.set_trace()
logging.info(f"Starting episode {task_episodes+1}...")
while t < max_steps:
try:
# Get preprocessed image
# IMPORTANT: rotate 180 degrees to match train preprocessing
wrist_img = np.ascontiguousarray(obs["robot0_eye_in_hand_image"][::-1, ::-1])
agentview_image = np.ascontiguousarray(obs["agentview_image"][::-1, ::-1])
frames.append(agentview_image)
# Prepare observations dict
state = np.concatenate(
(
obs["robot0_eef_pos"],
_quat2axisangle(obs["robot0_eef_quat"]),
obs["robot0_gripper_qpos"],
)
)
observation = {
"observation.images.image": torch.from_numpy(agentview_image / 255.0)
.permute(2, 0, 1)
.to(torch.float32)
.to(args.device).unsqueeze(0),
"observation.images.wrist_image": torch.from_numpy(wrist_img / 255.0)
.permute(2, 0, 1)
.to(torch.float32)
.to(args.device).unsqueeze(0),
"observation.state": torch.from_numpy(state).to(torch.float32).to(args.device).unsqueeze(0),
"task": task_description,
}
# Query model to get action
with torch.inference_mode():
action_tensor = policy.select_action(observation)
action = action_tensor.cpu().numpy()[0]
action[-1] = 1 - action[-1]
# Execute action in environment
obs, _, done, _ = env.step(action)
if done:
task_successes += 1
total_successes += 1
break
t += 1
except Exception as e:
logging.error(f"Caught exception: {e}")
break
task_episodes += 1
total_episodes += 1
# Save a replay video of the episode
suffix = "success" if done else "failure"
task_segment = task_description.replace(" ", "_").replace("/", "_")
video_path = (
pathlib.Path(args.video_out_path) / f"rollout_task_{task_id}_episode_{episode_idx}_{task_segment}_{suffix}.mp4"
)
fps = 30
writer = imageio.get_writer(video_path, fps=fps)
for image in frames:
writer.append_data(image)
writer.close()
logging.info(f"Saved video to {video_path}")
# Log current results
logging.info(f"Success: {done}")
if total_episodes > 0:
logging.info(f"# episodes completed so far: {total_episodes}")
logging.info(f"# successes: {total_successes} ({total_successes / total_episodes * 100:.1f}%)")
# Log final results for the task
if task_episodes > 0:
logging.info(f"Task {task_id} success rate: {float(task_successes) / float(task_episodes):.2f}")
if total_episodes > 0:
logging.info(f"Cumulative success rate: {float(total_successes) / float(total_episodes):.2f}")
logging.info("--- Evaluation finished ---")
if total_episodes > 0:
logging.info(f"Total success rate: {float(total_successes) / float(total_episodes):.2f}")
logging.info(f"Total episodes: {total_episodes}")
logging.info(f"Total successes: {total_successes}")
cv2.destroyAllWindows()
def _get_libero_env(task, resolution, seed):
"""Initializes and returns the LIBERO environment, along with the task description."""
task_description = task.language
task_bddl_file = pathlib.Path(get_libero_path("bddl_files")) / task.problem_folder / task.bddl_file
env_args = {
"bddl_file_name": str(task_bddl_file),
"camera_heights": resolution,
"camera_widths": resolution,
}
env = OffScreenRenderEnv(**env_args)
env.seed(seed) # IMPORTANT: seed seems to affect object positions even when using fixed initial state
return env, task_description
def _quat2axisangle(quat):
"""
Copied from robosuite:
https://github.com/ARISE-Initiative/robosuite/blob/eafb81f54ffc104f905ee48a16bb15f059176ad3/robosuite/utils/transform_utils.py#L490C1-L512C55
"""
# clip quaternion
if quat[3] > 1.0:
quat[3] = 1.0
elif quat[3] < -1.0:
quat[3] = -1.0
den = np.sqrt(1.0 - quat[3] * quat[3])
if math.isclose(den, 0.0):
# This is (close to) a zero degree rotation, immediately return
return np.zeros(3)
return (quat[:3] * 2.0 * math.acos(quat[3])) / den
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO)
eval_libero()
- some success result
https://github.com/user-attachments/assets/9ba1c8b0-b6af-4013-8df3-94a0e9db36cb
@zlw21gxy Thanks for sharing the evaluation code. I want to try the pretrained weights zero-shot, but I got the error below. How can I obtain the stats?
INFO:root:# episodes completed so far: 10
Task: pick up the black bowl between the plate and the ramekin and place it on the plate
ERROR:root:Caught exception: `mean` is infinity. You should either initialize with `stats` as an argument, or use a pretrained model.
You may want to check whether the camera names you pass to the model align with those in the dataset.
@zijian0615 Thanks for replying. What do you mean by the camera name? I know that policy = SmolVLAPolicy.from_pretrained('lerobot/smolvla_base') loads the model; what I need is the dataset_stats, but how do I get them when using a pretrained model?
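For reference, here is a minimal sketch of one way to obtain the dataset statistics: read them from the LIBERO LeRobotDataset used for training. This is an assumption-based sketch rather than an official recipe; in particular, whether the stats can simply be passed to SmolVLAPolicy at construction (via a dataset_stats argument, as other lerobot policies accept) depends on your lerobot version and should be verified.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
# Load the LIBERO dataset only to read its precomputed normalization statistics.
dataset = LeRobotDataset("aopolin-lv/libero_spatial_no_noops_lerobot_v21")
stats = dataset.meta.stats  # dict: feature name -> {"mean": ..., "std": ..., "min": ..., "max": ...}
print(list(stats.keys()))  # e.g. observation.state, action, observation.images.*
These stats are what the `mean` is infinity error is asking for: they need to end up in the policy's normalization buffers (the buffer_observation_state / buffer_action buffers mentioned later in this thread), either at construction time or by overwriting them after loading.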
Hello, I followed your instructions, but when my training reaches around 120,000 steps, the loss stays at 0.015 and won't decrease any further. Moreover, the test success rate remains 0 throughout. Is it necessary to train for the full 200,000 steps to see results? Or could there be a version issue with my pip environment? Could you please share your pip environment, such as the versions of Python libraries like torch and torchvision?
A quick update: following OpenVLA, I modified the evaluation script and achieved approximately a 72% success rate on the libero_spatial environment. I believe this version of the evaluation script is correct. However, the model was trained only on the libero_spatial dataset, so further improvements may require training on a larger dataset.
"""
This script demonstrates how to evaluate a pretrained smolVLA policy on the LIBERO benchmark.
"""
import collections
import dataclasses
import logging
import math
import pathlib
import os
import cv2
import draccus
import imageio
import numpy as np
import torch
from libero.libero import benchmark, get_libero_path
from libero.libero.envs import OffScreenRenderEnv
from tqdm import tqdm
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy
os.environ["TOKENIZERS_PARALLELISM"] = "false"
LIBERO_DUMMY_ACTION = [0.0] * 6 + [-1.0]
LIBERO_ENV_RESOLUTION = 256 # resolution used to render training data
def normalize_gripper_action(action, binarize=True):
"""
Changes gripper action (last dimension of action vector) from [0,1] to [-1,+1].
Necessary for some environments (not Bridge) because the dataset wrapper standardizes gripper actions to [0,1].
Note that unlike the other action dimensions, the gripper action is not normalized to [-1,+1] by default by
the dataset wrapper.
Normalization formula: y = 2 * (x - orig_low) / (orig_high - orig_low) - 1
"""
# Just normalize the last action to [-1,+1].
orig_low, orig_high = 0.0, 1.0
action[..., -1] = 2 * (action[..., -1] - orig_low) / (orig_high - orig_low) - 1
if binarize:
# Binarize to -1 or +1.
action[..., -1] = np.sign(action[..., -1])
return action
def invert_gripper_action(action):
"""
Flips the sign of the gripper action (last dimension of action vector).
This is necessary for some environments where -1 = open, +1 = close, since
the RLDS dataloader aligns gripper actions such that 0 = close, 1 = open.
"""
action[..., -1] = action[..., -1] * -1.0
return action
@dataclasses.dataclass
class Args:
"""
Evaluation arguments for smolVLA on LIBERO.
"""
# --- Hugging Face arguments ---
policy_path: str = "lerobot/smolvla_base"
"""Path to the pretrained policy on the Hugging Face Hub or local directory."""
# --- LIBERO environment-specific parameters ---
task_suite_name: str = "libero_spatial"
"""Task suite. Options: libero_spatial, libero_object, libero_goal, libero_10, libero_90"""
num_steps_wait: int = 10
"""Number of steps to wait for objects to stabilize in sim."""
num_trials_per_task: int = 50
"""Number of rollouts per task."""
# --- Evaluation arguments ---
video_out_path: str = "data/libero/videos"
"""Path to save videos."""
device: str = "cuda"
"""Device to use for evaluation."""
seed: int = 7
"""Random Seed (for reproducibility)"""
@draccus.wrap()
def eval_libero(args: Args) -> None:
# Set random seed
torch.manual_seed(args.seed)
np.random.seed(args.seed)
# --- Load Policy ---
policy = SmolVLAPolicy.from_pretrained(args.policy_path)
policy.to(args.device)
policy.eval()
# --- Initialize LIBERO task suite ---
benchmark_dict = benchmark.get_benchmark_dict()
try:
task_suite = benchmark_dict[args.task_suite_name]()
except KeyError:
raise ValueError(
f"Unknown task suite: {args.task_suite_name}. "
f"Available options are: {list(benchmark_dict.keys())}"
)
num_tasks_in_suite = task_suite.n_tasks
logging.info(f"Task suite: {args.task_suite_name}")
pathlib.Path(args.video_out_path).mkdir(parents=True, exist_ok=True)
if args.task_suite_name == "libero_spatial":
max_steps = 220 # longest training demo has 193 steps
elif args.task_suite_name == "libero_object":
max_steps = 280 # longest training demo has 254 steps
elif args.task_suite_name == "libero_goal":
max_steps = 300 # longest training demo has 270 steps
elif args.task_suite_name == "libero_10":
max_steps = 520 # longest training demo has 505 steps
elif args.task_suite_name == "libero_90":
max_steps = 400 # longest training demo has 373 steps
else:
# Fallback for custom task suites
max_steps = 520
# --- Evaluation Loop ---
total_episodes, total_successes = 0, 0
for task_id in tqdm(range(num_tasks_in_suite), desc="Tasks"):
# Get task
task = task_suite.get_task(task_id)
# Get default LIBERO initial states
initial_states = task_suite.get_task_init_states(task_id)
# Initialize LIBERO environment and task description
env, task_description = _get_libero_env(task, LIBERO_ENV_RESOLUTION, args.seed)
# Start episodes
task_episodes, task_successes = 0, 0
for episode_idx in tqdm(
range(min(args.num_trials_per_task, len(initial_states))),
desc=f"Task {task_id}: {task.language}",
leave=False,
):
logging.info(f"\nTask: {task_description}")
# Reset environment and policy
env.reset()
policy.reset()
# Set initial states
obs = env.set_init_state(initial_states[episode_idx])
# IMPORTANT: Do nothing for the first few timesteps because the simulator drops objects
# and we need to wait for them to fall
for _ in range(args.num_steps_wait):
obs, _, _, _ = env.step(LIBERO_DUMMY_ACTION)
# Setup
t = 0
frames = []
done = False
# Add initial frame
agentview_image = np.ascontiguousarray(obs["agentview_image"][::-1, ::-1])
# frames.append(agentview_image)
# import ipdb; ipdb.set_trace()
logging.info(f"Starting episode {task_episodes+1}...")
while t < max_steps:
try:
# Get preprocessed image
# IMPORTANT: rotate 180 degrees to match train preprocessing
wrist_img = np.ascontiguousarray(obs["robot0_eye_in_hand_image"][::-1, ::-1])
agentview_image = np.ascontiguousarray(obs["agentview_image"][::-1, ::-1])
frames.append(agentview_image)
# Prepare observations dict
state = np.concatenate(
(
obs["robot0_eef_pos"],
_quat2axisangle(obs["robot0_eef_quat"]),
obs["robot0_gripper_qpos"],
)
)
observation = {
"observation.images.image": torch.from_numpy(agentview_image / 255.0)
.permute(2, 0, 1)
.to(torch.float32)
.to(args.device).unsqueeze(0),
"observation.images.wrist_image": torch.from_numpy(wrist_img / 255.0)
.permute(2, 0, 1)
.to(torch.float32)
.to(args.device).unsqueeze(0),
"observation.state": torch.from_numpy(state).to(torch.float32).to(args.device).unsqueeze(0),
"task": task_description,
}
# Query model to get action
with torch.inference_mode():
action_tensor = policy.select_action(observation)
action = action_tensor.cpu().numpy()[0]
# action[-1] = 1 - action[-1]
action = normalize_gripper_action(action, binarize=False)
action = invert_gripper_action(action)
# Execute action in environment
obs, _, done, _ = env.step(action)
if done:
task_successes += 1
total_successes += 1
break
t += 1
except Exception as e:
logging.error(f"Caught exception: {e}")
break
task_episodes += 1
total_episodes += 1
# Save a replay video of the episode
suffix = "success" if done else "failure"
task_segment = task_description.replace(" ", "_").replace("/", "_")
video_path = (
pathlib.Path(args.video_out_path) / f"rollout_task_{task_id}_episode_{episode_idx}_{task_segment}_{suffix}.mp4"
)
fps = 30
writer = imageio.get_writer(video_path, fps=fps)
for image in frames:
writer.append_data(image)
writer.close()
logging.info(f"Saved video to {video_path}")
# import ipdb; ipdb.set_trace()
# Log current results
logging.info(f"Success: {done}")
if total_episodes > 0:
logging.info(f"# episodes completed so far: {total_episodes}")
logging.info(f"# successes: {total_successes} ({total_successes / total_episodes * 100:.1f}%)")
# Log final results for the task
if task_episodes > 0:
logging.info(f"Task {task_id} success rate: {float(task_successes) / float(task_episodes):.2f}")
if total_episodes > 0:
logging.info(f"Cumulative success rate: {float(total_successes) / float(total_episodes):.2f}")
logging.info("--- Evaluation finished ---")
if total_episodes > 0:
logging.info(f"Total success rate: {float(total_successes) / float(total_episodes):.2f}")
logging.info(f"Total episodes: {total_episodes}")
logging.info(f"Total successes: {total_successes}")
# cv2.destroyAllWindows()
def _get_libero_env(task, resolution, seed):
"""Initializes and returns the LIBERO environment, along with the task description."""
task_description = task.language
task_bddl_file = pathlib.Path(get_libero_path("bddl_files")) / task.problem_folder / task.bddl_file
env_args = {
"bddl_file_name": str(task_bddl_file),
"camera_heights": resolution,
"camera_widths": resolution,
}
env = OffScreenRenderEnv(**env_args)
env.seed(seed) # IMPORTANT: seed seems to affect object positions even when using fixed initial state
return env, task_description
def _quat2axisangle(quat):
"""
Copied from robosuite:
https://github.com/ARISE-Initiative/robosuite/blob/eafb81f54ffc104f905ee48a16bb15f059176ad3/robosuite/utils/transform_utils.py#L490C1-L512C55
"""
# clip quaternion
if quat[3] > 1.0:
quat[3] = 1.0
elif quat[3] < -1.0:
quat[3] = -1.0
den = np.sqrt(1.0 - quat[3] * quat[3])
if math.isclose(den, 0.0):
# This is (close to) a zero degree rotation, immediately return
return np.zeros(3)
return (quat[:3] * 2.0 * math.acos(quat[3])) / den
if __name__ == "__main__":
logging.basicConfig(level=logging.INFO)
eval_libero()
I’m using the lerobot repository at commit 483be9aac217c2d8ef16982490f22b2ad091ab46. The versions of some key packages are as follows:
torch 2.7.1
torchaudio 2.7.1
torchcodec 0.4.0
torchvision 0.22.1
I've also tried zlw21gxy's SmolVLA training and evaluation on LIBERO-spatial, thanks for sharing.
I trained with the command below, as provided, with trivial modifications:
python lerobot/scripts/train.py \
--policy.type=smolvla \
--dataset.repo_id=aopolin-lv/libero_spatial_no_noops_lerobot_v21 \
--batch_size=64 \
--steps=200000 \
--policy.device=cuda \
--wandb.enable=true \
--save_freq 10000 \
--output_dir=outputs/trainv5/libero_smolvla_scratch \
--job_name=libero_smolvla_scratch
Training loss stayed near 0.02 from step 25k.
The success rate for LIBERO-spatial was 26%; does this align with your "low success rate"?
The tasks usually fail because the gripper closes prematurely, even though most of the target locations are reached correctly.
Update: After running the updated code, the success rate seems to be in the low 70s. Thanks for the update!
I also ran training with the dataset from https://huggingface.co/datasets/openvla/modified_libero_rlds, using the LeRobot training code; the command is below:
python lerobot/scripts/train.py --policy.path=lerobot/smolvla_base --dataset.repo_id=your_hf_username/libero --dataset.root=/work/.cache/huggingface/lerobot/your_hf_username/libero --batch_size=64 --steps=30000 --policy.push_to_hub=false --wandb.enable=True
I also found that my loss goes down to 0.01, but validation gives a very low success rate due to the same gripper opening/closing issue, even though the position and rotation look right. Below are the demo videos and the loss graph:
https://github.com/user-attachments/assets/a560c699-5803-4bed-b14e-c22d55aeae60 https://github.com/user-attachments/assets/85dbb1e1-3775-457b-99bd-338a4574d21c https://github.com/user-attachments/assets/ed052a15-7515-4a34-a845-1f79adcf246c
Here's an update on my progress: I've been training on the aopolin-lv/libero_object_no_noops_lerobot_v21 dataset. When I fine-tuned from smolvla_base and ran the test script provided by zlw21gxy, the success rate was 66%. However, according to the paper, SmolVLA was trained from scratch on the LIBERO dataset. So I tried doing the same with the command:
python lerobot/scripts/train.py --policy.type=smolvla --policy.vlm_model_name=/home/hdp/smolvla/SmolVLM2-500M-Video-Instruct --dataset.repo_id=aopolin-lv/libero_object_no_noops_lerobot_v21 --dataset.root=/home/hdp/smolvla/libero_object_v21 --batch_size=128 --steps=200000
But the success rate is 0%.
Did I do something wrong? @zlw21gxy @nikriz1
@hahans You can try the following command, as @nikriz1 did, and use the latest evaluation script I updated; I believe it should work. This setup trains the model from scratch:
python lerobot/scripts/train.py \
--policy.type=smolvla \
--dataset.repo_id=aopolin-lv/libero_spatial_no_noops_lerobot_v21 \
--batch_size=64 \
--steps=200000 \
--policy.device=cuda \
--wandb.enable=true \
--save_freq 10000 \
--output_dir=outputs/trainv5/libero_smolvla_scratch \
--job_name=libero_smolvla_scratch
@zlw21gxy Thank you for sharing the scripts. I notice that both you and @nikriz1 train from scratch on the LIBERO-spatial dataset. Can we also fine-tune the smolvla_base model on this dataset and then evaluate? Thanks for your help!
Hi guys, have you tried the Meta-World benchmark? I am reproducing pi0 on Meta-World, but the performance is low...
My guess is that fine-tuning from smolvla_base model might actually hurt performance, especially if the dataset contains cross-embodiment data. It could confuse the model rather than help.
Appreciate any thoughts or clarification on this!
@zlw21gxy Can you share the SmolVLA performance you measured on LIBERO? I got 70~75% on libero_spatial using the pretrained smolvla_base.
I’m using the evaluation code mentioned earlier to directly evaluate the released model lerobot/smolvla_base on the libero_spatial dataset.
In addition, when I tried to follow the official example of finetuning the SmolVLA neural network (with pretrained VLM and the action expert initialized from scratch), I encountered a different error during evaluation on the libero_spatial dataset:
ERROR:root:Caught exception: The size of tensor a (8) must match the size of tensor b (6) at non-singleton dimension 1
May I ask how you evaluated the pretrained smolvla_base model on the libero_spatial benchmark? I’d appreciate any clarification or working configuration details. By the way, following the same setup mentioned earlier, I trained on the aopolin-lv/libero_spatial_no_noops_lerobot_v21 dataset. At 6k training steps, the model achieved a 67% success rate on the libero_spatial benchmark, and at 8k steps, the success rate was 65%. Due to machine interruptions, I was only able to complete up to 8k steps. Just for your reference.
Hi, have you found a solution to the `mean` is infinity issue? I'm facing the same problem when trying to evaluate the pretrained weights directly on the libero_spatial dataset.
@QZepHyr Maybe you should manually change the model config's ['observation.state']['shape'] to 8 and ['action']['shape'] to 7, or skip loading buffer_observation_state and buffer_action in the load_smolvla function.
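A rough illustration of the first suggestion, as a hypothetical sketch only: the exact key layout of the saved config.json can differ between lerobot versions, so the input_features / output_features keys below are assumptions to check against your checkpoint.
import json
import pathlib
# Hypothetical example: edit the saved policy config so the state/action shapes
# match LIBERO (8-dim state, 7-dim action). Verify these key names in your config.json.
cfg_path = pathlib.Path("outputs/train/my_run/checkpoints/last/pretrained_model/config.json")  # your checkpoint
cfg = json.loads(cfg_path.read_text())
cfg["input_features"]["observation.state"]["shape"] = [8]
cfg["output_features"]["action"]["shape"] = [7]
cfg_path.write_text(json.dumps(cfg, indent=2))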
@JustinKai0527 @zlw21gxy Regarding the low performance, I found an important sentence in Section 4.3, Implementation details:
In simulation, we perform inference by sampling new observations and predicting a new action after each executed action.
So the reported performance was obtained by executing a single step per inference.
My final success rate for LIBERO-spatial was 66.8% (maybe I had a bad seed) when trained on LIBERO-spatial from scratch and executing the full action chunk (50), which improved to 82% when executing a single step per inference. I think this could be even higher when applied to your models that already reach success rates in the 7x% range.
@chenkang455 maybe changing this could also improve performance on Meta-World.
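Concretely, single-step execution can be reproduced by shrinking the executed action horizon to 1 so the policy re-plans after every environment step. A minimal sketch, mirroring the snippet shared further down in this thread:
# Execute one action per inference instead of the full 50-step chunk.
policy = SmolVLAPolicy.from_pretrained(args.policy_path)
policy.config.n_action_steps = 1  # re-query the model after every executed action
policy.to(args.device)
policy.eval()
policy.reset()  # clear any previously queued action chunk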
@bairuofei @hahans To train with a dataset combining all four LIBERO task suites, I merged the datasets shared by aopolin-lv and uploaded the result (link).
I'm currently training both from scratch and from smolvla_base, for 100k steps as stated in the paper:
For fine-tuning on simulation benchmarks, we train for 100,000 steps with a batch size of 64.
Results of all trials are as follows; some are still ongoing.
| Training | Steps | Eval Trials | Act H | Spatial | Object | Goal | 10 |
|---|---|---|---|---|---|---|---|
| Paper | 100k | 100 | 1 | 90 | 96 | 92 | 71 |
| Spatial scratch | 200k | 500 | 50 | 66.8 | - | - | - |
| Spatial scratch | 200k | 100 | 1 | 82 | - | - | - |
| All scratch | 100k | 500 | 50 | 66 | 67.2 | 75 | 31 |
| All scratch | 100k | 100 | 1 | 82 | 86 | 83 | 45 |
| All finetune | 100k | 100 | 1 | 86 | 88 | 83 | 46 |
Update: Setting the execution horizon to 1 gave the biggest gains, and fine-tuning from smolvla_base did help a little, but I failed to reproduce the paper's results. Manually going through the evaluation results, there are some cases where the task is solved but is marked as a failure, like the one below.
LIBERO-goal: open the middle drawer of the cabinet
@nikriz1 You're right — it appears that lerobot/smolvla_base was not pretrained using the lerobot/svla_so100_stacking dataset. The svla_so100_stacking dataset provides observations of length 6 and actions of length 6, whereas the libero_spatial benchmark expects observations and actions of lengths 8 and 7, respectively.
By the way, have you encountered the following issue when directly evaluating with lerobot/smolvla_base? If so, how did you resolve it? It seems that the released weights might be missing some components.
INFO:root:# episodes completed so far: 10
Task: pick up the black bowl between the plate and the ramekin and place it on the plate
ERROR:root:Caught exception: `mean` is infinity. You should either initialize with `stats` as an argument, or use a pretrained model.
Sorry I never tried smolvla_base directly without training on LIBERO
@nikriz1
I trained a model from scratch on the dataset nikriz/aopoli-lv-libero_combined_no_noops_lerobot_v21, which should correspond to the following configuration from your table:
| Training | Steps | Eval Trials | Act H | Spatial | Object | Goal | 10 |
|---|---|---|---|---|---|---|---|
| All scratch (community) | 100k | 100 | 1 | 82.0 | 86.0 | 83.0 | 45.0 |
My evaluation result on the Spatial task was 85.0, slightly higher than reported.
However, on the Object task, the model failed to complete any tasks (success rate = 0). A sample video is attached at the end, along with the full training and evaluation setup below. Could you help me understand whether there's anything wrong in my code, or if any of my settings differ from yours?
Training Configuration
python lerobot/src/lerobot/scripts/train.py \
--policy.type=smolvla \
--dataset.repo_id=smol_vla/dataset_lerobot \
--batch_size=64 \
--steps=100000 \
--output_dir=model_finetune_all \
--job_name=my_smolvla_finetuning \
--policy.device=cuda \
--wandb.enable=true \
--policy.push_to_hub=false
Evaluation Setup
# --- Load Policy ---
policy = SmolVLAPolicy.from_pretrained(args.policy_path)
policy.config.n_action_steps = 1
policy.to(args.device)
policy.eval()
@dataclasses.dataclass
class Args:
"""
Evaluation arguments for smolVLA on LIBERO.
"""
# --- Hugging Face arguments ---
policy_path: str = "model_pretraining_all/checkpoints/last/pretrained_model"
"""Path to the pretrained policy on the Hugging Face Hub or local directory."""
# --- LIBERO environment-specific parameters ---
task_suite_name: str = "libero_spatial"
"""Task suite. Options: libero_spatial, libero_object, libero_goal, libero_10, libero_90"""
num_steps_wait: int = 10
"""Number of steps to wait for objects to stabilize in sim."""
num_trials_per_task: int = 100 # 50
"""Number of rollouts per task."""
# --- Evaluation arguments ---
video_out_path: str = "video"
"""Path to save videos."""
device: str = "cuda"
"""Device to use for evaluation."""
seed: int = 7
"""Random Seed (for reproducibility)"""
Video
https://github.com/user-attachments/assets/5f84214f-7b83-44cc-a93a-e2836b391048
@zlw21gxy Thank you very much for your evaluation script. I was trying to run it but encountered a problem; I am not sure if it's an issue with my LIBERO installation. I followed the steps you described so far, including training with the same setup. The error I am encountering is:
raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
(1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
WeightsUnpickler error: Unsupported global: GLOBAL numpy.core.multiarray._reconstruct was not an allowed global by default. Please use `torch.serialization.add_safe_globals([numpy.core.multiarray._reconstruct])` or the `torch.serialization.safe_globals([numpy.core.multiarray._reconstruct])` context manager to allowlist this global if you trust this class/function.
What could this error be about?
Update: I've added the following snippet to the eval script. The code now runs, but my success rate seems to have dropped to around 50% instead of the ~70% described above. Does setting weights_only to False affect performance?
_orig_torch_load = torch.load
def _unsafe_torch_load(f, *args, **kwargs):
return _orig_torch_load(f, *args, weights_only=False, **kwargs)
torch.load = _unsafe_torch_load
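As an alternative to globally patching torch.load, the allowlist route suggested by the PyTorch 2.6 error message itself should also work. And regarding performance: weights_only only controls how the checkpoint pickle is deserialized (a safety check); by itself it should not change the loaded values, so the drop from ~70% to ~50% is more likely caused by something else in the setup. A small sketch:
import numpy as np
import torch
# Allowlist only the numpy global named in the error, keeping weights_only=True.
torch.serialization.add_safe_globals([np.core.multiarray._reconstruct])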
@nikriz1 So using an action chunk of 1 gives better performance?
@JustinKai0527 Yes, that’s right. I got similar results when evaluating on Libero-Spatial — from 65% to 77%. Due to limited computational resources, I only trained for 80k steps, but still arrived at a similar conclusion.
# --- Load Policy ---
policy = SmolVLAPolicy.from_pretrained(args.policy_path)
policy.config.n_action_steps = 1
policy.to(args.device)
policy.eval()
@QZepHyr I changed my evaluation code accordingly, but the accuracy collapsed. Does the training also need to use action chunk = 1?