OpenVLA policy integration

DelinQu opened this pull request 1 year ago • 54 comments

This pull request integrates the OpenVLA policy. The evaluation scripts remain consistent with the original repo under ./scripts/.

DelinQu avatar Jul 05 '24 10:07 DelinQu

Thanks for the contribution! Could you post the success rates for each task?

Could you post a source for the implementation of horizon, pred_action_horizon, exec_horizon, action ensemble, and the sticky gripper? Do they match the real eval?

Code comments:

  • Remove " + OpenVLA policy" in readme
  • OpenVALInference -> OpenVLAInference
  • self.sticky_action_is_on is False in L147 of openvla_model.py: change to (not self.sticky_action_is_on)
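
For the last point, the corrected guard reads as follows (the surrounding lines are quoted from the sticky-gripper block in the full code later in this thread):

# use explicit negation rather than comparing with `is False`
if np.abs(relative_gripper_action) > 0.5 and (not self.sticky_action_is_on):
    self.sticky_action_is_on = True
    self.sticky_gripper_action = relative_gripper_action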

xuanlinli17 avatar Jul 05 '24 16:07 xuanlinli17

Hi xuanlinli17, I have corrected all the typos. The implementation of horizon, pred_action_horizon, exec_horizon, action ensemble, etc. is migrated from Octo. The evaluation results can be downloaded from report_openvla.log.

DelinQu avatar Jul 06 '24 14:07 DelinQu

I see. You might want to get help from the official authors to validate / revise the implementation, as the Bridge results are near zero for some reason, and pick coke can has large variance across different backgrounds. Additionally, it's possible that OpenVLA does not follow the Octo implementation in real deployment.

xuanlinli17 avatar Jul 06 '24 15:07 xuanlinli17

Also, you can modify https://github.com/simpler-env/SimplerEnv/blob/main/tools/calc_metrics_evaluation_videos.py to quickly summarize the results for OpenVLA (just put dummy numbers in place of the real-eval numbers and don't push the script). You can ignore the NaNs.
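
For example, one invocation per task family:

python tools/calc_metrics_evaluation_videos.py --task=pick_coke_can
python tools/calc_metrics_evaluation_videos.py --task=move_near
python tools/calc_metrics_evaluation_videos.py --task=drawer
python tools/calc_metrics_evaluation_videos.py --task=bridge_put_on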

xuanlinli17 avatar Jul 06 '24 15:07 xuanlinli17

I'm also working on implementing OpenVLA into SimplerEnv, and I had the same issue: OpenVLA fails drastically on Bridge. I wonder if that has anything to do with the controller mentioned in #11

HuFY-dev avatar Jul 07 '24 01:07 HuFY-dev

Same here; OpenVLA performs severely worse on the WidowX robot.

sombitd avatar Jul 08 '24 09:07 sombitd

I checked with the authors and I don't think there is action ensembling or action history. Here is the updated code, which you can try:

from typing import Optional, Sequence
import os
import matplotlib.pyplot as plt
import numpy as np
from transforms3d.euler import euler2axangle
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch
import cv2 as cv


class OpenVLAInference:
    def __init__(
        self,
        saved_model_path: str = "openvla/openvla-7b",
        unnorm_key: Optional[str] = None,
        policy_setup: str = "widowx_bridge",
        horizon: int = 1,
        pred_action_horizon: int = 1,
        exec_horizon: int = 1,
        image_size: list[int] = [224, 224],
        action_scale: float = 1.0,
    ) -> None:
        os.environ["TOKENIZERS_PARALLELISM"] = "false"
        if policy_setup == "widowx_bridge":
            unnorm_key = "bridge_orig" if unnorm_key is None else unnorm_key
            self.sticky_gripper_num_repeat = 1
        elif policy_setup == "google_robot":
            unnorm_key = "fractal20220817_data" if unnorm_key is None else unnorm_key
            self.sticky_gripper_num_repeat = 15
        else:
            raise NotImplementedError(
                f"Policy setup {policy_setup} not supported for octo models. The other datasets can be found in the huggingface config.json file."
            )
        self.policy_setup = policy_setup
        self.unnorm_key = unnorm_key

        print(f"*** policy_setup: {policy_setup}, unnorm_key: {unnorm_key} ***")
        self.processor = AutoProcessor.from_pretrained(saved_model_path, trust_remote_code=True)
        self.vla = AutoModelForVision2Seq.from_pretrained(
            saved_model_path,  # use the configured checkpoint rather than a hard-coded model id
            attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
            torch_dtype=torch.bfloat16,
            low_cpu_mem_usage=True,
            trust_remote_code=True,
        ).cuda()

        self.image_size = image_size
        self.action_scale = action_scale
        self.horizon = horizon
        self.pred_action_horizon = pred_action_horizon
        self.exec_horizon = exec_horizon

        self.sticky_action_is_on = False
        self.gripper_action_repeat = 0
        self.sticky_gripper_action = 0.0
        self.previous_gripper_action = None

        self.task = None
        self.task_description = None
        self.num_image_history = 0

    def reset(self, task_description: str) -> None:
        self.task_description = task_description
        self.num_image_history = 0

        self.sticky_action_is_on = False
        self.gripper_action_repeat = 0
        self.sticky_gripper_action = 0.0
        self.previous_gripper_action = None

    def step(
        self, image: np.ndarray, task_description: Optional[str] = None, *args, **kwargs
    ) -> tuple[dict[str, np.ndarray], dict[str, np.ndarray]]:
        """
        Input:
            image: np.ndarray of shape (H, W, 3), uint8
            task_description: Optional[str], task description; if different from previous task description, policy state is reset
        Output:
            raw_action: dict; raw policy action output
            action: dict; processed action to be sent to the maniskill2 environment, with the following keys:
                - 'world_vector': np.ndarray of shape (3,), xyz translation of robot end-effector
                - 'rot_axangle': np.ndarray of shape (3,), axis-angle representation of end-effector rotation
                - 'gripper': np.ndarray of shape (1,), gripper action
                - 'terminate_episode': np.ndarray of shape (1,), 1 if episode should be terminated, 0 otherwise
        """
        if task_description is not None:
            if task_description != self.task_description:
                self.reset(task_description)

        assert image.dtype == np.uint8
        image = self._resize_image(image)

        image: Image.Image = Image.fromarray(image)
        prompt = task_description

        # predict action (7-dof; un-normalize for bridgev2)
        inputs = self.processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
        raw_actions = self.vla.predict_action(**inputs, unnorm_key=self.unnorm_key, do_sample=False)[None]
        # print(f"*** raw actions {raw_actions} ***")

        raw_action = {
            "world_vector": np.array(raw_actions[0, :3]),
            "rotation_delta": np.array(raw_actions[0, 3:6]),
            "open_gripper": np.array(raw_actions[0, 6:7]),  # range [0, 1]; 1 = open; 0 = close
        }

        # process raw_action to obtain the action to be sent to the maniskill2 environment
        action = {}
        action["world_vector"] = raw_action["world_vector"] * self.action_scale
        action_rotation_delta = np.asarray(raw_action["rotation_delta"], dtype=np.float64)
        roll, pitch, yaw = action_rotation_delta
        action_rotation_ax, action_rotation_angle = euler2axangle(roll, pitch, yaw)
        action_rotation_axangle = action_rotation_ax * action_rotation_angle
        action["rot_axangle"] = action_rotation_axangle * self.action_scale

        if self.policy_setup == "google_robot":
            current_gripper_action = raw_action["open_gripper"]
            if self.previous_gripper_action is None:
                relative_gripper_action = np.array([0])
            else:
                relative_gripper_action = self.previous_gripper_action - current_gripper_action
            self.previous_gripper_action = current_gripper_action

            if np.abs(relative_gripper_action) > 0.5 and (not self.sticky_action_is_on):
                self.sticky_action_is_on = True
                self.sticky_gripper_action = relative_gripper_action

            if self.sticky_action_is_on:
                self.gripper_action_repeat += 1
                relative_gripper_action = self.sticky_gripper_action

            if self.gripper_action_repeat == self.sticky_gripper_num_repeat:
                self.sticky_action_is_on = False
                self.gripper_action_repeat = 0
                self.sticky_gripper_action = 0.0

            action["gripper"] = relative_gripper_action

        elif self.policy_setup == "widowx_bridge":
            action["gripper"] = 2.0 * (raw_action["open_gripper"] > 0.5) - 1.0

        action["terminate_episode"] = np.array([0.0])

        return raw_action, action

    def _resize_image(self, image: np.ndarray) -> np.ndarray:
        image = cv.resize(image, tuple(self.image_size), interpolation=cv.INTER_AREA)
        return image

    def visualize_epoch(
        self, predicted_raw_actions: Sequence[np.ndarray], images: Sequence[np.ndarray], save_path: str
    ) -> None:
        images = [self._resize_image(image) for image in images]
        ACTION_DIM_LABELS = ["x", "y", "z", "roll", "pitch", "yaw", "grasp"]

        img_strip = np.concatenate(np.array(images[::3]), axis=1)

        # set up plt figure
        figure_layout = [["image"] * len(ACTION_DIM_LABELS), ACTION_DIM_LABELS]
        plt.rcParams.update({"font.size": 12})
        fig, axs = plt.subplot_mosaic(figure_layout)
        fig.set_size_inches([45, 10])

        # plot actions
        pred_actions = np.array(
            [
                np.concatenate([a["world_vector"], a["rotation_delta"], a["open_gripper"]], axis=-1)
                for a in predicted_raw_actions
            ]
        )
        for action_dim, action_label in enumerate(ACTION_DIM_LABELS):
            # actions have batch, horizon, dim, in this example we just take the first action for simplicity
            axs[action_label].plot(pred_actions[:, action_dim], label="predicted action")
            axs[action_label].set_title(action_label)
            axs[action_label].set_xlabel("Time in one episode")

        axs["image"].imshow(img_strip)
        axs["image"].set_xlabel("Time in one episode (subsampled)")
        plt.legend()
        plt.savefig(save_path)
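
For reference, a minimal usage sketch of this interface; the instruction string, dummy image, and loop length below are placeholders, not values taken from the evaluation scripts:

import numpy as np

policy = OpenVLAInference(policy_setup="widowx_bridge")
policy.reset("put the spoon on the towel")

# stand-in for an RGB observation from the simulator
image = np.zeros((480, 640, 3), dtype=np.uint8)
for _ in range(5):
    raw_action, action = policy.step(image, "put the spoon on the towel")
    # action["world_vector"], action["rot_axangle"], action["gripper"], and
    # action["terminate_episode"] are what get sent to the ManiSkill2 environment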

xuanlinli17 avatar Jul 20 '24 00:07 xuanlinli17

OpenVLA setup requirements:

pip install torch==2.3.1 torchvision==0.18.1 timm==0.9.10 tokenizers==0.15.2 accelerate==0.32.1
pip install flash-attn==2.6.1 --no-build-isolation

Please add these instructions to the README under an "OpenVLA Inference Setup" section.

xuanlinli17 avatar Jul 20 '24 18:07 xuanlinli17

Typos in scripts/openvla_put_in_drawer_variant_agg.sh and scripts/openvla_put_in_drawer_visual_matching.sh: should be "declare -a ckpt_paths=("openvla/openvla-7b")"

xuanlinli17 avatar Jul 21 '24 16:07 xuanlinli17

I checked with the authors and I don't think there is action ensembling or action history. Here is the updated code, which you can try: [code quoted above]

Hello @xuanlinli17, I tried the code above, but OpenVLA still fails on the WidowX tasks. Is it possibly an implementation problem? Should I set any parameters for WidowX?

erjiaxiao avatar Jul 22 '24 07:07 erjiaxiao

Yeah, that's my finding too, but I don't think the authors did any special treatment to evaluate OpenVLA on Bridge.

There might be some coordinate transform on Bridge that is different.
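
A cheap first check (just an illustrative logging hook, not something verified against the Bridge setup) is to dump the raw and processed actions each step and compare sign conventions between the two policy setups:

# place right before `return raw_action, action` in OpenVLAInference.step
np.set_printoptions(precision=3, suppress=True)
print(
    "raw xyz:", raw_action["world_vector"],
    "| raw rpy:", raw_action["rotation_delta"],
    "| raw gripper:", raw_action["open_gripper"],
    "| env rot_axangle:", action["rot_axangle"],
    "| env gripper:", action["gripper"],
)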

xuanlinli17 avatar Jul 22 '24 19:07 xuanlinli17

@xuanlinli17 Thank you! I will continue to look into this. If I find any additional information or solutions, I'll make sure to share it with you.

erjiaxiao avatar Jul 23 '24 00:07 erjiaxiao

When I run bash scripts/openvla_bridge.sh in ./SimplerEnv-OpenVLA, an error appears:

from simpler_env.policies.openvla.openvla_model import OpenVLAInference

ModuleNotFoundError: No module named 'simpler_env.policies.openvla'

How can I fix it?

RZFan525 avatar Jul 28 '24 05:07 RZFan525

^ Run pip install -e . in the SimplerEnv-OpenVLA repo. Also, I don't think the authors have updated the scripts yet; I'll try to push some updates.
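
In other words:

cd SimplerEnv-OpenVLA
pip install -e .
bash scripts/openvla_bridge.sh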

xuanlinli17 avatar Jul 28 '24 16:07 xuanlinli17

I have made a run; here is the full result of OpenVLA:

I used my branch as the codebase: https://github.com/hilookas/SimplerEnv

(note: the "real" results are set to 0)

(openvla) ubuntu@soda:~/SimplerEnv$ python tools/calc_metrics_evaluation_videos.py --task=pick_coke_can
***Pick coke can results***
--------------------
horizontal sim variant avg success {'openvla': 0.711111111111111}
horizontal real success {'openvla': 0}
horizontal MMRV 0.0
horizontal pearson correlation 1
vertical sim variant avg success {'openvla': 0.27111111111111114}
vertical real success {'openvla': 0}
vertical MMRV 0.0
vertical pearson correlation 1
standing sim variant avg success {'openvla': 0.6533333333333333}
standing real success {'openvla': 0}
standing MMRV 0.0
standing pearson correlation 1
avg_orientation_sim_variant_results [0.5451851851851851]
avg_orientation_real_results [0.0]
mean_maximum_rank_violation(avg_orientation_sim_variant_results, avg_orientation_real_results) 0.0
pearson_correlation(avg_orientation_sim_variant_results, avg_orientation_real_results) 1
--------------------
Orientation horizontal, ckpt openvla all robot arm visual matching success: [0.28, 0.2, 0.32, 0.28]
Orientation vertical, ckpt openvla all robot arm visual matching success: [0.0, 0.08, 0.04, 0.0]
Orientation standing, ckpt openvla all robot arm visual matching success: [0.2, 0.08, 0.36, 0.12]
horizontal visual matching sim success {'openvla': 0.27}
horizontal real success {'openvla': 0}
horizontal MMRV 0.0
horizontal pearson correlation 1
horizontal kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=7.976744186046492, pvalue=0.004738208143666835)
vertical visual matching sim success {'openvla': 0.03}
vertical real success {'openvla': 0}
vertical MMRV 0.0
vertical pearson correlation 1
vertical kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=0.9999999999995735, pvalue=0.3173105078630144)
standing visual matching sim success {'openvla': 0.19}
standing real success {'openvla': 0}
standing MMRV 0.0
standing pearson correlation 1
standing kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=5.444444444444439, pvalue=0.019630657257290737)
avg_orientation_sim_visual_matching_results [0.16333333333333336]
avg_orientation_real_results [0.0]
mean_maximum_rank_violation(avg_orientation_sim_visual_matching_results, avg_orientation_real_results) 0.0
pearson_correlation(avg_orientation_sim_visual_matching_results, avg_orientation_real_results) 1
avg kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=12.956521739130327, pvalue=0.0003188089440220541)
********************



(openvla) ubuntu@soda:~/SimplerEnv$ python tools/calc_metrics_evaluation_videos.py --task=move_near
***Move Near results***
--------------------
sim variant avg success {'openvla': 0.4770833333333333}
real success {'openvla': 0}
MMRV 0.0
pearson correlation 1
--------------------
Ckpt openvla all robot arm visual matching success: [0.5333333333333333, 0.43333333333333335, 0.4, 0.48333333333333334]
sim visual matching success {'openvla': 0.4625}
real success {'openvla': 0}
visual matching MMRV 0.0
visual matching pearson correlation 1
avg kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=36.21739130434784, pvalue=1.7648851059881935e-09)
********************



(openvla) ubuntu@soda:~/SimplerEnv$ python tools/calc_metrics_evaluation_videos.py --task=drawer
***Drawer results***
--------------------
open sim variant avg success {'openvla': 0.15873015873015872}
open real success {'openvla': 0}
open MMRV 0.0
open pearson correlation 1
close sim variant avg success {'openvla': 0.19576719576719576}
close real success {'openvla': 0}
close MMRV 0.0
close pearson correlation 1
avg_sim_variant_results [0.17724867724867724]
avg_real_results [0.0]
mean_maximum_rank_violation(avg_sim_variant_results, avg_real_results) 0.0
pearson_correlation(avg_sim_variant_results, avg_real_results) 1
--------------------
Drawer task open, ckpt openvla all robot arm visual matching success: [0.2222222222222222, 0.2222222222222222, 0.1388888888888889, 0.2222222222222222]
Drawer task close, ckpt openvla all robot arm visual matching success: [0.5555555555555556, 0.5, 0.5, 0.5185185185185186]
open visual matching sim success {'openvla': 0.19444444444444442}
open real success {'openvla': 0}
open MMRV 0.0
open pearson correlation 1
open kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=5.408163265306165, pvalue=0.02004279392017093)
close visual matching sim success {'openvla': 0.5185185185185185}
close real success {'openvla': 0}
close MMRV 0.0
close pearson correlation 1
close kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=18.549999999999994, pvalue=1.655052286653112e-05)
avg_sim_visual_matching_results [0.35648148148148145]
avg_real_results [0.0]
mean_maximum_rank_violation(avg_sim_visual_matching_results, avg_real_results) 0.0
pearson_correlation(avg_sim_visual_matching_results, avg_real_results) 1
avg kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=22.842696629213542, pvalue=1.7581614124344239e-06)
********************



(openvla) ubuntu@soda:~/SimplerEnv$ python tools/calc_metrics_evaluation_videos.py --task=bridge_put_on
***Bridge Put On Env results***
********** Results for put_spoon_on_tablecloth **********
sim visual matching partial success {'openvla': 0.041666666666666664}
real partial success {'openvla': 0}
visual matching MMRV (partial success) 0.0
visual matching pearson correlation (partial success)  1
avg kruskal (partial success):
       each checkpoint kruskal:
             KruskalResult(statistic=0.9999999999997733, pvalue=0.3173105078629661)
sim visual matching success {'openvla': 0.0}
real success {'openvla': 0}
visual matching MMRV 0.0
visual matching pearson correlation 1
avg kruskal:
       each checkpoint kruskal:
             all same, 1.0
********************



********** Results for put_carrot_on_plate **********
sim visual matching partial success {'openvla': 0.3333333333333333}
real partial success {'openvla': 0}
visual matching MMRV (partial success) 0.0
visual matching pearson correlation (partial success)  1
avg kruskal (partial success):
       each checkpoint kruskal:
             KruskalResult(statistic=9.399999999999977, pvalue=0.002169854427130147)
sim visual matching success {'openvla': 0.0}
real success {'openvla': 0}
visual matching MMRV 0.0
visual matching pearson correlation 1
avg kruskal:
       each checkpoint kruskal:
             all same, 1.0
********************



********** Results for stack_green_block_on_yellow_block **********
sim visual matching partial success {'openvla': 0.125}
real partial success {'openvla': 0}
visual matching MMRV (partial success) 0.0
visual matching pearson correlation (partial success)  1
avg kruskal (partial success):
       each checkpoint kruskal:
             KruskalResult(statistic=3.133333333333267, pvalue=0.07670675171090896)
sim visual matching success {'openvla': 0.0}
real success {'openvla': 0}
visual matching MMRV 0.0
visual matching pearson correlation 1
avg kruskal:
       each checkpoint kruskal:
             all same, 1.0
********************



********** Results for put_eggplant_in_basket **********
sim visual matching partial success {'openvla': 0.08333333333333333}
real partial success {'openvla': 0}
visual matching MMRV (partial success) 0.0
visual matching pearson correlation (partial success)  1
avg kruskal (partial success):
       each checkpoint kruskal:
             KruskalResult(statistic=2.0434782608695747, pvalue=0.15285977016579425)
sim visual matching success {'openvla': 0.041666666666666664}
real success {'openvla': 0}
visual matching MMRV 0.0
visual matching pearson correlation 1
avg kruskal:
       each checkpoint kruskal:
             KruskalResult(statistic=0.9999999999997733, pvalue=0.3173105078629661)
********************



hilookas avatar Jul 29 '24 08:07 hilookas

For Google Robot pick coke can, looks like the variant aggregation eval of OpenVLA is a lot better than visual matching, which is interesting...

horizontal sim variant avg success {'openvla-7b': 0.6444444444444444}
vertical sim variant avg success {'openvla-7b': 0.2177777777777778}
standing sim variant avg success {'openvla-7b': 0.7288888888888888}
avg_orientation_sim_variant_results [ 0.5303703703703704]

xuanlinli17 avatar Jul 29 '24 17:07 xuanlinli17

@hilookas Thanks for providing the results! It seems like the open_drawer_variant_agg version is missing, as "horizontal sim variant avg success {'openvla': nan}". Is it possible for you to also do that evaluation and edit your results list?

Toradus avatar Aug 03 '24 12:08 Toradus

@hilookas Thank you for your great work! I want to try out OpenVLA in the simulator as well. But I wonder why the performance in sim is not as good as what the paper claims on the real-world benchmark. It should not be due to the sim-to-real gap, right? Because SIMPLER is designed to mitigate this gap.

QuanyiLi avatar Aug 03 '24 13:08 QuanyiLi

@QuanyiLi OpenVLA did 5 trials in real for each task (and there are no grid-based evals with >= 50 trials per task for Google Robot like in Simpler). Task settings like the cabinets and backgrounds used in the real world can also be different. We are requesting a paired sim-real evaluation from Google following Simpler's protocol (and the same backgrounds, cabinets, etc.). image

xuanlinli17 avatar Aug 03 '24 15:08 xuanlinli17

@QuanyiLi OpenVLA did 5 trials in real for each task (and there are no grid-based evals with >= 50 trials per task for Google Robot like in Simpler). Task settings like the cabinets and backgrounds used in the real world can also be different. We are requesting a paired sim-real evaluation from Google following Simpler's protocol (and the same backgrounds, cabinets, etc.). image

Thanks. Look forward to the updated results!

QuanyiLi avatar Aug 03 '24 19:08 QuanyiLi

@hilookas Thanks for providing the results! It seems like the open_drawer_variant_agg version is missing, as "horizontal sim variant avg success {'openvla': nan}". Is it possible for you to also do that evaluation and edit your results list?

Sure! I have updated the results. Please see the log above.

My result is slightly different, but not by much, from xuanlinli17's run in:

For Google Robot pick coke can, looks like the variant aggregation eval of OpenVLA is a lot better than visual matching, which is interesting...

horizontal sim variant avg success {'openvla-7b': 0.6444444444444444}
vertical sim variant avg success {'openvla-7b': 0.2177777777777778}
standing sim variant avg success {'openvla-7b': 0.7288888888888888}
avg_orientation_sim_variant_results [ 0.5303703703703704]

hilookas avatar Aug 04 '24 08:08 hilookas

Here is updated table:

image

image

hilookas avatar Aug 04 '24 08:08 hilookas

How much memory is needed to run OpenVLA? I tried 3090 and 40GB A100 but both go out of memory.

yxchng avatar Aug 05 '24 14:08 yxchng

@yxchng It takes 15 GB of VRAM on my 4090, following the official instructions to use bf16.

QuanyiLi avatar Aug 05 '24 14:08 QuanyiLi

How much memory is needed to run OpenVLA? I tried 3090 and 40GB A100 but both go out of memory.

1x 3090 is enough. Just remember not to open more than one env and inference process.

hilookas avatar Aug 07 '24 03:08 hilookas

Here is updated table:

image

image

@hilookas Would like to know where the tables come from? I did not see them in the OpenVLA paper

SiyuanHuang95 avatar Aug 08 '24 04:08 SiyuanHuang95

Here is updated table: image image

@hilookas Would like to know where the tables come from? I did not see them in the OpenVLA paper

I made it :D

Based on my experiment above.

If you have another run result, please let me know!

hilookas avatar Aug 08 '24 07:08 hilookas

Here is updated table: image image

@hilookas Would like to know where the tables come from? I did not see them in the OpenVLA paper

I made it :D

Based on my experiment above.

If you have another run result, please let me know!

I do not have enough GPU resources attached to a screen, which the SAPIEN simulator requires. :-( Running OpenVLA locally is quite a burden for most consumer-level PCs.

SiyuanHuang95 avatar Aug 09 '24 03:08 SiyuanHuang95

How should I run the OpenVLA program? I have updated the relevant files, like the files in scripts and the code in main_inference.py, but I don't know exactly how to run them. Is there anything I should do in advance, like downloading files into my main folder SimplerEnv-main? I tried the commands in openvla_bridge.sh to run main_inference.py, but the run failed with the error: requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(20323803 bytes read, 6781837517 more expected)', IncompleteRead(20323803 bytes read, 6781837517 more expected)). Looking forward to anyone's answers.

taurusleo avatar Aug 10 '24 13:08 taurusleo

^ It seems that the Hugging Face checkpoint download was unexpectedly interrupted. You can clear the HF cache.
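
For example, assuming the default Hugging Face cache layout, removing the partially downloaded checkpoint and re-running the script restarts the download:

rm -rf ~/.cache/huggingface/hub/models--openvla--openvla-7b
bash scripts/openvla_bridge.sh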

xuanlinli17 avatar Aug 11 '24 04:08 xuanlinli17