SimplerEnv
OpenVLA policy integration
This pull request integrates the OpenVLA policy. The evaluation scripts remain consistent with the original repo under ./scripts/.
Thanks for the contribution! Could you post the success rates for each task?
Could you post a source for the implementation of horizon, pred_action_horizon, exec_horizon, action ensemble, and sticky gripper, and do they match the real eval?
Code comments:
- Remove " + OpenVLA policy" in readme
- Rename OpenVALInference -> OpenVLAInference
- `self.sticky_action_is_on is False` in L147 of openvla_model.py: change to `(not self.sticky_action_is_on)` (see the note below)
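As a side note on that last point (a hypothetical snippet, not part of the PR): the `not` form is also more robust than the identity check `is False`, because it handles any falsy value, e.g. a NumPy boolean:

```python
import numpy as np

flag = np.bool_(False)   # e.g. if the flag were ever assigned from a NumPy comparison
print(flag is False)     # False: np.bool_(False) is not the same object as Python's False
print(not flag)          # True: `not` tests truthiness, which is the intended behavior
```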
Hi xuanlinli17, I have corrected all the typos. The implementation of horizon, pred_action_horizon, exec_horizon, action ensemble, etc., is migrated from Octo. The evaluation results can be downloaded at
report_openvla.log
I see. Might want to get help from the official authors to validate / revise the implementation as the Bridge results are near zero for some reason, and pick coke can has large variance across different backgrounds. Additionally, it's possible that OpenVLA might not follow Octo implementation in real deployment.
Also you can modify https://github.com/simpler-env/SimplerEnv/blob/main/tools/calc_metrics_evaluation_videos.py to quickly summarize the results for OpenVLA (just put dummy numbers for the real numbers and don't push the script). You can ignore the nans.
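For anyone adapting the script, here is a rough sketch of the idea only (hypothetical task names and numbers, not the actual structure of calc_metrics_evaluation_videos.py): keep dummy zeros in place of the real-robot numbers and use NaN-aware averaging so unevaluated tasks don't break the summary.

```python
import numpy as np

# hypothetical per-task sim success rates for a local OpenVLA run; NaN = not evaluated yet
sim_success = {
    "pick_coke_can_horizontal": 0.27,
    "pick_coke_can_vertical": np.nan,
    "move_near": 0.4625,
}
# dummy placeholders standing in for real-robot results (don't push these)
real_success = {task: 0.0 for task in sim_success}

print("avg sim success (NaNs ignored):", np.nanmean(list(sim_success.values())))
```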
I'm also working on implementing OpenVLA into SimplerEnv, and I had the same issue: OpenVLA fails drastically on Bridge. I wonder if that has anything to do with the controller mentioned in #11
Same here; severe lack of performance of OpenVLA on the WidowX robot.
I checked with the authors and I don't think there is action ensembling or action history. Here is the updated code, which you can try:
from typing import Optional, Sequence
import os

import matplotlib.pyplot as plt
import numpy as np
from transforms3d.euler import euler2axangle
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch
import cv2 as cv


class OpenVLAInference:
    def __init__(
        self,
        saved_model_path: str = "openvla/openvla-7b",
        unnorm_key: Optional[str] = None,
        policy_setup: str = "widowx_bridge",
        horizon: int = 1,
        pred_action_horizon: int = 1,
        exec_horizon: int = 1,
        image_size: list[int] = [224, 224],
        action_scale: float = 1.0,
    ) -> None:
        os.environ["TOKENIZERS_PARALLELISM"] = "false"
        if policy_setup == "widowx_bridge":
            unnorm_key = "bridge_orig" if unnorm_key is None else unnorm_key
            self.sticky_gripper_num_repeat = 1
        elif policy_setup == "google_robot":
            unnorm_key = "fractal20220817_data" if unnorm_key is None else unnorm_key
            self.sticky_gripper_num_repeat = 15
        else:
            raise NotImplementedError(
                f"Policy setup {policy_setup} not supported for OpenVLA models. The other datasets can be found in the huggingface config.json file."
            )
        self.policy_setup = policy_setup
        self.unnorm_key = unnorm_key

        print(f"*** policy_setup: {policy_setup}, unnorm_key: {unnorm_key} ***")
        self.processor = AutoProcessor.from_pretrained(saved_model_path, trust_remote_code=True)
        self.vla = AutoModelForVision2Seq.from_pretrained(
            saved_model_path,  # load the configured checkpoint instead of a hardcoded path
            attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
            torch_dtype=torch.bfloat16,
            low_cpu_mem_usage=True,
            trust_remote_code=True,
        ).cuda()

        self.image_size = image_size
        self.action_scale = action_scale
        self.horizon = horizon
        self.pred_action_horizon = pred_action_horizon
        self.exec_horizon = exec_horizon

        self.sticky_action_is_on = False
        self.gripper_action_repeat = 0
        self.sticky_gripper_action = 0.0
        self.previous_gripper_action = None

        self.task = None
        self.task_description = None
        self.num_image_history = 0

    def reset(self, task_description: str) -> None:
        self.task_description = task_description
        self.num_image_history = 0

        self.sticky_action_is_on = False
        self.gripper_action_repeat = 0
        self.sticky_gripper_action = 0.0
        self.previous_gripper_action = None

    def step(
        self, image: np.ndarray, task_description: Optional[str] = None, *args, **kwargs
    ) -> tuple[dict[str, np.ndarray], dict[str, np.ndarray]]:
        """
        Input:
            image: np.ndarray of shape (H, W, 3), uint8
            task_description: Optional[str], task description; if different from previous task description, policy state is reset
        Output:
            raw_action: dict; raw policy action output
            action: dict; processed action to be sent to the maniskill2 environment, with the following keys:
                - 'world_vector': np.ndarray of shape (3,), xyz translation of robot end-effector
                - 'rot_axangle': np.ndarray of shape (3,), axis-angle representation of end-effector rotation
                - 'gripper': np.ndarray of shape (1,), gripper action
                - 'terminate_episode': np.ndarray of shape (1,), 1 if episode should be terminated, 0 otherwise
        """
        if task_description is not None:
            if task_description != self.task_description:
                self.reset(task_description)

        assert image.dtype == np.uint8
        image = self._resize_image(image)
        image: Image.Image = Image.fromarray(image)
        prompt = task_description

        # predict action (7-dof; un-normalize for bridgev2)
        inputs = self.processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
        raw_actions = self.vla.predict_action(**inputs, unnorm_key=self.unnorm_key, do_sample=False)[None]
        # print(f"*** raw actions {raw_actions} ***")

        raw_action = {
            "world_vector": np.array(raw_actions[0, :3]),
            "rotation_delta": np.array(raw_actions[0, 3:6]),
            "open_gripper": np.array(raw_actions[0, 6:7]),  # range [0, 1]; 1 = open; 0 = close
        }

        # process raw_action to obtain the action to be sent to the maniskill2 environment
        action = {}
        action["world_vector"] = raw_action["world_vector"] * self.action_scale
        action_rotation_delta = np.asarray(raw_action["rotation_delta"], dtype=np.float64)
        roll, pitch, yaw = action_rotation_delta
        action_rotation_ax, action_rotation_angle = euler2axangle(roll, pitch, yaw)
        action_rotation_axangle = action_rotation_ax * action_rotation_angle
        action["rot_axangle"] = action_rotation_axangle * self.action_scale

        if self.policy_setup == "google_robot":
            # Google robot: convert absolute gripper openness into a relative action, and make a
            # detected open/close switch "sticky" by repeating it for sticky_gripper_num_repeat steps
            current_gripper_action = raw_action["open_gripper"]
            if self.previous_gripper_action is None:
                relative_gripper_action = np.array([0])
            else:
                relative_gripper_action = self.previous_gripper_action - current_gripper_action
            self.previous_gripper_action = current_gripper_action

            if np.abs(relative_gripper_action) > 0.5 and (not self.sticky_action_is_on):
                self.sticky_action_is_on = True
                self.sticky_gripper_action = relative_gripper_action

            if self.sticky_action_is_on:
                self.gripper_action_repeat += 1
                relative_gripper_action = self.sticky_gripper_action

                if self.gripper_action_repeat == self.sticky_gripper_num_repeat:
                    self.sticky_action_is_on = False
                    self.gripper_action_repeat = 0
                    self.sticky_gripper_action = 0.0

            action["gripper"] = relative_gripper_action
        elif self.policy_setup == "widowx_bridge":
            # WidowX/Bridge: binarize gripper openness into {-1 (close), +1 (open)}
            action["gripper"] = 2.0 * (raw_action["open_gripper"] > 0.5) - 1.0

        action["terminate_episode"] = np.array([0.0])
        return raw_action, action

    def _resize_image(self, image: np.ndarray) -> np.ndarray:
        image = cv.resize(image, tuple(self.image_size), interpolation=cv.INTER_AREA)
        return image

    def visualize_epoch(
        self, predicted_raw_actions: Sequence[np.ndarray], images: Sequence[np.ndarray], save_path: str
    ) -> None:
        images = [self._resize_image(image) for image in images]
        ACTION_DIM_LABELS = ["x", "y", "z", "roll", "pitch", "yaw", "grasp"]

        img_strip = np.concatenate(np.array(images[::3]), axis=1)

        # set up plt figure
        figure_layout = [["image"] * len(ACTION_DIM_LABELS), ACTION_DIM_LABELS]
        plt.rcParams.update({"font.size": 12})
        fig, axs = plt.subplot_mosaic(figure_layout)
        fig.set_size_inches([45, 10])

        # plot actions
        pred_actions = np.array(
            [
                np.concatenate([a["world_vector"], a["rotation_delta"], a["open_gripper"]], axis=-1)
                for a in predicted_raw_actions
            ]
        )

        for action_dim, action_label in enumerate(ACTION_DIM_LABELS):
            # actions have batch, horizon, dim, in this example we just take the first action for simplicity
            axs[action_label].plot(pred_actions[:, action_dim], label="predicted action")
            axs[action_label].set_title(action_label)
            axs[action_label].set_xlabel("Time in one episode")

        axs["image"].imshow(img_strip)
        axs["image"].set_xlabel("Time in one episode (subsampled)")
        plt.legend()
        plt.savefig(save_path)
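For reference, a minimal rollout sketch showing how this class could be driven, assuming the gym-style API from the SimplerEnv README (simpler_env.make, env.get_language_instruction, get_image_from_maniskill2_obs_dict) and the usual concatenation of world_vector, rot_axangle, and gripper into the environment action; the scripts under ./scripts/ drive the same loop through main_inference.py.

```python
import numpy as np
import simpler_env
from simpler_env.utils.env.observation_utils import get_image_from_maniskill2_obs_dict

env = simpler_env.make("widowx_carrot_on_plate")        # any SimplerEnv task name
model = OpenVLAInference(policy_setup="widowx_bridge")  # the class defined above

obs, reset_info = env.reset()
instruction = env.get_language_instruction()
model.reset(instruction)

done, truncated = False, False
while not (done or truncated):
    image = get_image_from_maniskill2_obs_dict(env, obs)  # (H, W, 3) uint8
    raw_action, action = model.step(image, instruction)
    # pack the processed action into the 7-dim control expected by the ManiSkill2 env
    env_action = np.concatenate([action["world_vector"], action["rot_axangle"], action["gripper"]])
    obs, reward, done, truncated, info = env.step(env_action)

print("episode success:", info.get("success", False))  # 'success' key assumed from the info dict
```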
OpenVLA setup requirements:
pip install torch==2.3.1 torchvision==0.18.1 timm==0.9.10 tokenizers==0.15.2 accelerate==0.32.1
pip install flash-attn==2.6.1 --no-build-isolation
Please add these instructions to the README under a new "OpenVLA Inference Setup" section.
Typos in scripts/openvla_put_in_drawer_variant_agg.sh and scripts/openvla_put_in_drawer_visual_matching.sh: should be "declare -a ckpt_paths=("openvla/openvla-7b")"
Hello @xuanlinli17, I tried the code above but openvla still fails on widowx tasks. Is it possibly an implementation problem? Should I set any params for widowx?
Yeah, that's my finding too, but I don't think the authors did any special treatment to evaluate OpenVLA on Bridge.
There might be some coordinate transform on Bridge that is different.
@xuanlinli17 Thank you! I will continue to look into this. If I find any additional information or solutions, I'll make sure to share it with you.
When I run bash scripts/openvla_bridge.sh in ./SimplerEnv-OpenVLA, an error appears:
from simpler_env.policies.openvla.openvla_model import OpenVLAInference
ModuleNotFoundError: No module named 'simpler_env.policies.openvla'
How to fix it?
^ Run pip install -e . in the SimplerEnv-OpenVLA repo. Also, I don't think the authors have updated the scripts yet; I'll try to push some updates.
I have made a run; here is the full result of OpenVLA:
I use my branch as codebase: https://github.com/hilookas/SimplerEnv
(note: the "real" result is set to 0)
(openvla) ubuntu@soda:~/SimplerEnv$ python tools/calc_metrics_evaluation_videos.py --task=pick_coke_can
***Pick coke can results***
--------------------
horizontal sim variant avg success {'openvla': 0.711111111111111}
horizontal real success {'openvla': 0}
horizontal MMRV 0.0
horizontal pearson correlation 1
vertical sim variant avg success {'openvla': 0.27111111111111114}
vertical real success {'openvla': 0}
vertical MMRV 0.0
vertical pearson correlation 1
standing sim variant avg success {'openvla': 0.6533333333333333}
standing real success {'openvla': 0}
standing MMRV 0.0
standing pearson correlation 1
avg_orientation_sim_variant_results [0.5451851851851851]
avg_orientation_real_results [0.0]
mean_maximum_rank_violation(avg_orientation_sim_variant_results, avg_orientation_real_results) 0.0
pearson_correlation(avg_orientation_sim_variant_results, avg_orientation_real_results) 1
--------------------
Orientation horizontal, ckpt openvla all robot arm visual matching success: [0.28, 0.2, 0.32, 0.28]
Orientation vertical, ckpt openvla all robot arm visual matching success: [0.0, 0.08, 0.04, 0.0]
Orientation standing, ckpt openvla all robot arm visual matching success: [0.2, 0.08, 0.36, 0.12]
horizontal visual matching sim success {'openvla': 0.27}
horizontal real success {'openvla': 0}
horizontal MMRV 0.0
horizontal pearson correlation 1
horizontal kruskal:
each checkpoint kruskal:
KruskalResult(statistic=7.976744186046492, pvalue=0.004738208143666835)
vertical visual matching sim success {'openvla': 0.03}
vertical real success {'openvla': 0}
vertical MMRV 0.0
vertical pearson correlation 1
vertical kruskal:
each checkpoint kruskal:
KruskalResult(statistic=0.9999999999995735, pvalue=0.3173105078630144)
standing visual matching sim success {'openvla': 0.19}
standing real success {'openvla': 0}
standing MMRV 0.0
standing pearson correlation 1
standing kruskal:
each checkpoint kruskal:
KruskalResult(statistic=5.444444444444439, pvalue=0.019630657257290737)
avg_orientation_sim_visual_matching_results [0.16333333333333336]
avg_orientation_real_results [0.0]
mean_maximum_rank_violation(avg_orientation_sim_visual_matching_results, avg_orientation_real_results) 0.0
pearson_correlation(avg_orientation_sim_visual_matching_results, avg_orientation_real_results) 1
avg kruskal:
each checkpoint kruskal:
KruskalResult(statistic=12.956521739130327, pvalue=0.0003188089440220541)
********************
(openvla) ubuntu@soda:~/SimplerEnv$ python tools/calc_metrics_evaluation_videos.py --task=move_near
***Move Near results***
--------------------
sim variant avg success {'openvla': 0.4770833333333333}
real success {'openvla': 0}
MMRV 0.0
pearson correlation 1
--------------------
Ckpt openvla all robot arm visual matching success: [0.5333333333333333, 0.43333333333333335, 0.4, 0.48333333333333334]
sim visual matching success {'openvla': 0.4625}
real success {'openvla': 0}
visual matching MMRV 0.0
visual matching pearson correlation 1
avg kruskal:
each checkpoint kruskal:
KruskalResult(statistic=36.21739130434784, pvalue=1.7648851059881935e-09)
********************
(openvla) ubuntu@soda:~/SimplerEnv$ python tools/calc_metrics_evaluation_videos.py --task=drawer
***Drawer results***
--------------------
open sim variant avg success {'openvla': 0.15873015873015872}
open real success {'openvla': 0}
open MMRV 0.0
open pearson correlation 1
close sim variant avg success {'openvla': 0.19576719576719576}
close real success {'openvla': 0}
close MMRV 0.0
close pearson correlation 1
avg_sim_variant_results [0.17724867724867724]
avg_real_results [0.0]
mean_maximum_rank_violation(avg_sim_variant_results, avg_real_results) 0.0
pearson_correlation(avg_sim_variant_results, avg_real_results) 1
--------------------
Drawer task open, ckpt openvla all robot arm visual matching success: [0.2222222222222222, 0.2222222222222222, 0.1388888888888889, 0.2222222222222222]
Drawer task close, ckpt openvla all robot arm visual matching success: [0.5555555555555556, 0.5, 0.5, 0.5185185185185186]
open visual matching sim success {'openvla': 0.19444444444444442}
open real success {'openvla': 0}
open MMRV 0.0
open pearson correlation 1
open kruskal:
each checkpoint kruskal:
KruskalResult(statistic=5.408163265306165, pvalue=0.02004279392017093)
close visual matching sim success {'openvla': 0.5185185185185185}
close real success {'openvla': 0}
close MMRV 0.0
close pearson correlation 1
close kruskal:
each checkpoint kruskal:
KruskalResult(statistic=18.549999999999994, pvalue=1.655052286653112e-05)
avg_sim_visual_matching_results [0.35648148148148145]
avg_real_results [0.0]
mean_maximum_rank_violation(avg_sim_visual_matching_results, avg_real_results) 0.0
pearson_correlation(avg_sim_visual_matching_results, avg_real_results) 1
avg kruskal:
each checkpoint kruskal:
KruskalResult(statistic=22.842696629213542, pvalue=1.7581614124344239e-06)
********************
(openvla) ubuntu@soda:~/SimplerEnv$ python tools/calc_metrics_evaluation_videos.py --task=bridge_put_on
***Bridge Put On Env results***
********** Results for put_spoon_on_tablecloth **********
sim visual matching partial success {'openvla': 0.041666666666666664}
real partial success {'openvla': 0}
visual matching MMRV (partial success) 0.0
visual matching pearson correlation (partial success) 1
avg kruskal (partial success):
each checkpoint kruskal:
KruskalResult(statistic=0.9999999999997733, pvalue=0.3173105078629661)
sim visual matching success {'openvla': 0.0}
real success {'openvla': 0}
visual matching MMRV 0.0
visual matching pearson correlation 1
avg kruskal:
each checkpoint kruskal:
all same, 1.0
********************
********** Results for put_carrot_on_plate **********
sim visual matching partial success {'openvla': 0.3333333333333333}
real partial success {'openvla': 0}
visual matching MMRV (partial success) 0.0
visual matching pearson correlation (partial success) 1
avg kruskal (partial success):
each checkpoint kruskal:
KruskalResult(statistic=9.399999999999977, pvalue=0.002169854427130147)
sim visual matching success {'openvla': 0.0}
real success {'openvla': 0}
visual matching MMRV 0.0
visual matching pearson correlation 1
avg kruskal:
each checkpoint kruskal:
all same, 1.0
********************
********** Results for stack_green_block_on_yellow_block **********
sim visual matching partial success {'openvla': 0.125}
real partial success {'openvla': 0}
visual matching MMRV (partial success) 0.0
visual matching pearson correlation (partial success) 1
avg kruskal (partial success):
each checkpoint kruskal:
KruskalResult(statistic=3.133333333333267, pvalue=0.07670675171090896)
sim visual matching success {'openvla': 0.0}
real success {'openvla': 0}
visual matching MMRV 0.0
visual matching pearson correlation 1
avg kruskal:
each checkpoint kruskal:
all same, 1.0
********************
********** Results for put_eggplant_in_basket **********
sim visual matching partial success {'openvla': 0.08333333333333333}
real partial success {'openvla': 0}
visual matching MMRV (partial success) 0.0
visual matching pearson correlation (partial success) 1
avg kruskal (partial success):
each checkpoint kruskal:
KruskalResult(statistic=2.0434782608695747, pvalue=0.15285977016579425)
sim visual matching success {'openvla': 0.041666666666666664}
real success {'openvla': 0}
visual matching MMRV 0.0
visual matching pearson correlation 1
avg kruskal:
each checkpoint kruskal:
KruskalResult(statistic=0.9999999999997733, pvalue=0.3173105078629661)
********************
For Google Robot pick coke can, looks like the variant aggregation eval of OpenVLA is a lot better than visual matching, which is interesting...
horizontal sim variant avg success {'openvla-7b': 0.6444444444444444}
vertical sim variant avg success {'openvla-7b': 0.2177777777777778}
standing sim variant avg success {'openvla-7b': 0.7288888888888888}
avg_orientation_sim_variant_results [ 0.5303703703703704]
@hilookas Thanks for providing the results! It seems like the open_drawer_variant_agg version is missing, as "horizontal sim variant avg success {'openvla': nan}". Is it possible for you to also do that evaluation and edit your results list?
@hilookas Thank you for your great work! I want to try out OpenVLA in the simulator as well. But I wonder why the performance in sim is not as good as what the paper claims on the real-world benchmark. It should not be due to the sim-to-real gap, right? Because SIMPLER is designed to mitigate this gap.
@QuanyiLi OpenVLA did 5 trials in real for each task (and there's no grid-based evals with >= 50 trials per task for Google Robot like in Simpler). Task settings like the cabinets and backgrounds used in the real world can also be different. We are requesting paired sim-real evaluation from Google following Simpler's protocol (and the same backgrounds, cabinets, etc).
Thanks. Look forward to the updated results!
@hilookas Thanks for providing the results! It seems like the open_drawer_variant_agg version is missing, as "horizontal sim variant avg success {'openvla': nan}". Is it possible for you to also do that evaluation and edit your results list?
Sure! I have updated the result. Please see the log above.
My result is slightly different, but not by much, from xuanlinli17's run in:
For Google Robot pick coke can, looks like the variant aggregation eval of OpenVLA is a lot better than visual matching, which is interesting...
horizontal sim variant avg success {'openvla-7b': 0.6444444444444444}
vertical sim variant avg success {'openvla-7b': 0.2177777777777778}
standing sim variant avg success {'openvla-7b': 0.7288888888888888}
avg_orientation_sim_variant_results [0.5303703703703704]
Here is updated table:
How much memory is needed to run OpenVLA? I tried 3090 and 40GB A100 but both go out of memory.
@yxchng It takes about 15 GB of VRAM on a 4090, following the official instructions to use bf16.
How much memory is needed to run OpenVLA? I tried 3090 and 40GB A100 but both go out of memory.
A single 3090 is enough. Just remember not to run more than one env and inference process at a time.
Here is updated table:
@hilookas Would like to know where the tables come from? I did not see them in the OpenVLA paper
Here is updated table:
@hilookas Would like to know where the tables come from? I did not see them in the OpenVLA paper
I made it :D
Based on my experiment above.
If you have another run result, please let me know!
I do not have enough GPU resources attached to a screen, which the SAPIEN simulator requires. :-( Running OpenVLA locally is quite a burden for most consumer-level PCs.
How should I run the OpenVLA program? I have updated the relevant files (the scripts and the code in main_inference.py), but I don't know exactly how to run them. Is there anything I should do in advance, like downloading files into my main SimplerEnv-main folder? I tried the commands in openvla_bridge.sh to run main_inference.py, but the run failed with the error: requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(20323803 bytes read, 6781837517 more expected)', IncompleteRead(20323803 bytes read, 6781837517 more expected)). Looking forward to anyone's answers.
^ Seems that the Hugging Face checkpoint download was unexpectedly interrupted. You can clear the HF cache and retry.
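If it helps, a small sketch for clearing just the broken snapshot (assuming the default Hugging Face cache location; adjust if HF_HOME or HF_HUB_CACHE is set), after which from_pretrained will re-download the checkpoint:

```python
import os
import shutil

# default HF hub cache; model folders follow the models--{org}--{name} naming convention
cache_dir = os.path.expanduser("~/.cache/huggingface/hub")
model_cache = os.path.join(cache_dir, "models--openvla--openvla-7b")

if os.path.isdir(model_cache):
    shutil.rmtree(model_cache)
    print(f"Removed {model_cache}; it will be re-downloaded on the next run.")
```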


