[pi0] confusion about the state embedding dimension in `embed_suffix`

Open IrvingF7 opened this issue 10 months ago • 3 comments

System Info

- `lerobot` version: 0.1.0
- Platform: Linux-5.14.0-284.86.1.el9_2.x86_64-x86_64-with-glibc2.35
- Python version: 3.11.11
- Huggingface_hub version: 0.28.1
- Dataset version: 3.2.0
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.6.0+cu124 (True)
- Cuda version: 12040
- Using GPU in script?: Yes

Information

  • [x] One of the scripts in the examples/ folder of LeRobot
  • [ ] My own task or dataset (give details below)

Reproduction

In the model definition in modeling_pi0.py, at line 567, we see the following:

# Embed state
state_emb = self.state_proj(state)              # project the state to the embedding size
state_emb = state_emb.to(dtype=torch.bfloat16)
embs.append(state_emb[:, None, :])              # insert an extra axis at dim 1 before concatenation
bsize = state_emb.shape[0]
dtype = state_emb.dtype
device = state_emb.device

We see that an extra axis is inserted at dimension 1 of the state embedding via state_emb[:, None, :].

The problem is that models like pi0 typically train on datasets that use n_obs_steps, which is also the default for LeRobot's own datasets. For example, if I use the pusht dataset as specified in this LeRobot example script, the batch shapes look something like this:

image shape torch.Size([64, 2, 3, 96, 96])
state shape torch.Size([64, 2, 2])
action shape torch.Size([64, 16, 2])

The first 2 in the image and state shapes comes from the dataset returning two past observation frames per sample; the 16 in the action shape comes from the diffusion policy's action horizon of 16 future frames.

Now, if we train on this dataset or any similar one, we get a dimension mismatch in embed_suffix, because the already 3-D state embedding gets bumped to 4-D, and we end up with something like

RuntimeError: Tensors must have same number of dimensions: got 4 and 3
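
Here is a minimal sketch, not LeRobot's actual code, that reproduces the same failure; the sizes are made up to mirror the pusht batch above, and the 3-D tensor stands in for the other suffix embeddings:

import torch

bsize, n_obs_steps, state_dim, emb_dim = 64, 2, 2, 1024  # hypothetical sizes mirroring the pusht batch
state = torch.randn(bsize, n_obs_steps, state_dim)        # [64, 2, 2], as the dataset delivers it
state_proj = torch.nn.Linear(state_dim, emb_dim)

state_emb = state_proj(state)                 # [64, 2, 1024] -- already 3-D because of n_obs_steps
state_emb = state_emb[:, None, :]             # [64, 1, 2, 1024] -- the extra axis makes it 4-D
other_emb = torch.randn(bsize, 50, emb_dim)   # stand-in for the action/time suffix tokens, 3-D
torch.cat([state_emb, other_emb], dim=1)      # RuntimeError: Tensors must have same number of dimensions: got 4 and 3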

For pi0 it's more or less okay, because the default n_obs_steps is usually 1, so you can squeeze out the 1st dimension of the state, but the current approach doesn't seem very extensible, and it is also not consistent with LeRobot's usual dataset format.
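
One possible workaround, sketched below under the assumption that the batch uses the [B, n_obs_steps, ...] layout shown above (the observation.state / observation.image key names follow the pusht example and may differ for other datasets), is to collapse the history axis before the policy sees the batch, e.g. by keeping only the most recent observation step:

def drop_obs_history(batch: dict) -> dict:
    # Keep only the latest observation step so pi0 receives [B, state_dim] and [B, C, H, W].
    state = batch["observation.state"]
    if state.dim() == 3:                            # [B, n_obs_steps, state_dim]
        batch["observation.state"] = state[:, -1]   # -> [B, state_dim]
    image = batch.get("observation.image")
    if image is not None and image.dim() == 5:      # [B, n_obs_steps, C, H, W]
        batch["observation.image"] = image[:, -1]   # -> [B, C, H, W]
    return batch

# usage: call drop_obs_history(batch) right before handing the batch to the policy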

Expected behavior

I would like to hear the reasoning behind this design choice, so I can tell whether I am misunderstanding something.

Thank you very much in advance!

IrvingF7 avatar Feb 19 '25 03:02 IrvingF7

Same problem here. Have you gained any insight from debugging so far?

LumenYoung avatar Mar 24 '25 14:03 LumenYoung

Update on this problem. I think it mainly comes from the dataset creation process, which adds an extra dimension to the state. During the add_frame() function used for dataset creation, an extra list is wrapped around the observation state, which is then stacked, so the state gains a skeleton dimension and becomes [B, 1, state_dim].

pi0, meanwhile, assumes the input state is [B, state_dim] and creates this skeleton dimension again, leading to RuntimeError: Tensors must have same number of dimensions: got 4 and 3.
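
A toy illustration with plain tensors (not the real add_frame() or embed_suffix code) of how the two places each add an axis:

import torch

state_dim = 2
frame_state = torch.randn(state_dim)                  # one frame's state: [state_dim]
stored = torch.stack([frame_state])                   # list wrapping + stacking at dataset creation: [1, state_dim]
batch = torch.stack([stored for _ in range(64)])      # after batching: [64, 1, state_dim]
in_model = batch[:, None, :]                          # pi0's own expansion: [64, 1, 1, state_dim], now 4-D
print(in_model.shape)                                 # torch.Size([64, 1, 1, 2])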

LumenYoung avatar Mar 24 '25 15:03 LumenYoung

This is caused by the data format: pi0 does not read the state correctly, so it falls back to default initialization. Your dataset may store the key as states, while pi0 reads state.

caijianwei1996 avatar Apr 08 '25 09:04 caijianwei1996

This issue has been automatically marked as stale because it has not had recent activity (6 months). It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Oct 06 '25 02:10 github-actions[bot]

This issue was closed because it has been stalled for 14 days with no activity. Feel free to reopen if it is still relevant, or to ping a collaborator if you have any questions.

github-actions[bot] avatar Oct 20 '25 02:10 github-actions[bot]