[pi0] confusion about the state embedding dimension in `embed_suffix`
System Info
- `lerobot` version: 0.1.0
- Platform: Linux-5.14.0-284.86.1.el9_2.x86_64-x86_64-with-glibc2.35
- Python version: 3.11.11
- Huggingface_hub version: 0.28.1
- Dataset version: 3.2.0
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.6.0+cu124 (True)
- Cuda version: 12040
- Using GPU in script?: Yes
Information
- [x] One of the scripts in the examples/ folder of LeRobot
- [ ] My own task or dataset (give details below)
Reproduction
In the model definition of `modeling_pi0.py`, line 567, we see:

```python
# Embed state
state_emb = self.state_proj(state)
state_emb = state_emb.to(dtype=torch.bfloat16)
embs.append(state_emb[:, None, :])
bsize = state_emb.shape[0]
dtype = state_emb.dtype
device = state_emb.device
```
So the state embedding gets an extra singleton dimension at index 1 via `state_emb[:, None, :]`, turning the `[B, emb_dim]` projection output into `[B, 1, emb_dim]`.
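A minimal sketch of what that indexing does to the shapes (sizes are made up, purely for illustration):

```python
import torch

# Hypothetical sizes, just to illustrate the indexing.
bsize, state_dim, emb_dim = 64, 2, 1024
state = torch.randn(bsize, state_dim)       # the [B, state_dim] input pi0 expects
state_proj = torch.nn.Linear(state_dim, emb_dim)
state_emb = state_proj(state)               # [B, emb_dim]
print(state_emb[:, None, :].shape)          # torch.Size([64, 1, 1024]): one suffix token
```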
The problem is that datasets commonly carry an `n_obs_steps` dimension, which is also the default behavior of LeRobot's own datasets. For example, if I use the pusht dataset as specified in this LeRobot example script, the batch shapes look like this:
```
image shape  torch.Size([64, 2, 3, 96, 96])
state shape  torch.Size([64, 2, 2])
action shape torch.Size([64, 16, 2])
```
The 2 in the image and state shapes comes from the dataset returning two past frames per batch element; the 16 in the action shape comes from diffusion policy's action horizon of 16 future steps.
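For context, these shapes come from the `delta_timestamps` mechanism. A sketch along the lines of the example script, with keys and fps values assumed for pusht (which runs at 10 fps):

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Assumed configuration: two past observation frames and a 16-step
# action horizon, sampled at pusht's 10 fps.
delta_timestamps = {
    "observation.image": [-0.1, 0.0],       # 2 frames -> [B, 2, 3, 96, 96]
    "observation.state": [-0.1, 0.0],       # 2 frames -> [B, 2, 2]
    "action": [t / 10 for t in range(16)],  # 16 steps -> [B, 16, 2]
}
dataset = LeRobotDataset("lerobot/pusht", delta_timestamps=delta_timestamps)
```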
Now, if we train on a dataset like this (or any similar one), `embed_suffix` hits a dimension mismatch: unsqueezing the already 3-D state embedding produces a 4-D tensor, and you get something like

```
RuntimeError: Tensors must have same number of dimensions: got 4 and 3
```
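The failure is easy to reproduce in isolation: with `n_obs_steps > 1` the projected state embedding is already 3-D, so the extra `None` index makes it 4-D while the other entries in `embs` stay 3-D. A minimal sketch with made-up sizes:

```python
import torch

state_emb = torch.randn(64, 2, 1024)    # [B, n_obs_steps, emb]: already 3-D
action_emb = torch.randn(64, 50, 1024)  # [B, horizon, emb]: other suffix tokens (3-D)
torch.cat([state_emb[:, None, :], action_emb], dim=1)
# RuntimeError: Tensors must have same number of dimensions: got 4 and 3
```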
For pi0 it's more or less workable, because the default `n_obs_steps` is usually 1, so you can squeeze the temporal dimension out of the state; but the current approach doesn't seem very extensible, and it is also inconsistent with LeRobot's usual dataset format.
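As a stopgap (not an official fix), one can drop the temporal dimension before the batch reaches pi0, e.g. by keeping only the most recent observation step:

```python
# Assuming batch["observation.state"] is [B, n_obs_steps, state_dim],
# keep only the latest step so pi0 sees the [B, state_dim] it expects.
batch["observation.state"] = batch["observation.state"][:, -1]
```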
Expected behavior
I would like to hear the reasoning behind this design choice, so I can tell whether I am misunderstanding something.
Thank you very much in advance!
Same problem here. Do you have any insights from debugging so far?
Update on this problem: I think it is mainly caused by the dataset creation process, which adds an extra dimension to the state. During dataset creation, `add_frame()` wraps the observation state in an additional list and later stacks it, so the state gains an extra singleton dimension and becomes `[B, 1, state_dim]`.
Meanwhile, pi0 assumes the input state is `[B, state_dim]` and adds this singleton dimension itself, leading to `RuntimeError: Tensors must have same number of dimensions: got 4 and 3`.
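A hypothetical mirror of that wrapping, to show where the singleton dimension comes from:

```python
import torch

frame_state = torch.tensor([0.5, -0.3])           # one frame's state, [state_dim]
stored = torch.stack([frame_state])               # wrapped in a list, then stacked -> [1, state_dim]
batch = torch.stack([stored for _ in range(64)])  # collated batch -> [64, 1, 2]
print(batch.shape)                                # torch.Size([64, 1, 2])
```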
This is caused by the data format: pi0 doesn't read the state correctly, so it falls back to default initialization. Your dataset may use `states` as the key, while pi0 reads `state`.