
smolvla training bugs on 3090

Open Loki-Lu opened this issue 6 months ago • 4 comments

System Info

GPU: RTX 3090

Information

  • [x] One of the scripts in the examples/ folder of LeRobot
  • [ ] My own task or dataset (give details below)

Reproduction

python lerobot/scripts/train.py \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=lerobot/svla_so100_stacking \
  --batch_size=8 \
  --steps=200000

Expected behavior

When I train SmolVLA on my 3090, it gets stuck like this:

INFO 2025-06-09 11:25:47 ts/train.py:117 Logs will be saved locally.
INFO 2025-06-09 11:25:47 ts/train.py:127 Creating dataset
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 56/56 [00:00<00:00, 555274.29it/s]
INFO 2025-06-09 11:25:49 ts/train.py:138 Creating policy
Loading  HuggingFaceTB/SmolVLM2-500M-Video-Instruct weights ...
INFO 2025-06-09 11:25:53 modeling.py:991 We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Reducing the number of VLM layers to 16 ...

As you can see, it trains for several steps and then fails, with no error in the command line (screenshot attached). I am confused by this failure. Has anyone solved this problem?
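A process that dies after a few steps with no traceback is often killed by the OS or the driver for running out of memory; that is only a guess here, not something confirmed in this thread. One way to check is to watch GPU memory in a second terminal while training runs. The polling script below is just an illustrative helper, not part of LeRobot:

# Hypothetical monitoring helper (not part of LeRobot): polls nvidia-smi every
# few seconds so you can see whether GPU memory keeps climbing until the
# training process dies, which would point to an out-of-memory kill.
import subprocess
import time

while True:
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    print(time.strftime("%H:%M:%S"), result.stdout.strip())
    time.sleep(5)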

Loki-Lu avatar Jun 09 '25 04:06 Loki-Lu

Hey @Loki-Lu 👋 I've personally never seen this behavior myself 🤔 Perhaps you could try running the following dummy command to see whether you can complete a training run at all. If that succeeds, the error is probably specific to your run, and we can troubleshoot from there.

python lerobot/scripts/train.py \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=lerobot/svla_so100_stacking \
  --batch_size=8 \
  --steps=10

fracapuano avatar Jun 11 '25 14:06 fracapuano

Hi @fracapuano, thank you for your reply! It works when I decrease the steps and use lerobot/svla_so100_stacking. However, it fails without any errors when I use my own dataset of 800 episodes. I guess it is a CUDA problem or a dataset problem. I also tested this on a 4090 and had the same problem, so I am not sure what the exact issue is. BTW, these 800 episodes work with ACT, Diffusion Policy, pi0, and pi0-fast, which makes me confused.
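Since the same command works on lerobot/svla_so100_stacking but not on the custom dataset, one way to narrow things down is to iterate the custom dataset outside of train.py and see whether data loading alone stalls. The sketch below assumes LeRobotDataset is importable from lerobot.common.datasets.lerobot_dataset (the path may differ across LeRobot versions) and uses the repo id from this thread:

# Minimal sketch, assuming the import path matches your LeRobot version:
# iterate the custom dataset with a plain DataLoader to check whether data
# loading alone hangs, independent of the SmolVLA policy.
import torch
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("Loki0929/so100_lan")
loader = torch.utils.data.DataLoader(dataset, batch_size=8, num_workers=4, shuffle=True)

for step, batch in enumerate(loader):
    if step % 100 == 0:
        print(f"loaded batch {step}")
    if step >= 1000:  # bounded pass, enough to surface a stall
        break
print("finished iterating without hanging")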

Loki-Lu avatar Jun 13 '25 05:06 Loki-Lu

Hey @Loki-Lu 👋 Mind sharing some links so that I can have a look? E.g., can you share:

  1. A link to the dataset on the Hub
  2. The wandb run, so that I can take a look there

Thank you!

fracapuano avatar Jun 13 '25 07:06 fracapuano

Hi @fracapuano, thank you for your reply.

  1. Loki0929/so100_lan: this is the link to my dataset.
  2. My Wandb link

Loki-Lu avatar Jun 16 '25 07:06 Loki-Lu