Setting "--policy.dtype=bfloat16" in the training script raises an error

Open RayTang88 opened this issue 3 months ago • 3 comments

System Info

- lerobot version: 0.3.4
- Platform: Linux-6.8.0-83-generic-x86_64-with-glibc2.31
- Python version: 3.10.12
- Huggingface Hub version: 0.35.3
- Datasets version: 4.0.0
- Numpy version: 2.2.6
- PyTorch version: 2.7.1+cu126
- Is PyTorch built with CUDA support?: True
- Cuda version: 12.6
- GPU model: NVIDIA GeForce RTX 4090
- Using GPU in script?: <fill in>

Information

  • [ ] One of the scripts in the examples/ folder of LeRobot
  • [x] My own task or dataset (give details below)

Reproduction

I used the following command line to fine-tune pi05 on my own data:

python src/lerobot/scripts/lerobot_train.py \
    --dataset.repo_id=./train_data/lerobot/task2_v30/lerobot \
    --policy.type=pi05 \
    --output_dir=./outputs/pi05_training \
    --job_name=pi05_training \
    --policy.repo_id=pi05 \
    --policy.pretrained_path=./model/vla/lerobot/pi05_base \
    --policy.compile_model=true \
    --policy.gradient_checkpointing=true \
    --wandb.enable=false \
    --policy.dtype=bfloat16 \
    --steps=3000 \
    --policy.device=cuda \
    --batch_size=8

Error output:

  File ".venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File ".venv/lib/python3.10/site-packages/transformers/models/siglip/modeling_siglip.py", line 448, in forward
    hidden_states = self.layer_norm1(hidden_states)
  File ".venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File ".venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File ".venv/lib/python3.10/site-packages/torch/nn/modules/normalization.py", line 217, in forward
    return F.layer_norm(
  File ".venv/lib/python3.10/site-packages/torch/nn/functional.py", line 2910, in layer_norm
    return torch.layer_norm(
RuntimeError: expected scalar type Float but found BFloat16

When I removed "--policy.dtype=bfloat16" from the command line, the error did not occur.
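
One way to narrow down an error like this is to check which parameters of the loaded model ended up in which dtype. The sketch below is not part of lerobot. It is a small, hypothetical helper that groups parameter names by dtype, and it accepts any iterable of (name, tensor-like) pairs, such as the result of PyTorch's `named_parameters()`. The parameter names in the stand-in data are made up for illustration.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_params_by_dtype(named_params):
    """Group parameter names by the string form of their dtype.

    `named_params` is any iterable of (name, obj) pairs where `obj`
    exposes a `.dtype` attribute, e.g. `module.named_parameters()`
    on a torch.nn.Module.
    """
    groups = defaultdict(list)
    for name, param in named_params:
        groups[str(param.dtype)].append(name)
    return dict(groups)


# Stand-in data so the sketch runs without PyTorch installed.
# These names are hypothetical, not taken from the pi05 checkpoint.
example = [
    ("vision.layer_norm1.weight", SimpleNamespace(dtype="torch.float32")),
    ("vision.layer_norm1.bias", SimpleNamespace(dtype="torch.float32")),
    ("vision.mlp.fc1.weight", SimpleNamespace(dtype="torch.bfloat16")),
]
report = group_params_by_dtype(example)
for dtype, names in report.items():
    print(dtype, len(names), names[:3])
```

With a real model, the call would look like `group_params_by_dtype(policy.named_parameters())`, where `policy` is assumed to be the loaded `torch.nn.Module`. If the layer norm weights show up under `torch.float32` while other layers are listed under `torch.bfloat16`, that would be consistent with the `RuntimeError` above, although it does not by itself confirm the cause.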

Expected behavior

I would like to know how to train the model in BF16, because my GPU memory is limited. Thanks very much!

RayTang88 avatar Oct 09 '25 05:10 RayTang88

Could you try loading the pretrained model like this: --policy.pretrained_path=lerobot/pi05_base

pkooij avatar Oct 10 '25 08:10 pkooij

I also encountered this issue. This is the command line I used:

lerobot-train \
    --dataset.repo_id=qingchu/lerobot_place_to_box \
    --policy.type=pi05 \
    --output_dir=/root/autodl-tmp/train/pi05 \
    --job_name=pi05_training \
    --policy.pretrained_path=lerobot/pi05_base \
    --policy.compile_model=true \
    --policy.gradient_checkpointing=true \
    --wandb.enable=false \
    --policy.dtype=bfloat16 \
    --steps=20000 \
    --policy.device=cuda \
    --batch_size=8 \
    --save_freq=2000 \
    --policy.push_to_hub=false

qinglin-ai avatar Nov 20 '25 09:11 qinglin-ai

@qinglin-ai did you install with pip install "lerobot[pi]@git+https://github.com/huggingface/lerobot.git", following https://huggingface.co/docs/lerobot/en/pi0#installation-requirements?

pkooij avatar Nov 20 '25 09:11 pkooij