TORCH_USE_CUDA_DSA on policy act

Open Katehuuh opened this issue 8 months ago • 1 comments

System Info

Windows, CUDA 12.8, PyTorch nightly 2.8.0. I am using a single so100 dataset with 3 episodes (examples) of a simple "pick and place" task.

Additional information on my full installation follows.

System info, collected with python lerobot/scripts/display_sys_info.py & python -m torch.utils.collect_env & python -c "import torch; print(torch.cuda.is_available())" && nvcc -V:

- 1*so100 fully assembled from wowrobo. 2*USB-C & power adapter for arms + 1*USB camera.
- `lerobot` version: 0.1.0
- Platform: Windows-10-10.0.22621-SP0
- Python version: 3.10.8
- Huggingface_hub version: 0.30.1
- Dataset version: 3.5.0
- Numpy version: 2.1.2
- PyTorch version (GPU?): 2.8.0.dev20250327+cu128 (True)
- Cuda version: 12080


PyTorch version: 2.8.0.dev20250327+cu128
Is debug build: False
CUDA used to build PyTorch: 12.8
OS: Microsoft Windows 11 Pro (10.0.22621 64-bit)
CMake version: version 4.0.0
Is CUDA available: True
Nvidia driver version: 572.83
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\bin\cudnn_ops64_9.dll
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==2.1.2
[pip3] torch==2.8.0.dev20250327+cu128
[pip3] torchaudio==2.6.0.dev20250401+cu128
[pip3] torchcodec==0.0.0.dev0
[pip3] torchvision==0.22.0.dev20250401+cu128


True
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:42:46_Pacific_Standard_Time_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0

Set up Python environments:

git clone https://github.com/huggingface/lerobot.git && cd lerobot
python -m venv venv && venv\Scripts\activate
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
pip install av poetry-core # Fix Windows
pip install -e ".[feetech]"

Find motors bus port

(venv) C:\lerobot>python lerobot/scripts/find_motors_bus_port.py
Finding all available ports for the MotorsBus.
Ports before disconnecting: ['COM3', 'COM4']
Remove the USB cable from your MotorsBus and press Enter when done.

The port of this MotorsBus is 'COM4'
Reconnect the USB cable.
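
The port-finding step above boils down to comparing the port list before and after the cable is unplugged. A minimal sketch of that idea follows; `find_disconnected_port` is a hypothetical helper written for illustration, not the code in find_motors_bus_port.py, and the "after" list in the usage lines is assumed.

```python
def find_disconnected_port(ports_before, ports_after):
    """Return the single port that vanished after unplugging the cable."""
    missing = sorted(set(ports_before) - set(ports_after))
    if len(missing) != 1:
        raise ValueError(f"expected exactly one missing port, got {missing}")
    return missing[0]


if __name__ == "__main__":
    # 'before' matches the log above; 'after' is an assumed listing.
    print(find_disconnected_port(["COM3", "COM4"], ["COM3"]))
```

Raising on zero or several missing ports avoids silently picking the wrong bus when two cables are pulled at once.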

Then find the camera index with python lerobot/common/robot_devices/cameras/opencv.py --images-dir outputs/images_from_opencv_cameras.
Using the results of steps 3 and 4, modify lerobot\common\robot_devices\robots\configs.py accordingly:

class So100RobotConfig(ManipulatorRobotConfig):
...
    leader_arms: dict[str, MotorsBusConfig] = field(
...
+                port="COM3",
...
    follower_arms: dict[str, MotorsBusConfig] = field(
...
+                port="COM4",
...
    cameras: dict[str, CameraConfig] = field(
        default_factory=lambda: {
            "laptop": OpenCVCameraConfig(
+                camera_index=2,
...
            )
        }
    )

Follow the calibration steps. To fix unexpected rotation, recalibrate a single arm by adding --control.arms=[\"main_leader\"]:

python lerobot/scripts/control_robot.py --robot.type=so100 --control.type=calibrate

Verify calibration with teleoperate:

python lerobot/scripts/control_robot.py --robot.type=so100 --control.type=teleoperate

Create a dataset with teleoperation. 7_get_started_with_real_robot.md recommends at least 50 episodes (10 episodes at each of 5 different starting locations); I stopped at 3:

python lerobot/scripts/control_robot.py --robot.type=so100 --control.type=record --control.fps=30 --control.single_task="Pick and place task" --control.repo_id=local/so100_dataset --control.root=C:/lerobot/training_data --control.push_to_hub=false --control.warmup_time_s=5 --control.episode_time_s=30 --control.reset_time_s=15 --control.num_episodes=50 --control.video=true

Train a model with the diffusion policy, or with pi0 or tdmpc; --policy.type=act does not work for me. Train for 200k steps or until the loss starts plateauing:

python lerobot/scripts/train.py --dataset.repo_id=local/so100_dataset --dataset.root=C:/lerobot/training_data --policy.type=diffusion --output_dir=C:/lerobot/model_diffusion --job_name=diffusion_training --policy.device=cuda --policy.use_amp=true --batch_size=8 --steps=200000 --num_workers=4 --wandb.enable=false --dataset.video_backend=pyav --seed=1234

Evaluation/Run model:

python lerobot/scripts/control_robot.py --robot.type=so100 --control.type=record --control.fps=30 --control.single_task="Diffusion policy evaluation" --control.repo_id=local/eval_diffusion --control.root=C:/lerobot/diffusion_eval --control.push_to_hub=false --control.warmup_time_s=5 --control.episode_time_s=30 --control.reset_time_s=15 --control.num_episodes=10 --control.video=true --control.policy.path=C:/lerobot/model_diffusion/checkpoints/last/pretrained_model --control.num_image_writer_processes=1

Information

[x] One of the scripts in the examples/ folder of LeRobot
[x] My own task or dataset (give details below)

Reproduction

(venv) C:\lerobot>python lerobot/scripts/train.py --dataset.repo_id=local/so100_quick --dataset.root=C:/lerobot/quick_demo --policy.type=act --output_dir=C:/lerobot/quick_model_act --job_name=quick_demo_act --policy.device=cuda --policy.use_amp=true --batch_size=1 --steps=5000 --num_workers=4 --wandb.enable=false --dataset.video_backend=pyav --policy.chunk_size=10 --policy.n_action_steps=10 --policy.n_heads=4 --policy.dim_feedforward=1024 --policy.n_encoder_layers=2
INFO ts\train.py:111 {'batch_size': 1,
 'dataset': {'episodes': None,
             'image_transforms': {'enable': False,
                                  'max_num_transforms': 3,
                                  'random_order': False,
                                  'tfs': {'brightness': {'kwargs': {'brightness': [0.8,
                                                                                   1.2]},
                                                         'type': 'ColorJitter',
                                                         'weight': 1.0},
                                          'contrast': {'kwargs': {'contrast': [0.8,
                                                                               1.2]},
                                                       'type': 'ColorJitter',
                                                       'weight': 1.0},
                                          'hue': {'kwargs': {'hue': [-0.05,
                                                                     0.05]},
                                                  'type': 'ColorJitter',
                                                  'weight': 1.0},
                                          'saturation': {'kwargs': {'saturation': [0.5,
                                                                                   1.5]},
                                                         'type': 'ColorJitter',
                                                         'weight': 1.0},
                                          'sharpness': {'kwargs': {'sharpness': [0.5,
                                                                                 1.5]},
                                                        'type': 'SharpnessJitter',
                                                        'weight': 1.0}}},
             'repo_id': 'local/so100_quick',
             'revision': None,
             'root': 'C:/lerobot/quick_demo',
             'use_imagenet_stats': True,
             'video_backend': 'pyav'},
 'env': None,
 'eval': {'batch_size': 50, 'n_episodes': 50, 'use_async_envs': False},
 'eval_freq': 20000,
 'job_name': 'quick_demo_act',
 'log_freq': 200,
 'num_workers': 4,
 'optimizer': {'betas': [0.9, 0.999],
               'eps': 1e-08,
               'grad_clip_norm': 10.0,
               'lr': 1e-05,
               'type': 'adamw',
               'weight_decay': 0.0001},
 'output_dir': 'C:\\lerobot\\quick_model_act',
 'policy': {'chunk_size': 10,
            'device': 'cuda',
            'dim_feedforward': 1024,
            'dim_model': 512,
            'dropout': 0.1,
            'feedforward_activation': 'relu',
            'input_features': {},
            'kl_weight': 10.0,
            'latent_dim': 32,
            'n_action_steps': 10,
            'n_decoder_layers': 1,
            'n_encoder_layers': 2,
            'n_heads': 4,
            'n_obs_steps': 1,
            'n_vae_encoder_layers': 4,
            'normalization_mapping': {'ACTION': <NormalizationMode.MEAN_STD: 'MEAN_STD'>,
                                      'STATE': <NormalizationMode.MEAN_STD: 'MEAN_STD'>,
                                      'VISUAL': <NormalizationMode.MEAN_STD: 'MEAN_STD'>},
            'optimizer_lr': 1e-05,
            'optimizer_lr_backbone': 1e-05,
            'optimizer_weight_decay': 0.0001,
            'output_features': {},
            'pre_norm': False,
            'pretrained_backbone_weights': 'ResNet18_Weights.IMAGENET1K_V1',
            'replace_final_stride_with_dilation': False,
            'temporal_ensemble_coeff': None,
            'type': 'act',
            'use_amp': True,
            'use_vae': True,
            'vision_backbone': 'resnet18'},
 'resume': False,
 'save_checkpoint': True,
 'save_freq': 20000,
 'scheduler': None,
 'seed': 1000,
 'steps': 5000,
 'use_policy_training_preset': True,
 'wandb': {'disable_artifact': False,
           'enable': False,
           'entity': None,
           'mode': None,
           'notes': None,
           'project': 'lerobot',
           'run_id': None}}
INFO ts\train.py:117 Logs will be saved locally.
INFO ts\train.py:127 Creating dataset
INFO ts\train.py:138 Creating policy
INFO ts\train.py:144 Creating optimizer and scheduler
INFO ts\train.py:156 Output dir: C:\lerobot\quick_model_act
INFO ts\train.py:159 cfg.steps=5000 (5K)
INFO ts\train.py:160 dataset.num_frames=2512 (3K)
INFO ts\train.py:161 dataset.num_episodes=3
INFO ts\train.py:162 num_learnable_params=27271942 (27M)
INFO ts\train.py:163 num_total_params=27271984 (27M)
INFO ts\train.py:202 Start offline training on a fixed dataset
C:\lerobot\venv\lib\site-packages\torch\autograd\graph.py:824: UserWarning: Ignoring invalid value for boolean flag CUDA_LAUNCH_BLOCKING: 1 valid values are 0 or 1. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\c10\util\env.cpp:91.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "C:\lerobot\lerobot\scripts\train.py", line 288, in <module>
    train()
  File "C:\lerobot\lerobot\configs\parser.py", line 227, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "C:\lerobot\lerobot\scripts\train.py", line 212, in train
    train_tracker, output_dict = update_policy(
  File "C:\lerobot\lerobot\scripts\train.py", line 73, in update_policy
    grad_scaler.scale(loss).backward()
  File "C:\lerobot\venv\lib\site-packages\torch\_tensor.py", line 648, in backward
    torch.autograd.backward(
  File "C:\lerobot\venv\lib\site-packages\torch\autograd\__init__.py", line 353, in backward
    _engine_run_backward(
  File "C:\lerobot\venv\lib\site-packages\torch\autograd\graph.py", line 824, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: too many resources requested for launch
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
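
The warning near the top of the log says PyTorch ignored the `CUDA_LAUNCH_BLOCKING` value even though it looks like `1`. One possible cause, which is an assumption and not confirmed in this issue, is trailing whitespace: cmd.exe keeps the space before `&&` when the variable is set with `set CUDA_LAUNCH_BLOCKING=1 && python ...`. A sketch that cleans the value from inside Python; `normalize_bool_env` is a hypothetical helper, not a PyTorch API.

```python
import os


def normalize_bool_env(name, env=os.environ):
    """Strip whitespace from a 0/1 flag; return the cleaned value or None."""
    raw = env.get(name)
    if raw is None:
        return None
    cleaned = raw.strip()
    if cleaned not in ("0", "1"):
        raise ValueError(f"{name}={raw!r} is not 0 or 1")
    env[name] = cleaned
    return cleaned


if __name__ == "__main__":
    # Run this before the first CUDA call; doing it before `import torch`
    # is the safe choice.
    os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")
    print(normalize_bool_env("CUDA_LAUNCH_BLOCKING"))
```

With the flag actually honored, kernel launches run synchronously, so the traceback should point closer to the failing operation.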

However, I do not think VRAM is the issue, since I can train with --scheduler.type diffuser just fine.
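
"Too many resources requested for launch" is CUDA's launch-out-of-resources error. It concerns per-kernel limits (registers, shared memory, threads per block), not total VRAM, which fits the observation above. One thing worth checking is whether the installed PyTorch build lists native kernels for the GPU's architecture. A sketch, assuming `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list` behave as documented; `arch_is_listed` is a hypothetical helper.

```python
def arch_is_listed(capability, arch_list):
    """True if the build lists native code (sm_XY) for this capability."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print(torch.cuda.get_device_name(0), cap, archs)
        print("native kernels listed:", arch_is_listed(cap, archs))
    else:
        print("CUDA-enabled torch not available; nothing to check")
```

A listed architecture does not rule out a launch-configuration bug in one specific kernel, so a positive result here narrows the problem without settling it.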

Expected behavior

I expected ACT training to run. I modified the act command to keep VRAM usage low (for example --batch_size=1 and --policy.chunk_size=10), but without success.

Apr 02 '25 18:04 Katehuuh

I had the same issue. My GPU is an RTX 5080 and my torch version was 2.8.0.dev20250331+cu128. After upgrading torch, training worked:

pip install --upgrade --pre torch torchvision torchaudio torchcodec --index-url https://download.pytorch.org/whl/nightly/cu128

My current torch version is 2.8.0.dev20250407+cu128.

Apr 08 '25 01:04 eastflag
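
The reply above ties the fix to the nightly date embedded in the version string. A small sketch for comparing nightly builds by that date; `nightly_date` is a hypothetical helper. The only data points are the ones reported in this thread (20250331 failed for the commenter, 20250407 worked), so treating 20250407 as a cut-off is an assumption, not a known fix date.

```python
import re
from datetime import date


def nightly_date(version):
    """Extract the build date from a string like '2.8.0.dev20250327+cu128'."""
    match = re.search(r"\.dev(\d{4})(\d{2})(\d{2})", version)
    if match is None:
        return None  # not a nightly build string
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


if __name__ == "__main__":
    reported_working = nightly_date("2.8.0.dev20250407+cu128")
    for version in ("2.8.0.dev20250327+cu128", "2.8.0.dev20250331+cu128"):
        older = nightly_date(version) < reported_working
        print(version, "older than reported-working build:", older)
```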