
⛔[SD3 branch] Huber+SNR RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

[Open] deGENERATIVE-SQUAD opened this issue 7 months ago • 13 comments

Pytorch 2.7.0, CUDA 12.8

What: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! when using Huber + SNR

Where: train_util.py, line 6025

Fix:

From:
`alphas_cumprod = torch.index_select(noise_scheduler.alphas_cumprod, 0, timesteps.cpu())`

To:
`alphas_cumprod = torch.index_select(noise_scheduler.alphas_cumprod, 0, timesteps.to(noise_scheduler.alphas_cumprod.device))`
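
For context, a minimal standalone sketch of the mismatch and the proposed change; this is illustrative only (not the actual train_util.py code), assumes a CUDA device, and uses `alphas_cumprod`/`timesteps` as stand-ins for the tensors named above:

```python
import torch

# Illustrative stand-ins for noise_scheduler.alphas_cumprod (after it has ended
# up on the GPU) and the CUDA timesteps tensor used during training.
alphas_cumprod = torch.linspace(0.9999, 0.0001, 1000, device="cuda")
timesteps = torch.randint(0, 1000, (4,), device="cuda")

# Original line: the index is forced to the CPU while the source tensor is on
# cuda:0, which triggers the device-mismatch RuntimeError.
try:
    torch.index_select(alphas_cumprod, 0, timesteps.cpu())
except RuntimeError as err:
    print("reproduced:", err)

# Proposed change: index on whatever device alphas_cumprod actually lives on,
# which works whether the scheduler tensor is on the CPU or the GPU.
selected = torch.index_select(alphas_cumprod, 0, timesteps.to(alphas_cumprod.device))
print(selected.device)  # cuda:0
```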

deGENERATIVE-SQUAD avatar May 29 '25 01:05 deGENERATIVE-SQUAD

It seems to work without fix in my environment. Could you please share the command line arguments for training?

kohya-ss avatar Jun 02 '25 12:06 kohya-ss

@kohya-ss

Traceback (most recent call last):
  File "K:\sd-scripts\sd-scripts-sd3\sdxl_train_network.py", line 229, in <module>
    trainer.train(args)
  File "K:\sd-scripts\sd-scripts-sd3\train_network.py", line 1403, in train
    loss = self.process_batch(
           ^^^^^^^^^^^^^^^^^^^
  File "K:\sd-scripts\sd-scripts-sd3\train_network.py", line 463, in process_batch
    huber_c = train_util.get_huber_threshold_if_needed(args, timesteps, noise_scheduler)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "K:\sd-scripts\sd-scripts-sd3\library\train_util.py", line 6025, in get_huber_threshold_if_needed
    alphas_cumprod = torch.index_select(noise_scheduler.alphas_cumprod, 0, timesteps.cpu())
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
steps:   0%|                                                                                  | 0/1000 [00:19<?, ?it/s]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\user\AppData\Local\Programs\Python\Python311\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\commands\accelerate_cli.py", line 50, in main
    args.func(args)
  File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\commands\launch.py", line 1198, in launch_command
    simple_launcher(args)
  File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\commands\launch.py", line 785, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)


accelerate launch --num_cpu_threads_per_process 8 sdxl_train_network.py ^
	--pretrained_model_name_or_path="K:/model.safetensors" ^
	--train_data_dir="data" ^
	--output_dir="output_dir" ^
	--output_name="000" ^
	--network_args "algo=boft" "rescaled=True" "constraint=1" "conv_dim=32" "conv_alpha=1" "preset=full" "train_norm=False" ^
	--resolution="864" ^
	--save_model_as="safetensors" ^
	--network_module="lycoris.kohya" ^
	--shuffle_caption ^
	--max_train_epochs=100 ^
	--save_every_n_epochs=1 ^
	--save_state_on_train_end ^
	--save_precision=bf16 ^
	--network_dim=32 ^
	--network_alpha=1 ^
	--train_batch_size=2 ^
	--gradient_accumulation_steps=2 ^
	--max_data_loader_n_workers=1 ^
	--enable_bucket ^
	--bucket_reso_steps=64 ^
	--min_bucket_reso=768 ^
	--max_bucket_reso=1280 ^
	--mixed_precision="bf16" ^
	--caption_extension=".txt" ^
	--gradient_checkpointing ^
	--optimizer_type="prodigyplus.prodigy_plus_schedulefree.ProdigyPlusScheduleFree" ^
	--optimizer_args "d0=1e-5" "d_limiter=False" "use_stableadamw=False" "weight_decay_by_lr=False" ^
	--learning_rate=1 ^
	--network_train_unet_only ^
	--loss_type="huber" ^
	--huber_schedule="snr" ^
	--huber_c=0.1 ^
	--xformers ^
	--vae="K:/sdxl_vae_fix.safetensors" ^
	--full_bf16 ^
	--mem_eff_attn ^
	--seed=1 ^
	--logging_dir="logs" ^
	--log_with="tensorboard" ^
	--persistent_data_loader_workers

deGENERATIVE-SQUAD avatar Jun 02 '25 21:06 deGENERATIVE-SQUAD

Thank you for sharing the command. I don't have LyCORIS installed, so I can't test that. The command gives the error: TypeError: ProdigyPlusScheduleFree.__init__() got an unexpected keyword argument 'd_limiter'.

I think prodigy_plus_schedulefree or LyCORIS might be the reason. Could you try removing them to see what is causing the error?

kohya-ss avatar Jun 03 '25 14:06 kohya-ss

@kohya-ss

This definitely doesn't depend on the optimizer being used; the same issue occurs with Adafactor or any other optimizer. d_limiter is not the root of the problem; it is a regular Prodigy Plus argument (see: https://github.com/LoganBooker/prodigy-plus-schedule-free/blob/main/prodigyplus/prodigy_plus_schedulefree.py), and it works fine.

I haven’t tested without LyCORIS yet, since I’m using non-standard algorithms and haven’t used the standard LoRA module - but I can try it later.

deGENERATIVE-SQUAD avatar Jun 03 '25 16:06 deGENERATIVE-SQUAD

Yeah, I have had this issue myself as well, but I haven't identified why it only happens sometimes. We probably need to identify why alphas_cumprod is on a non-CPU device; then this would work as expected.
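
A quick way to check that (a hypothetical debug helper, not upstream code; the names follow the traceback earlier in the thread) would be something like:

```python
import torch

def log_devices(noise_scheduler, timesteps: torch.Tensor) -> None:
    """Hypothetical debug helper: call it just above the index_select in
    get_huber_threshold_if_needed to see where each tensor lives."""
    print("alphas_cumprod:", noise_scheduler.alphas_cumprod.device)  # reportedly cuda:0 when it fails
    print("timesteps:     ", timesteps.device)                       # cuda:0 during training
```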

rockerBOO avatar Jun 05 '25 20:06 rockerBOO

> Pytorch 2.7.0, CUDA 12.8
>
> What: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! when using Huber + SNR
>
> Where: train_util.py, line 6025
>
> Fix:
>
> From:
> `alphas_cumprod = torch.index_select(noise_scheduler.alphas_cumprod, 0, timesteps.cpu())`
>
> To:
> `alphas_cumprod = torch.index_select(noise_scheduler.alphas_cumprod, 0, timesteps.to(noise_scheduler.alphas_cumprod.device))`

Same problem here, and the fix worked on my Linux box as well. I'm running standard LoRA training, so I don't think LyCORIS is involved.

Torch 2.7 CU 12.6, ProdigyPlusScheduleFree, loss_type: Huber, Huber schedule: SNR, SDPA attention. Triton uninstalled entirely.

kjerk avatar Jun 06 '25 08:06 kjerk

@kohya-ss I tested regular LoRA training and the problem repeats, so it does not depend on a custom network_module.

deGENERATIVE-SQUAD avatar Jun 06 '25 08:06 deGENERATIVE-SQUAD

@kjerk

> Triton uninstalled entirely.

By the way, I have Triton installed, so it's not Triton's fault.

deGENERATIVE-SQUAD avatar Jun 06 '25 08:06 deGENERATIVE-SQUAD

> @kohya-ss I tested regular LoRA training and the problem repeats, so it does not depend on a custom network_module.

Thank you for your investigation. It's very strange that it works in my environment.

I'd like to fix the code, but I'd also like to find out the cause if possible.

kohya-ss avatar Jun 06 '25 12:06 kohya-ss

I just updated and was hit by this bug; I spent hours trying to understand what was wrong until I finally stumbled across this thread. I'm doing standard LoRA training with Adafactor. The bug was popping up despite my using the same parameters I had before the update. I even did a fresh install, with no luck.

Swapping out that line of code made it start training again, but at a much slower rate than the previous version I'd been using. It went from 1.3 seconds per step to 2.3 seconds. I'm really confused why this bug is still here over a month later for me.

Nineball459 avatar Jul 04 '25 09:07 Nineball459

@Nineball459

> It went from 1.3 seconds per step to 2.3 seconds.

Strange, no difference on my side. Show your full config please.

deGENERATIVE-SQUAD avatar Jul 06 '25 15:07 deGENERATIVE-SQUAD

> @Nineball459
>
> > It went from 1.3 seconds per step to 2.3 seconds.
>
> Strange, no difference on my side. Show your full config please.

I'm using Kohya_SS GUI; I downloaded the latest version, and my PC is an RTX 4070 Super with a 7800X3D. I can't give more info than that, unfortunately, as I've already rolled back to a Kohya_SS GUI commit from December; I no longer have any issues and my training speed has returned to normal.

Nineball459 avatar Jul 09 '25 03:07 Nineball459

The issue is

https://github.com/kohya-ss/sd-scripts/blob/7c075a9c8d234fccf8e0d66b9538a0b17bf4b13f/library/train_util.py#L5977-L6011

noise_scheduler.add_noise() moves alphas_cumprod to the GPU.

And you can see how it works in the DDIMScheduler:

https://github.com/huggingface/diffusers/blob/a4df8dbc40e170ff828f8d8f79c2c861c9f1748d/src/diffusers/schedulers/scheduling_ddim.py#L474-L498

So this issue affects SD and SDXL models specifically. We can move alphas_cumprod back to the CPU afterwards.
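
A minimal sketch of what's described above, using a plain diffusers DDIMScheduler; it assumes a CUDA device, and the exact side effect depends on the installed diffusers version, so treat it as an illustration rather than the upstream patch:

```python
import torch
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(num_train_timesteps=1000)
print(scheduler.alphas_cumprod.device)  # cpu right after construction

samples = torch.randn(2, 4, 8, 8, device="cuda")
noise = torch.randn_like(samples)
timesteps = torch.randint(0, 1000, (2,), device="cuda")

# As reported above, add_noise() can leave alphas_cumprod on the GPU afterwards.
scheduler.add_noise(samples, noise, timesteps)
print(scheduler.alphas_cumprod.device)  # cuda:0 on the affected versions

# Option A (suggested above): move it back to the CPU after add_noise, so the
# later timesteps.cpu() lookup in get_huber_threshold_if_needed keeps working.
scheduler.alphas_cumprod = scheduler.alphas_cumprod.cpu()

# Option B (the fix from the first post): make the lookup device-agnostic.
alphas_cumprod = torch.index_select(
    scheduler.alphas_cumprod, 0, timesteps.to(scheduler.alphas_cumprod.device)
)
```

Either option keeps alphas_cumprod and the index tensor on the same device, which is all index_select requires.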

rockerBOO avatar Jul 15 '25 23:07 rockerBOO