⛔[SD3 branch] Huber+SNR RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
PyTorch 2.7.0, CUDA 12.8
What: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! when using Huber + SNR
Where: train_util.py, line 6025
Fix:
From
alphas_cumprod = torch.index_select(noise_scheduler.alphas_cumprod, 0, timesteps.cpu())
To
alphas_cumprod = torch.index_select(noise_scheduler.alphas_cumprod, 0, timesteps.to(noise_scheduler.alphas_cumprod.device))
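For context, torch.index_select requires its index tensor to be on the same device as the input tensor, so forcing the timesteps onto the CPU fails as soon as alphas_cumprod has ended up on cuda:0. A minimal standalone sketch of the failure and of the device-agnostic fix (assumes a CUDA device is available; the tensors here are illustrative, not taken from sd-scripts):

```python
import torch

alphas_cumprod = torch.linspace(1.0, 1e-4, 1000, device="cuda")  # scheduler table that ended up on the GPU
timesteps = torch.randint(0, 1000, (4,))                          # batch timesteps, still on the CPU

# Mirrors the original line: a CPU index against a cuda:0 table raises
# "Expected all tensors to be on the same device".
# torch.index_select(alphas_cumprod, 0, timesteps.cpu())

# The fix: move the index to wherever the table actually lives.
selected = torch.index_select(alphas_cumprod, 0, timesteps.to(alphas_cumprod.device))
print(selected.device)  # cuda:0
```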
It seems to work without the fix in my environment. Could you please share the command line arguments for training?
@kohya-ss
Traceback (most recent call last):
File "K:\sd-scripts\sd-scripts-sd3\sdxl_train_network.py", line 229, in <module>
trainer.train(args)
File "K:\sd-scripts\sd-scripts-sd3\train_network.py", line 1403, in train
loss = self.process_batch(
^^^^^^^^^^^^^^^^^^^
File "K:\sd-scripts\sd-scripts-sd3\train_network.py", line 463, in process_batch
huber_c = train_util.get_huber_threshold_if_needed(args, timesteps, noise_scheduler)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "K:\sd-scripts\sd-scripts-sd3\library\train_util.py", line 6025, in get_huber_threshold_if_needed
alphas_cumprod = torch.index_select(noise_scheduler.alphas_cumprod, 0, timesteps.cpu())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
steps: 0%| | 0/1000 [00:19<?, ?it/s]
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\Users\user\AppData\Local\Programs\Python\Python311\Scripts\accelerate.exe\__main__.py", line 7, in <module>
File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\commands\accelerate_cli.py", line 50, in main
args.func(args)
File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\commands\launch.py", line 1198, in launch_command
simple_launcher(args)
File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\commands\launch.py", line 785, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)

accelerate launch --num_cpu_threads_per_process 8 sdxl_train_network.py ^
--pretrained_model_name_or_path="K:/model.safetensors" ^
--train_data_dir="data" ^
--output_dir="output_dir" ^
--output_name="000" ^
--network_args "algo=boft" "rescaled=True" "constraint=1" "conv_dim=32" "conv_alpha=1" "preset=full" "train_norm=False" ^
--resolution="864" ^
--save_model_as="safetensors" ^
--network_module="lycoris.kohya" ^
--shuffle_caption ^
--max_train_epochs=100 ^
--save_every_n_epochs=1 ^
--save_state_on_train_end ^
--save_precision=bf16 ^
--network_dim=32 ^
--network_alpha=1 ^
--train_batch_size=2 ^
--gradient_accumulation_steps=2 ^
--max_data_loader_n_workers=1 ^
--enable_bucket ^
--bucket_reso_steps=64 ^
--min_bucket_reso=768 ^
--max_bucket_reso=1280 ^
--mixed_precision="bf16" ^
--caption_extension=".txt" ^
--gradient_checkpointing ^
--optimizer_type="prodigyplus.prodigy_plus_schedulefree.ProdigyPlusScheduleFree" ^
--optimizer_args "d0=1e-5" "d_limiter=False" "use_stableadamw=False" "weight_decay_by_lr=False" ^
--learning_rate=1 ^
--network_train_unet_only ^
--loss_type="huber" ^
--huber_schedule="snr" ^
--huber_c=0.1 ^
--xformers ^
--vae="K:/sdxl_vae_fix.safetensors" ^
--full_bf16 ^
--mem_eff_attn ^
--seed=1 ^
--logging_dir="logs" ^
--log_with="tensorboard" ^
--persistent_data_loader_workers
Thank you for sharing the command. I don't have LyCORIS installed, so I can't test that. The command gives the error: TypeError: ProdigyPlusScheduleFree.__init__() got an unexpected keyword argument 'd_limiter'.
I think prodigy_plus_schedulefree or LyCORIS might be the reason. Could you try removing them to see what is causing the error?
@kohya-ss
This definitely doesn't depend on the optimizer being used; the same issue occurs with Adafactor or any other optimizer. d_limiter is not the root of the problem; it is a Prodigy Plus argument (see: https://github.com/LoganBooker/prodigy-plus-schedule-free/blob/main/prodigyplus/prodigy_plus_schedulefree.py) and it works fine.
I haven't tested without LyCORIS yet, since I'm using non-standard algorithms and haven't used the standard LoRA module, but I can try it later.
Yeah, I have had this issue myself as well, but haven't identified why it only happens sometimes. We probably need to figure out why alphas_cumprod ends up on a non-CPU device; then this would work as expected.
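One quick way to confirm where the table ends up at the point of failure (a hypothetical debug line, not part of sd-scripts) is to print both devices just above the failing index_select in get_huber_threshold_if_needed:

```python
# Hypothetical debug line, not part of sd-scripts:
print("alphas_cumprod on", noise_scheduler.alphas_cumprod.device, "| timesteps on", timesteps.device)
```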
PyTorch 2.7.0, CUDA 12.8
What: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! when using Huber + SNR
Where: train_util.py, line 6025
Fix:
From
alphas_cumprod = torch.index_select(noise_scheduler.alphas_cumprod, 0, timesteps.cpu())
To
alphas_cumprod = torch.index_select(noise_scheduler.alphas_cumprod, 0, timesteps.to(noise_scheduler.alphas_cumprod.device))
Same problem here, and the fix worked on my Linux box as well. I'm running standard LoRA training, so I don't think LyCORIS is involved.
Torch 2.7, CUDA 12.6, ProdigyPlusScheduleFree, loss_type: Huber, Huber schedule: SNR, SDPA attention. Triton uninstalled entirely.
@kohya-ss I tested regular LoRA training and the problem repeats, so it does not depend on a custom network_module.
@kjerk
Triton uninstalled entirely.
By the way, I have Triton installed, so it's not Triton's fault.
@kohya-ss I tested regular LoRA training and the problem repeats, so it does not depend on a custom network_module.
Thank you for your investigation. It's very strange that it works in my environment.
I'd like to fix the code, but I'd also like to find out the cause if possible.
I just updated and was hitting this bug; I spent hours trying to understand what was wrong until I finally stumbled across this thread. I'm doing standard LoRA training with Adafactor. The bug was popping up despite using the same parameters I had before I updated, and even a fresh install didn't help.
Swapping out that line of code made training start, but at a much slower rate than the previous version I'd been using. It went from 1.3 seconds per step to 2.3 seconds. I'm really confused why this bug is still here over a month later.
@Nineball459
It went from 1.3 seconds per step to 2.3 seconds.
Strange, no difference on my side. Show your full config please.
I'm using Kohya_SS GUI; I downloaded the latest version, and my PC has an RTX 4070 Super with a 7800X3D. I can't give more info than that, unfortunately, as I've already rolled back to a Kohya_SS GUI commit from December; I no longer have any issues and my training speed has returned to normal.
The issue is here:
https://github.com/kohya-ss/sd-scripts/blob/7c075a9c8d234fccf8e0d66b9538a0b17bf4b13f/library/train_util.py#L5977-L6011
noise_scheduler.add_noise() moves alphas_cumprod to the GPU.
You can see how it is handled in the DDIMScheduler:
https://github.com/huggingface/diffusers/blob/a4df8dbc40e170ff828f8d8f79c2c861c9f1748d/src/diffusers/schedulers/scheduling_ddim.py#L474-L498
So this issue is specific to SD and SDXL models. We can move alphas_cumprod back to the CPU afterwards.
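A minimal sketch of that workaround, assuming a diffusers-style scheduler (the helper name is illustrative and not part of sd-scripts):

```python
def add_noise_keeping_table_on_cpu(noise_scheduler, latents, noise, timesteps):
    # diffusers' add_noise() may cache alphas_cumprod on latents.device (e.g. cuda:0);
    # move the table back afterwards so later CPU-indexed lookups keep working.
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
    if noise_scheduler.alphas_cumprod.device.type != "cpu":
        noise_scheduler.alphas_cumprod = noise_scheduler.alphas_cumprod.cpu()
    return noisy_latents
```

The alternative, as in the fix above, is to make the index_select itself device-agnostic by indexing with timesteps.to(noise_scheduler.alphas_cumprod.device).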