[Bug]: Multi GPU error on Windows: unsupported gloo device

Open • hyppyhyppo opened this issue 1 month ago • 1 comment

What happened?

Since recent updates, enabling multi-GPU training on Windows fails with `RuntimeError: makeDeviceForHostname(): unsupported gloo device`, and training stops.

What did you expect would happen?

Training should run on several GPUs on Windows.

Relevant log output

activating venv C:\OneTrainer\OneTrainer\venv
Using Python "C:\OneTrainer\OneTrainer\venv\Scripts\python.exe" -X utf8
HF_HUB_DISABLE_XET=1

NOTE: Xet disabled, to enable it set as 0 before launch
Checking Python version...
Python 3.12.10

Starting UI...
Warning: If 'Samples To Tensorboard' is enabled, only one GPU is used for sampling!
Traceback (most recent call last):
  File "C:\OneTrainer\OneTrainer\modules\ui\TrainUI.py", line 758, in __training_thread_function
    trainer.train()
  File "C:\OneTrainer\OneTrainer\modules\trainer\MultiTrainer.py", line 90, in train
    MultiTrainer._train_process(-1, world_size, config_dict, devices, callbacks=self.callbacks) #main process is rank #0
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\OneTrainer\OneTrainer\modules\trainer\MultiTrainer.py", line 49, in _train_process
    torch.distributed.init_process_group(rank=rank, world_size=world_size, device_id=device, timeout=timeout,
  File "C:\OneTrainer\OneTrainer\venv\Lib\site-packages\torch\distributed\c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\OneTrainer\OneTrainer\venv\Lib\site-packages\torch\distributed\c10d_logger.py", line 95, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "C:\OneTrainer\OneTrainer\venv\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1764, in init_process_group
    default_pg, _ = _new_process_group_helper(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\OneTrainer\OneTrainer\venv\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1991, in _new_process_group_helper
    backend_class = ProcessGroupGloo(
                    ^^^^^^^^^^^^^^^^^
RuntimeError: makeDeviceForHostname(): unsupported gloo device

Generate and upload debug_report.log

OneTrainer_debug_report.zip

hyppyhyppo, Nov 13 '25 12:11

According to https://github.com/pytorch/pytorch/issues/150381, the only known workaround is to downgrade to torch 2.7.1. Upgrading to a newer version doesn't help yet; even torch 2.9 seems to be affected.

Torch 2.7.1 should still be compatible with current OneTrainer, as long as you keep "Compile Transformer Blocks" disabled.
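
To check quickly whether an install falls into the affected combination, something like the sketch below may help. It is not part of OneTrainer; the function names and the `LAST_KNOWN_GOOD` constant are made up for illustration, and it assumes that every torch release newer than 2.7.1 is affected on Windows, which the linked issue suggests but does not guarantee.

```python
import re

# Assumption taken from the comment above: 2.7.1 is the last torch release
# where gloo initialisation is known to work on Windows.
LAST_KNOWN_GOOD = (2, 7, 1)


def parse_torch_version(version: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from strings such as '2.9.0+cu128'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised torch version: {version!r}")
    major, minor, patch = match.groups()
    # A missing patch component (e.g. '2.8') is treated as 0.
    return int(major), int(minor), int(patch or 0)


def gloo_init_may_fail(platform: str, torch_version: str) -> bool:
    """Return True for the Windows + newer-torch combination described above.

    This is a heuristic only; it does not probe gloo itself.
    """
    is_windows = platform == "win32"  # value of sys.platform on Windows
    return is_windows and parse_torch_version(torch_version) > LAST_KNOWN_GOOD
```

Inside the OneTrainer venv you would call it as `gloo_init_may_fail(sys.platform, torch.__version__)` after importing `sys` and `torch`. A `True` result only means the install matches the reported pattern, not that the error will definitely occur.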

dxqb, Nov 13 '25 14:11