OneTrainer
[Bug]: Multi GPU error on Windows: unsupported gloo device
What happened?
Since a recent update, enabling multi-GPU training on Windows raises an "unsupported gloo device" error and training stops.
What did you expect would happen?
Training with several GPUs on Windows, as before.
Relevant log output
activating venv C:\OneTrainer\OneTrainer\venv
Using Python "C:\OneTrainer\OneTrainer\venv\Scripts\python.exe" -X utf8
HF_HUB_DISABLE_XET=1
NOTE: Xet disabled, to enable it set as 0 before launch
Checking Python version...
Python 3.12.10
Starting UI...
Warning: If 'Samples To Tensorboard' is enabled, only one GPU is used for sampling!
Traceback (most recent call last):
File "C:\OneTrainer\OneTrainer\modules\ui\TrainUI.py", line 758, in __training_thread_function
trainer.train()
File "C:\OneTrainer\OneTrainer\modules\trainer\MultiTrainer.py", line 90, in train
MultiTrainer._train_process(-1, world_size, config_dict, devices, callbacks=self.callbacks) #main process is rank #0
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\OneTrainer\OneTrainer\modules\trainer\MultiTrainer.py", line 49, in _train_process
torch.distributed.init_process_group(rank=rank, world_size=world_size, device_id=device, timeout=timeout,
File "C:\OneTrainer\OneTrainer\venv\Lib\site-packages\torch\distributed\c10d_logger.py", line 81, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\OneTrainer\OneTrainer\venv\Lib\site-packages\torch\distributed\c10d_logger.py", line 95, in wrapper
func_return = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\OneTrainer\OneTrainer\venv\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1764, in init_process_group
default_pg, _ = _new_process_group_helper(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\OneTrainer\OneTrainer\venv\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1991, in _new_process_group_helper
backend_class = ProcessGroupGloo(
^^^^^^^^^^^^^^^^^
RuntimeError: makeDeviceForHostname(): unsupported gloo device
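For reference, here is a minimal standalone sketch (not OneTrainer code) that goes through the same ProcessGroupGloo construction as the call in MultiTrainer.py; the address, port, and single-process world size are placeholder values, and the device_id argument OneTrainer passes is omitted here. On an affected torch version it should fail the same way:

# Minimal sketch of the failing init, assuming torch >= 2.8 on Windows.
import datetime
import os

import torch.distributed as dist

# Placeholder rendezvous settings for a single-process run.
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"

# Windows only supports the gloo backend for torch.distributed; on affected
# versions this raises
# "RuntimeError: makeDeviceForHostname(): unsupported gloo device".
dist.init_process_group(
    backend="gloo",
    rank=0,
    world_size=1,
    timeout=datetime.timedelta(minutes=30),
)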
According to https://github.com/pytorch/pytorch/issues/150381, the only known workaround is to downgrade to torch 2.7.1. Upgrading to a newer version doesn't help yet; even torch 2.9 appears to be affected.
Torch 2.7.1 should still be compatible with the current OneTrainer, as long as you keep "Compile Transformer Blocks" disabled.
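If you want to try that locally, something along these lines should pin torch back down inside OneTrainer's venv (the path is taken from the log above; the cu128 index URL is an assumption, so match it to whatever CUDA build your current install uses, and note that companion packages such as torchvision may also need matching versions):

C:\OneTrainer\OneTrainer\venv\Scripts\python.exe -m pip install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu128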