[BUG] Can't connect two machines locally
Describe the bug
I have two machines, a PC and a laptop, which I tried to connect to each other (they are on the same Wi-Fi network). I am running the basic example from the Get Started page of the hivemind documentation.
To Reproduce
This is the PC's (master) script:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from tqdm.auto import tqdm
import hivemind
# Create dataset and model, same as in the basic tutorial
# For this basic tutorial, we download only the training set
transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
model = nn.Sequential(nn.Conv2d(3, 16, (5, 5)), nn.MaxPool2d(2, 2), nn.ReLU(),
                      nn.Conv2d(16, 32, (5, 5)), nn.MaxPool2d(2, 2), nn.ReLU(),
                      nn.Flatten(), nn.Linear(32 * 5 * 5, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Create DHT: a decentralized key-value storage shared between peers
dht = hivemind.DHT(start=True)
print("To join the training, use initial_peers =", [str(addr) for addr in dht.get_visible_maddrs()])
# Set up a decentralized optimizer that will average with peers in background
opt = hivemind.Optimizer(
    dht=dht,                    # use a DHT that is connected with other peers
    run_id='my_cifar_run',      # unique identifier of this collaborative run
    batch_size_per_step=32,     # each call to opt.step adds this many samples towards the next epoch
    target_batch_size=10000,    # after peers collectively process this many samples, average weights and begin the next epoch
    optimizer=opt,              # wrap the SGD optimizer defined above
    use_local_updates=True,     # perform optimizer steps with local gradients, average parameters in background
    matchmaking_time=3.0,       # when averaging parameters, gather peers in background for up to this many seconds
    averaging_timeout=10.0,     # give up on averaging if not successful in this many seconds
    verbose=True                # print logs incessantly
)
# Note: if you intend to use GPU, switch to it only after the decentralized optimizer is created
with tqdm() as progressbar:
    while True:
        for x_batch, y_batch in torch.utils.data.DataLoader(trainset, shuffle=True, batch_size=32):
            opt.zero_grad()
            loss = F.cross_entropy(model(x_batch), y_batch)
            loss.backward()
            opt.step()
            progressbar.desc = f"loss = {loss.item():.3f}"
            progressbar.update()
This is the laptop's (slave) script; I copy-paste the address from the PC's output into initial_peers every time:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from tqdm.auto import tqdm
import sys
import hivemind
# Create dataset and model, same as in the basic tutorial
# For this basic tutorial, we download only the training set
transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
model = nn.Sequential(nn.Conv2d(3, 16, (5, 5)), nn.MaxPool2d(2, 2), nn.ReLU(),
                      nn.Conv2d(16, 32, (5, 5)), nn.MaxPool2d(2, 2), nn.ReLU(),
                      nn.Flatten(), nn.Linear(32 * 5 * 5, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Create DHT: a decentralized key-value storage shared between peers
dht = hivemind.DHT(initial_peers=['/ip4/127.0.0.1/tcp/33849/p2p/12D3KooWRTVqWbs5Xmj5Xn4tHETY7SgJfFyFpMHsw3p4WmCAgzqo'], start=True)
# Set up a decentralized optimizer that will average with peers in background
opt = hivemind.Optimizer(
    dht=dht,                    # use a DHT that is connected with other peers
    run_id='my_cifar_run',      # unique identifier of this collaborative run
    batch_size_per_step=32,     # each call to opt.step adds this many samples towards the next epoch
    target_batch_size=10000,    # after peers collectively process this many samples, average weights and begin the next epoch
    optimizer=opt,              # wrap the SGD optimizer defined above
    use_local_updates=True,     # perform optimizer steps with local gradients, average parameters in background
    matchmaking_time=3.0,       # when averaging parameters, gather peers in background for up to this many seconds
    averaging_timeout=10.0,     # give up on averaging if not successful in this many seconds
    verbose=True                # print logs incessantly
)
opt.load_state_from_peers()
# Note: if you intend to use GPU, switch to it only after the optimizer is created
with tqdm() as progressbar:
    while True:
        for x_batch, y_batch in torch.utils.data.DataLoader(trainset, shuffle=True, batch_size=32):
            opt.zero_grad()
            loss = F.cross_entropy(model(x_batch), y_batch)
            loss.backward()
            opt.step()
            progressbar.desc = f"loss = {loss.item():.3f}"
            progressbar.update()
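As an aside, since this script already imports sys, the copy-pasted multiaddr could instead be passed on the command line rather than edited into the script each time. A minimal convenience sketch, assuming a launch like python launch.py <multiaddr>:

import sys
import hivemind

# Take the multiaddr printed by the PC as the first command-line argument,
# e.g. python launch.py /ip4/.../tcp/32993/p2p/12D3KooW...
dht = hivemind.DHT(initial_peers=[sys.argv[1]], start=True)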
The master's output shows the expected behavior (excerpt below):
Files already downloaded and verified
To join the training, use initial_peers = ['/ip4/127.0.0.1/tcp/32993/p2p/12D3KooWA4xVMicye2LEUki4kiL9KR46G2bSqtaTCLVyjw8AHeaq']
Oct 09 16:35:22.909 [INFO] Found no active peers: None
Oct 09 16:35:22.913 [INFO] Initializing optimizer manually since it has no tensors in state dict. To override this, provide initialize_optimizer=False
Oct 09 16:35:22.933 [WARN] [hivemind.averaging.matchmaking.__init__:54] It is recommended to use request_timeout smaller than min_matchmaking_time. Otherwise, matchmaking can cause deadlocks in some rare cases. Please see Matchmaking docstring.
loss = 2.080: 303it [00:02, 114.72it/s]Oct 09 16:35:25.749 [INFO] Transitioning to epoch 1
loss = 2.108: 327it [00:02, 113.77it/s]Oct 09 16:35:25.913 [INFO] my_cifar_run accumulated 576 samples for epoch #1 from 1 peers. ETA 2.56 sec (refresh in 0.64 sec)
loss = 1.896: 399it [00:03, 114.68it/s]Oct 09 16:35:26.556 [INFO] my_cifar_run accumulated 2912 samples for epoch #1 from 1 peers. ETA 1.94 sec (refresh in 0.50 sec)
loss = 1.911: 459it [00:04, 114.34it/s]Oct 09 16:35:27.060 [INFO] my_cifar_run accumulated 4768 samples for epoch #1 from 1 peers. ETA 1.43 sec (refresh in 0.50 sec)
loss = 1.989: 519it [00:04, 114.52it/s]Oct 09 16:35:27.562 [INFO] my_cifar_run accumulated 6592 samples for epoch #1 from 1 peers. ETA 0.92 sec (refresh in 0.50 sec)
loss = 1.817: 567it [00:05, 113.86it/s]Oct 09 16:35:28.068 [INFO] my_cifar_run accumulated 8448 samples for epoch #1 from 1 peers. ETA 0.41 sec (refresh in 0.50 sec)
loss = 1.982: 615it [00:05, 113.49it/s]Oct 09 16:35:28.487 [INFO] Transitioning to epoch 2
Slave (laptop) output:
Files already downloaded and verified
Traceback (most recent call last):
File "/home/schukark/vk/hivemind_test/launch.py", line 23, in <module>
dht = hivemind.DHT(initial_peers=['/ip4/127.0.0.1/tcp/32993/p2p/12D3KooWA4xVMicye2LEUki4kiL9KR46G2bSqtaTCLVyjw8AHeaq'], start=True)
File "/home/schukark/vk/hivemind-1.1.10/hivemind-1.1.10/hivemind/dht/dht.py", line 88, in __init__
self.run_in_background(await_ready=await_ready)
File "/home/schukark/vk/hivemind-1.1.10/hivemind-1.1.10/hivemind/dht/dht.py", line 148, in run_in_background
self.wait_until_ready(timeout)
File "/home/schukark/vk/hivemind-1.1.10/hivemind-1.1.10/hivemind/dht/dht.py", line 151, in wait_until_ready
self._ready.result(timeout=timeout)
File "/home/schukark/vk/hivemind-1.1.10/hivemind-1.1.10/hivemind/utils/mpfuture.py", line 262, in result
return super().result(timeout)
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
hivemind.p2p.p2p_daemon_bindings.utils.P2PDaemonError: Daemon failed to start: 2024/10/09 16:36:57 failed to connect to bootstrap peers
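The "failed to connect to bootstrap peers" message means the laptop's p2p daemon could not dial the address given in initial_peers. A plain TCP connection test from the laptop can help separate a general networking problem (firewall, Wi-Fi client isolation) from a wrong address; the IP below is a placeholder for the PC's actual LAN address, while the port is the one from the multiaddr the PC printed above:

import socket

PC_LAN_IP = "192.168.0.10"   # placeholder: the PC's actual LAN address
DHT_PORT = 32993             # the TCP port from the multiaddr printed by the PC

try:
    with socket.create_connection((PC_LAN_IP, DHT_PORT), timeout=5):
        print("TCP connection to the PC's DHT port succeeded")
except OSError as exc:
    print(f"Could not reach {PC_LAN_IP}:{DHT_PORT}: {exc}")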
Environment
- Python version - 3.10.12 (everything is inside a venv)
- Hivemind version - 1.1.10.post2
- Environment collection script output:
Collecting environment information...
PyTorch version: 2.4.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.22.1
Libc version: glibc-2.35
Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 18
On-line CPU(s) list: 0-17
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 5900X 12-Core Processor
CPU family: 25
Model: 33
Thread(s) per core: 2
Core(s) per socket: 9
Socket(s): 1
Stepping: 2
BogoMIPS: 7400.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr arat umip vaes vpclmulqdq rdpid fsrm
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 288 KiB (9 instances)
L1i cache: 288 KiB (9 instances)
L2 cache: 4.5 MiB (9 instances)
L3 cache: 32 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==2.1.2
[pip3] torch==2.4.1
[pip3] torchvision==0.19.1
[pip3] triton==3.0.0
[conda] Could not collect
Hi @schukark! It looks like you only get /ip4/127.0.0.1/... addresses as initial peers; these refer to localhost, which likely prevents the other worker from connecting. Can you provide an example output of dht.get_visible_maddrs(), please?
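For reference, a minimal sketch of how the master's DHT could be made to listen on all interfaces, so that get_visible_maddrs() also reports a LAN-reachable address. This assumes hivemind.DHT accepts a host_maddrs argument and forwards it to its p2p daemon; it is not taken from the scripts above:

import hivemind

# Listen on all network interfaces instead of only loopback (sketch, assuming
# host_maddrs is forwarded to the underlying libp2p daemon).
dht = hivemind.DHT(host_maddrs=["/ip4/0.0.0.0/tcp/0"], start=True)
print("To join the training, use initial_peers =", [str(addr) for addr in dht.get_visible_maddrs()])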
Hey, this is what I get when I run it on the master:
[<Multiaddr /ip4/127.0.0.1/tcp/41949/p2p/12D3KooWM3MF5nRikSfW3w8Ai2i3hz9k7XPR6shfNevJD2Y7AWeN>]
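Since only a loopback multiaddr shows up there, the laptop cannot use it directly. If the daemon does listen on a non-loopback interface (for example after the host_maddrs change sketched above) but only the loopback address gets printed, one pragmatic workaround is to substitute the PC's LAN IP into the printed multiaddr, keeping the port and peer ID; 192.168.0.10 below is a placeholder for the PC's real address:

import hivemind

# Placeholder: the PC's LAN address (check it on the PC, e.g. with `ip addr`).
PC_LAN_IP = "192.168.0.10"

# Multiaddr printed by the PC, with the loopback IP swapped for the LAN IP.
printed = "/ip4/127.0.0.1/tcp/41949/p2p/12D3KooWM3MF5nRikSfW3w8Ai2i3hz9k7XPR6shfNevJD2Y7AWeN"
initial_peer = printed.replace("127.0.0.1", PC_LAN_IP)

dht = hivemind.DHT(initial_peers=[initial_peer], start=True)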