[BUG] Can't connect two machines locally

Open · schukark opened this issue 1 year ago · 2 comments

Describe the bug
I have two machines, a PC and a laptop, that I tried to connect to each other (both are on the same Wi-Fi network). I am running the basic example from the Get Started page of the hivemind documentation.

To Reproduce
This is the PC's (master) script:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from tqdm.auto import tqdm

import hivemind

# Create dataset and model, same as in the basic tutorial
# For this basic tutorial, we download only the training set
transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)

model = nn.Sequential(nn.Conv2d(3, 16, (5, 5)), nn.MaxPool2d(2, 2), nn.ReLU(),
                      nn.Conv2d(16, 32, (5, 5)), nn.MaxPool2d(2, 2), nn.ReLU(),
                      nn.Flatten(), nn.Linear(32 * 5 * 5, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Create DHT: a decentralized key-value storage shared between peers
dht = hivemind.DHT(start=True)
print("To join the training, use initial_peers =", [str(addr) for addr in dht.get_visible_maddrs()])

# Set up a decentralized optimizer that will average with peers in background
opt = hivemind.Optimizer(
    dht=dht,                  # use a DHT that is connected with other peers
    run_id='my_cifar_run',    # unique identifier of this collaborative run
    batch_size_per_step=32,   # each call to opt.step adds this many samples towards the next epoch
    target_batch_size=10000,  # after peers collectively process this many samples, average weights and begin the next epoch
    optimizer=opt,            # wrap the SGD optimizer defined above
    use_local_updates=True,   # perform optimizer steps with local gradients, average parameters in background
    matchmaking_time=3.0,     # when averaging parameters, gather peers in background for up to this many seconds
    averaging_timeout=10.0,   # give up on averaging if not successful in this many seconds
    verbose=True              # print logs incessantly
)

# Note: if you intend to use GPU, switch to it only after the decentralized optimizer is created
with tqdm() as progressbar:
    while True:
        for x_batch, y_batch in torch.utils.data.DataLoader(trainset, shuffle=True, batch_size=32):
            opt.zero_grad()
            loss = F.cross_entropy(model(x_batch), y_batch)
            loss.backward()
            opt.step()

            progressbar.desc = f"loss = {loss.item():.3f}"
            progressbar.update()

This is the laptop's (slave) script; I also copy-paste the address from the PC every time:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from tqdm.auto import tqdm
import sys

import hivemind

# Create dataset and model, same as in the basic tutorial
# For this basic tutorial, we download only the training set
transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)

model = nn.Sequential(nn.Conv2d(3, 16, (5, 5)), nn.MaxPool2d(2, 2), nn.ReLU(),
                      nn.Conv2d(16, 32, (5, 5)), nn.MaxPool2d(2, 2), nn.ReLU(),
                      nn.Flatten(), nn.Linear(32 * 5 * 5, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Create DHT: a decentralized key-value storage shared between peers
dht = hivemind.DHT(initial_peers=['/ip4/127.0.0.1/tcp/33849/p2p/12D3KooWRTVqWbs5Xmj5Xn4tHETY7SgJfFyFpMHsw3p4WmCAgzqo'], start=True)

# Set up a decentralized optimizer that will average with peers in background
opt = hivemind.Optimizer(
    dht=dht,                  # use a DHT that is connected with other peers
    run_id='my_cifar_run',    # unique identifier of this collaborative run
    batch_size_per_step=32,   # each call to opt.step adds this many samples towards the next epoch
    target_batch_size=10000,  # after peers collectively process this many samples, average weights and begin the next epoch
    optimizer=opt,            # wrap the SGD optimizer defined above
    use_local_updates=True,   # perform optimizer steps with local gradients, average parameters in background
    matchmaking_time=3.0,     # when averaging parameters, gather peers in background for up to this many seconds
    averaging_timeout=10.0,   # give up on averaging if not successful in this many seconds
    verbose=True              # print logs incessantly
)

opt.load_state_from_peers()

# Note: if you intend to use GPU, switch to it only after the optimizer is created
with tqdm() as progressbar:
    while True:
        for x_batch, y_batch in torch.utils.data.DataLoader(trainset, shuffle=True, batch_size=32):
            opt.zero_grad()
            loss = F.cross_entropy(model(x_batch), y_batch)
            loss.backward()
            opt.step()

            progressbar.desc = f"loss = {loss.item():.3f}"
            progressbar.update()

The master's output shows the expected behavior (extract below):

Files already downloaded and verified
To join the training, use initial_peers = ['/ip4/127.0.0.1/tcp/32993/p2p/12D3KooWA4xVMicye2LEUki4kiL9KR46G2bSqtaTCLVyjw8AHeaq']
Oct 09 16:35:22.909 [INFO] Found no active peers: None
Oct 09 16:35:22.913 [INFO] Initializing optimizer manually since it has no tensors in state dict. To override this, provide initialize_optimizer=False
Oct 09 16:35:22.933 [WARN] [hivemind.averaging.matchmaking.__init__:54] It is recommended to use request_timeout smaller than min_matchmaking_time. Otherwise, matchmaking can cause deadlocks in some rare cases. Please see Matchmaking docstring.
loss = 2.080: 303it [00:02, 114.72it/s]Oct 09 16:35:25.749 [INFO] Transitioning to epoch 1
loss = 2.108: 327it [00:02, 113.77it/s]Oct 09 16:35:25.913 [INFO] my_cifar_run accumulated 576 samples for epoch #1 from 1 peers. ETA 2.56 sec (refresh in 0.64 sec)
loss = 1.896: 399it [00:03, 114.68it/s]Oct 09 16:35:26.556 [INFO] my_cifar_run accumulated 2912 samples for epoch #1 from 1 peers. ETA 1.94 sec (refresh in 0.50 sec)
loss = 1.911: 459it [00:04, 114.34it/s]Oct 09 16:35:27.060 [INFO] my_cifar_run accumulated 4768 samples for epoch #1 from 1 peers. ETA 1.43 sec (refresh in 0.50 sec)
loss = 1.989: 519it [00:04, 114.52it/s]Oct 09 16:35:27.562 [INFO] my_cifar_run accumulated 6592 samples for epoch #1 from 1 peers. ETA 0.92 sec (refresh in 0.50 sec)
loss = 1.817: 567it [00:05, 113.86it/s]Oct 09 16:35:28.068 [INFO] my_cifar_run accumulated 8448 samples for epoch #1 from 1 peers. ETA 0.41 sec (refresh in 0.50 sec)
loss = 1.982: 615it [00:05, 113.49it/s]Oct 09 16:35:28.487 [INFO] Transitioning to epoch 2

Slave (laptop) output:

Files already downloaded and verified
Traceback (most recent call last):
  File "/home/schukark/vk/hivemind_test/launch.py", line 23, in <module>
    dht = hivemind.DHT(initial_peers=['/ip4/127.0.0.1/tcp/32993/p2p/12D3KooWA4xVMicye2LEUki4kiL9KR46G2bSqtaTCLVyjw8AHeaq'], start=True)
  File "/home/schukark/vk/hivemind-1.1.10/hivemind-1.1.10/hivemind/dht/dht.py", line 88, in __init__
    self.run_in_background(await_ready=await_ready)
  File "/home/schukark/vk/hivemind-1.1.10/hivemind-1.1.10/hivemind/dht/dht.py", line 148, in run_in_background
    self.wait_until_ready(timeout)
  File "/home/schukark/vk/hivemind-1.1.10/hivemind-1.1.10/hivemind/dht/dht.py", line 151, in wait_until_ready
    self._ready.result(timeout=timeout)
  File "/home/schukark/vk/hivemind-1.1.10/hivemind-1.1.10/hivemind/utils/mpfuture.py", line 262, in result
    return super().result(timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
hivemind.p2p.p2p_daemon_bindings.utils.P2PDaemonError: Daemon failed to start: 2024/10/09 16:36:57 failed to connect to bootstrap peers

Environment

  • Python version - 3.10.12 (everything is inside a venv)
  • Hivemind version - 1.1.10.post2
  • Environment collection script output:
Collecting environment information...
PyTorch version: 2.4.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             18
On-line CPU(s) list:                0-17
Vendor ID:                          AuthenticAMD
Model name:                         AMD Ryzen 9 5900X 12-Core Processor
CPU family:                         25
Model:                              33
Thread(s) per core:                 2
Core(s) per socket:                 9
Socket(s):                          1
Stepping:                           2
BogoMIPS:                           7400.00
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr arat umip vaes vpclmulqdq rdpid fsrm
Hypervisor vendor:                  Microsoft
Virtualization type:                full
L1d cache:                          288 KiB (9 instances)
L1i cache:                          288 KiB (9 instances)
L2 cache:                           4.5 MiB (9 instances)
L3 cache:                           32 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass:    Vulnerable
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==2.1.2
[pip3] torch==2.4.1
[pip3] torchvision==0.19.1
[pip3] triton==3.0.0
[conda] Could not collect

schukark avatar Oct 09 '24 13:10 schukark

Hi @schukark! It looks like you only get /ip4/127.0.0.1 addresses as initial peers; these refer to localhost, which likely prevents the peer from connecting to the other worker. Can you provide an example output of dht.get_visible_maddrs(), please?
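
For reference, a minimal sketch of a possible workaround for the localhost-only case, assuming the PC's LAN address is 192.168.1.10 (a placeholder) and using the host_maddrs/announce_maddrs arguments described in the hivemind DHT documentation:

import hivemind

LAN_IP = "192.168.1.10"   # placeholder: replace with the PC's actual LAN address
PORT = 33849              # placeholder: any free TCP port

# Bind on all interfaces and explicitly announce the LAN address, so that
# get_visible_maddrs() returns a multiaddr reachable from other machines
# on the same network instead of only 127.0.0.1.
dht = hivemind.DHT(
    host_maddrs=[f"/ip4/0.0.0.0/tcp/{PORT}"],
    announce_maddrs=[f"/ip4/{LAN_IP}/tcp/{PORT}"],
    start=True,
)

# Print the non-loopback multiaddrs; these are the values the laptop
# should pass as initial_peers.
visible = [str(addr) for addr in dht.get_visible_maddrs()]
print("To join the training, use initial_peers =",
      [addr for addr in visible if "/ip4/127.0.0.1/" not in addr])

The laptop would then pass the printed non-loopback multiaddr as initial_peers instead of the 127.0.0.1 entry.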

mryab avatar Oct 09 '24 13:10 mryab

Hey, this is what I get when I run it on the master:

[<Multiaddr /ip4/127.0.0.1/tcp/41949/p2p/12D3KooWM3MF5nRikSfW3w8Ai2i3hz9k7XPR6shfNevJD2Y7AWeN>]

schukark avatar Oct 09 '24 15:10 schukark