Investigate QUIC (v1) reliability
Our network layer supports QUIC like this: hivemind.DHT(..., host_maddrs=['/ip4/1.2.3.4/udp/1337/quic'])
However, petals servers currently default to TCP-only host maddrs unless the user specifies --host_maddrs.
In other hivemind-based experiments, we found that QUIC is superior to TCP when operating behind a household NAT, because UDP hole punching is more reliable than TCP hole punching. It would be great if we could enable it by default.
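For reference, here is a minimal sketch of what "enabled by default" could look like: a DHT that listens on both TCP and QUIC. The port-0 maddrs below are placeholders, not the current defaults.

```python
import hivemind

# Sketch, not the current defaults: listen on both TCP and UDP/QUIC.
# Port 0 lets the OS pick a free port; the /quic suffix selects QUIC in libp2p.
dht = hivemind.DHT(
    start=True,
    host_maddrs=["/ip4/0.0.0.0/tcp/0", "/ip4/0.0.0.0/udp/0/quic"],
)
print(dht.get_visible_maddrs())
dht.shutdown()
```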
The reason why QUIC is not in the default maddrs is that we haven't tested it thoroughly enough, and we fear that it might cause throughput issues.
Quest: try running a QUIC-only peer in the public swarm, bombard it with requests from your PC / Colab / some publicly accessible machine, and check if it works alright.
Criteria (suggestion):
- cycles per second, forward and inference (vs TCP)
- retries / relay fallbacks (vs TCP)
We should check for cases where QUIC makes the system unusable (10x slower or not working at all).
If some cases are slower by tens of percent, this is fine. If there are cases where QUIC is 2x slower or similar, we can check whether running a server with both TCP and QUIC is still as fast as TCP-only (see the sketch below) - and if so, it is fine to enable QUIC in main.
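A hypothetical dual-stack invocation for that fallback check (the port, and whether --host_maddrs accepts several space-separated multiaddrs this way, are assumptions to verify against the actual CLI):

```bash
python -m petals.cli.run_server bigscience/bloom-petals --block_indices 2:3 \
    --host_maddrs /ip4/0.0.0.0/tcp/31337 /ip4/0.0.0.0/udp/31337/quic
```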
👋
I solemnly swear to add a quick summary of:
- quic in petals / hivemind / libp2p (how it currently connects with our code)
- links to similar projects (here or via discord DM)
How does the petals server interface with quic-go?
You run python -m petals.cli.run_server $OTHERSTUFF --host_maddrs /ip4/so/me/thing
(click to see intermediate steps)
- it parses host_maddrs here and creates a petals.server.Server
- the server creates a hivemind.DHT and passes host_maddrs as **kwargs
- hivemind.dht.DHT is a background process that implements the Kademlia DHT - but we use its network connections for all peer-to-peer communication
- inside hivemind.DHT, it creates a P2P instance and passes host_maddrs there
- hivemind.p2p.P2P is a pythonic interface to a go-libp2p daemon
- inside hivemind.p2p.P2P, it runs a go-libp2p daemon and passes host_maddrs to that
- the go-libp2p daemon runs a libp2p host in golang and interfaces with python
Finally, the go-libp2p-daemon creates a libp2p host here and passes the host_maddrs option to that host. If host_maddrs contains a QUIC multiaddress, libp2p will use quic-go here.
... and then all our networking runs through that libp2p host.
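To poke at this chain without running a full server, one could create the P2P wrapper directly and hand it a QUIC-only maddr. This is a sketch, under the assumption that hivemind.p2p.P2P.create accepts host_maddrs the same way hivemind.DHT forwards it:

```python
import asyncio
import hivemind

async def main():
    # Sketch: create the python wrapper around the go-libp2p daemon directly,
    # passing a QUIC-only multiaddr (the same host_maddrs plumbing, one level down).
    p2p = await hivemind.p2p.P2P.create(host_maddrs=["/ip4/0.0.0.0/udp/0/quic"])
    print(await p2p.get_visible_maddrs())
    await p2p.shutdown()

asyncio.run(main())
```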
Why is the code structured like this? (click to expand)
We need libp2p because it can do things like STUN / TURN in a swarm, without dedicated servers. We want to run in python because most ML/NLP researchers speak python. Libp2p has two well-maintained implementations in go and rust. There is a python version (here), but it lacks most of the things we need (see their readme). So, the best option we found was to use go-libp2p and interface it from python. Hence, go-libp2p-daemon.
On the python side, we manage the go-libp2p-daemon in hivemind.p2p.P2P. When it starts, it launches the libp2p daemon in the background and maintains a connection to it. When P2P shuts down (or is killed for any reason), that connection closes and the go-libp2p-daemon terminates.
Finally, hivemind.DHT happens to be the first python component that uses networking, so it creates its own P2P inside. Other components (e.g. the actual neural network computation) will reuse the same underlying go-libp2p-daemon.
Maybe QUIC was faulty because of too many python abstractions on top of it? (click to expand)
I don't think so - but I could be wrong. Our TCP-based code runs through the same python code as QUIC, up to the go-libp2p daemon - and TCP runs fast enough. So if the python code is fast enough with TCP, why would the exact same code run slower with QUIC? The quic-go code is used deep inside libp2p, after all the python-based stuff is done.
Where to begin?
It is up to you, but one way you could start is by running a minimalistic Petals client-server pair.
If you have a GPU machine, you can run a minimalistic server:
CUDA_VISIBLE_DEVICES=0 python -m petals.cli.run_server bigscience/bloom-petals --block_indices 2:3
It will print, among other things, this:
Jan 13 00:30:14.917 [INFO] Connecting to the public swarm, peer_id = 12D3KooWG3RandomlyGeneratedNeedlesslyLongString9ryN
Please copy your server's peer id (12D..yN in the example) for the next step.
And then, on another machine, run a jupyter notebook and test that your server works:
```python
peer_id_string = "PASTE YOUR PEER ID HERE"
block_uid = "bigscience/bloom-petals.2"

import torch
import hivemind
import petals
from tqdm.auto import trange
import petals.client.sequential_autograd as lowlv_stuff

try:
    # connect to the public swarm in client mode and reuse its p2p daemon
    dht = hivemind.DHT(start=True, client_mode=True, initial_peers=petals.constants.PUBLIC_INITIAL_PEERS)
    p2p = await dht.replicate_p2p()

    # one dummy batch of activations for the block (hidden size 14336, bfloat16) plus empty prompts
    dummy_inputs = [torch.rand(1, 128, 14336, dtype=torch.bfloat16), torch.empty(0, dtype=torch.bfloat16)]

    # open a connection handler stub to your server and fetch its metadata
    peer_id = hivemind.PeerID.from_base58(peer_id_string)
    stub = lowlv_stuff.TransformerConnectionHandler.get_stub(p2p, peer_id)
    response = await stub.rpc_info(hivemind.proto.runtime_pb2.ExpertUID(uid=block_uid))
    server_info = hivemind.MSGPackSerializer.loads(response.serialized_info)

    # run 10 forward passes through the remote block
    for i in trange(10):
        (outputs,) = await lowlv_stuff.run_remote_forward(
            block_uid, stub, server_info, *dummy_inputs, timeout=15)
    print("It works!")
finally:
    print("shutting down")
    await p2p.shutdown()
    dht.shutdown()
```
If you're allergic to jupyter, just wrap everything with asyncio.run and you can run it in a normal python script, as in the sketch below.
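Something along these lines (the `...` stands for the stub / run_remote_forward calls from the cell above):

```python
import asyncio
import hivemind
import petals

async def main():
    # Same logic as the notebook cell: everything that used top-level await
    # now lives inside this coroutine.
    dht = hivemind.DHT(start=True, client_mode=True,
                       initial_peers=petals.constants.PUBLIC_INITIAL_PEERS)
    try:
        p2p = await dht.replicate_p2p()
        ...  # the stub / run_remote_forward calls from the cell above go here
        await p2p.shutdown()
    finally:
        dht.shutdown()

asyncio.run(main())
```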
If you don't have a GPU machine
We have a tiny test model that you can use to run a pocket swarm on your local machine:
https://github.com/bigscience-workshop/petals/blob/main/.github/workflows/run-tests.yaml#L69-L112
(run with export HF_TAG=main for simplicity)
And then?
... and then try how this works when the server listens on QUIC (and not TCP). It should somewhat work if you simply set --host_maddrs to a QUIC multiaddr when running petals.cli.run_server - but it's not clear whether it will be fast and/or reliable.
NB: when running with a public swarm, I'd recommend adding --use_auto_relay=False to the server. If you don't, your QUIC-only server will be able to communicate through a TCP relay even if QUIC doesn't actually work (see the example below).
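Putting the pieces from this issue together, a QUIC-only test server might be launched like this (the port is a placeholder):

```bash
CUDA_VISIBLE_DEVICES=0 python -m petals.cli.run_server bigscience/bloom-petals --block_indices 2:3 \
    --host_maddrs /ip4/0.0.0.0/udp/31337/quic --use_auto_relay=False
```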
To reiterate, the most important things to check for are:
- cycles per second, forward and inference (vs TCP)
- retries / relay fallbacks (vs TCP), especially when sending large messages
Ideally, if you could write a mini-benchmark that measures these values, we could both run it in a bunch of different settings (e.g. client in Colab, server behind NAT, etc.) to test it.
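For instance, a bare-bones starting point could look like this - a sketch to run in the same notebook, reusing stub, server_info, dummy_inputs, block_uid, and lowlv_stuff from the snippet above, not a polished benchmark:

```python
import time

# Measure raw forward throughput and count failures (retries / timeouts).
# Assumes stub, server_info, dummy_inputs, block_uid, lowlv_stuff from above.
n_iters, n_failures = 50, 0
start = time.perf_counter()
for _ in range(n_iters):
    try:
        await lowlv_stuff.run_remote_forward(
            block_uid, stub, server_info, *dummy_inputs, timeout=15)
    except Exception:
        n_failures += 1
elapsed = time.perf_counter() - start
print(f"{n_iters / elapsed:.2f} forward passes / sec, {n_failures} failures out of {n_iters}")
```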
If somehow everything works out of the box, just create a PR that adds quic by default and mission accomplished :)