
[Feature Request] Direct server-server communication ("and then" clause)

Open justheuristic opened this issue 2 years ago • 2 comments

Based on conversations with @borzunov , @dbaranchuk

Premise: currently in rpc_inference, the client sends inputs to a given server, collects that server's outputs, then forwards them itself as inputs to the next server. This round trip is needed for full fault tolerance in case one of the servers disconnects. A faster option is to send data directly from server 1 to server 2, if we can do that without compromising fault tolerance -- and without insane code complexity.

Proposed solution: in rpc_inference, whenever the client sends a pb2 request, it can add a metadata key, e.g. "next_peer", denoting the peer id of the next server. When a server finishes computing that request, it immediately sends the results to the specified peer id, marked as "hidden states for session {inference_session_id}" -- assuming that the next peer currently takes part in the same session.
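A minimal client-side sketch of what attaching such a hint could look like. The key names ("session_id", "next_peer") and the function itself are illustrative assumptions, not the actual Petals pb2 schema:

```python
import json
from typing import Optional


def build_inference_metadata(session_id: str, next_peer_id: Optional[str] = None) -> bytes:
    """Sketch of client-side request metadata for rpc_inference.

    If next_peer_id is given, the serving peer is hinted to push its outputs
    directly to that peer. Key names here are hypothetical, chosen only to
    illustrate the proposal.
    """
    metadata = {"session_id": session_id}
    if next_peer_id is not None:
        metadata["next_peer"] = next_peer_id  # optional server-to-server hint
    return json.dumps(metadata).encode()


# Example: a request whose outputs should be pushed to a (hypothetical) peer id.
meta = json.loads(build_inference_metadata("sess-1", next_peer_id="QmPeerB").decode())
```

Because the key is optional metadata, clients that omit it fall back to the current client-relayed behavior unchanged.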

On the receiving end, each server awaits asyncio.wait([request_from_client, request_from_previous_server]) and acts on whichever arrives first. If the request from the previous server arrives first, the current server begins processing it immediately, but still waits for the client's data to verify that the results are valid.
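The race between the two input sources can be sketched with asyncio.wait and FIRST_COMPLETED; all names below are hypothetical stand-ins for the real request streams:

```python
import asyncio


async def await_first_input(request_from_client, request_from_prev_server):
    """Await two input sources; return (source, payload) for whichever wins.

    Both arguments are awaitables standing in for the client's request and
    the previous server's optimistic push. The losing task is left pending:
    in the real protocol the server keeps waiting for the client's copy to
    verify the pushed states.
    """
    client_task = asyncio.ensure_future(request_from_client)
    server_task = asyncio.ensure_future(request_from_prev_server)
    done, _pending = await asyncio.wait(
        {client_task, server_task}, return_when=asyncio.FIRST_COMPLETED
    )
    if server_task in done:
        # Start computing on the pushed hidden states right away.
        return "prev_server", server_task.result()
    return "client", client_task.result()


async def demo():
    async def from_client():
        await asyncio.sleep(0.05)  # client path is slower (extra hop)
        return "client_payload"

    async def from_server():
        await asyncio.sleep(0.01)  # direct server push arrives first
        return "server_payload"

    return await await_first_input(from_client(), from_server())


source, payload = asyncio.run(demo())
```

Here the direct push wins the race, so computation can start roughly one client round trip earlier than in the current scheme.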

Sending data to the next server is not guaranteed: the sending server simply fires the request and forgets about it. Notably, the server still returns hidden states to the client as usual. This extra communication is fine because rpc_inference does not use much network throughput ("mbps") and is mostly sensitive to latency ("ping").
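The fire-and-forget push could look like the following sketch, where `send_fn` is a hypothetical coroutine standing in for the actual RPC call:

```python
import asyncio


async def push_to_next_peer(hidden_states, next_peer_id, send_fn):
    """Best-effort push of hidden states to the next server.

    Failures and timeouts are swallowed: the client receives the same states
    anyway and will re-send them itself, so delivery here is purely an
    optimization, never a correctness requirement.
    """
    try:
        await asyncio.wait_for(send_fn(next_peer_id, hidden_states), timeout=1.0)
    except (asyncio.TimeoutError, OSError):
        pass  # forget about it; the client path guarantees delivery


def fire_and_forget(hidden_states, next_peer_id, send_fn):
    # Schedule without awaiting, so the serving coroutine can return results
    # to the client immediately instead of blocking on server-to-server I/O.
    return asyncio.ensure_future(push_to_next_peer(hidden_states, next_peer_id, send_fn))


async def demo():
    delivered = []

    async def send_fn(peer_id, states):  # dummy RPC recording what was sent
        delivered.append((peer_id, states))

    task = fire_and_forget([0.1, 0.2], "peer_B", send_fn)
    await task  # awaited here only so the demo can observe the delivery
    return delivered


delivered = asyncio.run(demo())
```

In production the task would not be awaited on the request path; exceptions inside it must be consumed (as above) so an abandoned push cannot crash or stall the server.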

Notes:

  • the client can request a different next_peer after each inference step. This happens if one of the "next" servers disconnects from the inference session. Servers should send each hidden_states tensor to the server specified in the current request.next_peer
  • if a server receives a request that doesn't correspond to any active session, it simply ignores it. This is fine because, if the request was valid, the client will still send the same data later
  • [security] since the previous server can be faulty or malicious, the "next peer" server should check that the data it received from the previous peer equals the data it eventually receives from the client; once we implement full verification, the server can simply sign the next-peer message so it can serve as proof of (benign or malicious) activity
    • if a mismatch occurs, a server may have to re-send the inference message; we can support this by specifying the current length in the server's response
  • [security] the server-to-server traffic caused by a client is strictly less than its client-to-server traffic, which eliminates potential misuse for DDoS amplification
  • the current best routing strategy would still work decently with this algorithm because it uses a strictly non-optimistic (time >= actual) performance model
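The security check from the list above (compare the previous peer's push against the client's authoritative copy) can be sketched as a digest comparison. This is a plain-Python illustration; a real implementation would hash the serialized tensor bytes:

```python
import hashlib
import json


def states_digest(hidden_states) -> str:
    """Digest of hidden states for equality checks (illustrative sketch)."""
    return hashlib.sha256(json.dumps(hidden_states).encode()).hexdigest()


def verify_pushed_states(pushed, from_client) -> bool:
    """True iff the optimistically pushed states match the client's data.

    On a mismatch, the server must discard the speculative computation and
    recompute from the client's inputs, since the previous peer may be
    faulty or malicious.
    """
    return states_digest(pushed) == states_digest(from_client)
```

Comparing digests rather than raw tensors also gives the server a compact value to sign later, once full verification is implemented.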

@dbaranchuk also proposed a clever alternative solution, where each server runs its own fault-tolerant inference session to the subsequent servers. This can be a better solution if we find a way to limit the memory and bandwidth usage on a server.

justheuristic avatar Jan 19 '23 04:01 justheuristic

@slush0 on Discord:

Regarding this feature request, I have one comment/idea: it would be great to be able to set a "family" for a node and use it as a hint for clients when picking the "next hop". For example, I'll build two or three GPU rigs out of cheap components, and they'll be sitting next to each other on a local LAN. Having a way to tell clients "pick one of these peer ids for the next hop" would significantly reduce the latency of the whole round.

borzunov avatar Feb 20 '23 10:02 borzunov

Does this feature request also apply to training?

I was looking at the description and hoping to get some clarification on the idea of the next server waiting for data from the client and using it to ensure the results are valid. What data is this, and what is the rationale for using it to confirm authenticity? Is this to guard against rogue servers sending data they were not asked to compute by a client? Would this be done at random or all the time? If it is the hidden states, then aren't we still involving the client and increasing hops?

Some background: I have noticed that much of the training cost comes from egress, and I am trying to figure out whether this can be optimized. Otherwise, it might actually be cheaper to "fine-tune" on a personal swarm of 4 A100s for $7.50 an hour vs. paying $1.00 per iteration (in egress costs), with each iteration taking 5-10 minutes when using a large number of prompts.

smeyerhot avatar Apr 02 '23 19:04 smeyerhot