optimize how we submit to witnesses
Feature request description/rationale
As I understand it, when we submit events to witnesses, we use a round-robin approach that, for N=5 witnesses, works like this:
```
# first loop - get initial receipts
receipts = []
for each witness:
    call witness with existing receipts, and ask for the new receipt
    add new receipt to receipts array

# second loop - share other receipts
for each witness except last:
    submit receipts from all other witnesses
```
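For concreteness, here is roughly that flow as plain Python. The `query_witness` helper is hypothetical (a stand-in for keripy's actual witness client, which I'm not reproducing here); the point is just the serial structure:

```python
# Minimal sketch of the current sequential flow. query_witness is a
# hypothetical helper that delivers the event plus the given receipts
# to one witness and returns that witness's own receipt.
def receipt_sequentially(event, witnesses, query_witness):
    receipts = []
    # first loop: each witness sees the receipts gathered so far
    for wit in witnesses:
        receipts.append(query_witness(wit, event, receipts))
    # second loop: share the full receipt set with every witness except
    # the last one, which already saw the others during the first loop
    for wit in witnesses[:-1]:
        query_witness(wit, event, receipts)
    return receipts
```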
This results in the following:
metric | count | explanation |
---|---|---|
requests to witnesses | 9 | every witness is called twice except the last witness in the first loop, which sees everybody else's receipts on its first (and only) call |
bandwidth, requests to witnesses | 9 * request_overhead + 20 * receipt_size | each witness ends up seeing 4 other receipts |
bandwidth, responses from witnesses | 9 * response_overhead + 5 * receipt_size | each witness returns one receipt; every witness except the last responds twice |
latency | 9 * request_response_roundtrip | 9 calls made, one after the other |
timeout | 9 * reasonable_timeout_per_call | cumulative timeout |
I suggest we change to the following algorithm:
```
call all witnesses in parallel and ask for a receipt
wait until toad witnesses have responded, or until timeout
call all witnesses with all other receipts
```
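Here is a minimal asyncio sketch of what I have in mind, again assuming a hypothetical async `query_witness(wit, event, receipts)` coroutine rather than keripy's actual API (keripy's real concurrency machinery is hio, so asyncio here is purely illustrative):

```python
import asyncio

# Sketch only: query_witness is a hypothetical coroutine that delivers
# the event plus the given receipts to one witness and returns that
# witness's own receipt; it is not keripy's actual API.
async def receipt_in_parallel(event, witnesses, query_witness,
                              toad, timeout=5.0):
    # phase 1: ask every witness for a receipt at the same time
    tasks = [asyncio.create_task(query_witness(wit, event, []))
             for wit in witnesses]
    receipts = []
    for fut in asyncio.as_completed(tasks, timeout=timeout):
        try:
            receipts.append(await fut)
        except asyncio.TimeoutError:
            break                   # overall timeout hit; go with what we have
        except Exception:
            continue                # a failed witness just yields no receipt
        if len(receipts) >= toad:
            break                   # threshold met; no need to keep waiting
    for task in tasks:
        task.cancel()               # abandon any stragglers

    # phase 2: share the gathered receipts with every witness at once
    await asyncio.gather(
        *(query_witness(wit, event, receipts) for wit in witnesses),
        return_exceptions=True)     # tolerate individual failures
    return receipts
```

(Waiting for all witnesses in phase 1 instead of stopping at `toad` would avoid some of the escrow concern discussed below, at the cost of worst-case latency.)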
If I am analyzing correctly, this would give us the following metrics:
metric | count | explanation |
---|---|---|
requests to witnesses | 10 | every witness is called twice |
bandwidth, requests to witnesses | 10 * request_overhead + 20 * receipt_size | each witness ends up seeing 4 other receipts |
bandwidth, responses from witnesses | 10 * response_overhead + 5 * receipt_size | each witness returns one receipt but responds twice |
latency | 2 * request_response_roundtrip | 2 phases, each fully parallel |
timeout | 2 * reasonable_timeout_per_call | cumulative timeout |
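As a sanity check on both tables, the counts generalize to N witnesses as plain arithmetic (nothing keripy-specific here):

```python
# Totals for N witnesses under each scheme, matching the tables above.
def sequential_metrics(n):
    return {"requests": 2 * n - 1,           # last witness skips the second loop
            "receipts_sent": n * (n - 1),    # each witness sees the n - 1 others
            "serial_roundtrips": 2 * n - 1}  # every call waits for the previous one

def parallel_metrics(n):
    return {"requests": 2 * n,               # every witness called in both phases
            "receipts_sent": n * (n - 1),
            "serial_roundtrips": 2}          # two phases, each fully parallel

print(sequential_metrics(5))  # {'requests': 9, 'receipts_sent': 20, 'serial_roundtrips': 9}
print(parallel_metrics(5))    # {'requests': 10, 'receipts_sent': 20, 'serial_roundtrips': 2}
```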
Is my analysis accurate? If so, we consume almost identical bandwidth (10 calls vs 9, with the same amount of data transferred), but latency drops by almost a factor of 5 (9 sequential round trips down to 2). This would speed up many KERI operations as users perceive them.
A possible downside to this approach is that error handling gets more complicated, and the chances of events sitting in escrow go up (unless we don't begin phase 2 until all witnesses respond or time out). And of course, running things in parallel is a bit more complicated. But making it noticeably faster to talk to witnesses still seems to me like it might be worth the tradeoff.
What do you think?