
optimize how we submit to witnesses


Feature request description/rationale

As I understand it, when we submit events to witnesses, we use a round-robin approach that, for N = 5 witnesses, works like this (pseudocode; `request_receipt` and `send_receipts` stand in for whatever the actual calls are):

```python
# First loop: gather initial receipts.
receipts = []
for witness in witnesses:
    # Call the witness with the receipts collected so far, and ask for its new receipt.
    receipts.append(request_receipt(witness, receipts))

# Second loop: share the remaining receipts.
# The last witness already saw everybody else's receipts in the first loop.
for witness in witnesses[:-1]:
    send_receipts(witness, receipts)
```

This results in the following:

| metric | count | explanation |
| --- | --- | --- |
| requests to witnesses | 9 | every witness is called twice except the final witness in the first loop, which sees everybody else's receipts on its first call |
| bandwidth, requests to witnesses | 9 * request_overhead + 20 * receipt_size | each witness ends up seeing the 4 other receipts |
| bandwidth, responses from witnesses | 9 * response_overhead + 5 * receipt_size | each witness returns one receipt; all but the last respond twice |
| latency | 9 * request_response_roundtrip | 9 calls made, one after the other |
| timeout | 9 * reasonable_timeout_per_call | cumulative timeout |

I suggest we change to the following algorithm:

1. call all witnesses in parallel and ask for a receipt
2. wait until `toad` witnesses (the threshold of accountable duplicity) have responded, or until a timeout
3. call all witnesses in parallel with all the other receipts
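To make the idea concrete, here is a minimal sketch of what the two phases could look like. It uses plain asyncio purely for illustration (keripy is built on hio, so a real patch would use its coroutine scheduler instead), and `request_receipt` / `send_receipts` are the same hypothetical stand-ins as above:

```python
import asyncio

async def gather_receipts(witnesses, toad, timeout):
    # Phase 1: ask every witness for a receipt in parallel, returning as soon
    # as `toad` receipts have arrived or the timeout expires.
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    pending = {asyncio.create_task(request_receipt(w)) for w in witnesses}
    receipts = []
    while pending and len(receipts) < toad:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        done, pending = await asyncio.wait(
            pending, timeout=remaining, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            if task.exception() is None:
                receipts.append(task.result())
    for task in pending:
        task.cancel()  # or let stragglers finish and escrow their receipts
    return receipts

async def share_receipts(witnesses, receipts):
    # Phase 2: push the collected receipts to every witness in parallel.
    # Failures are tolerated per-witness rather than aborting the whole phase.
    await asyncio.gather(
        *(send_receipts(w, receipts) for w in witnesses),
        return_exceptions=True)
```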

If I am analyzing correctly, this would give us the following metrics:

| metric | count | explanation |
| --- | --- | --- |
| requests to witnesses | 10 | every witness is called twice |
| bandwidth, requests to witnesses | 10 * request_overhead + 20 * receipt_size | each witness ends up seeing the 4 other receipts |
| bandwidth, responses from witnesses | 10 * response_overhead + 5 * receipt_size | each witness returns one receipt but responds twice |
| latency | 2 * request_response_roundtrip | 2 phases, each fully parallel |
| timeout | 2 * reasonable_timeout_per_call | cumulative timeout |

Is my analysis accurate? If so, we consume almost identical bandwidth (10 calls vs. 9, and the same amount of receipt data transferred), but latency drops from 9 sequential roundtrips to 2, almost a factor of 5. This would make many KERI operations feel much faster to users.

A possible downside to this approach is that error handling gets more complicated, and the chance of events sitting in escrow goes up (unless we don't begin phase 2 until all witnesses have responded or timed out). And of course, running things in parallel adds some complexity. But making it noticeably faster to talk to witnesses still seems to me like it might be worth the tradeoff.
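In that stricter variant, phase 1 would simply be gated on every witness (or the timeout) rather than on `toad` responses; in the asyncio sketch above, the waiting loop collapses to something like this (`tasks` being the set of per-witness requests):

```python
# Wait for all witnesses (or the timeout) before starting phase 2, so no
# witness receives a partial receipt set that would then sit in escrow.
done, pending = await asyncio.wait(
    tasks, timeout=timeout, return_when=asyncio.ALL_COMPLETED)
```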

What do you think?

— dhh1128, Aug 29, 2024