Latency degrades with number of Silos (even on local grains)
I have a scenario in which grains exhibit a parent/child relationship: one grain has a number of child grains that talk to each other until the operation is finished. The following diagram illustrates this, as well as my benchmark setup:
The key requirement is that these child grains must live locally, as the application is highly latency-sensitive. Their lifetimes are generally short, with some reused for multiple calls but often serving exactly one call (as reflected in my benchmark).
Issue: Scaling overhead with number of silos
I expected inter-child grain communication to remain local and scale efficiently with PreferLocalPlacement or StatelessWorker, but results show that latency scales with the number of silos, regardless of placement strategy.
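For reference, this is how the two candidate strategies are declared (a minimal sketch with hypothetical grain types, not the actual benchmark code):

```csharp
// Minimal sketch (not the benchmark's actual code) of how the two
// placement strategies are declared. The grain types are hypothetical.
using System.Threading.Tasks;
using Orleans;
using Orleans.Concurrency;
using Orleans.Placement;

public interface IPreferLocalChild : IGrainWithGuidKey
{
    Task DoWork();
}

public interface IStatelessChild : IGrainWithGuidKey
{
    Task DoWork();
}

// PreferLocalPlacement: new activations are placed on the calling silo,
// unless the grain is already activated elsewhere in the cluster.
[PreferLocalPlacement]
public class PreferLocalChild : Grain, IPreferLocalChild
{
    public Task DoWork() => Task.CompletedTask;
}

// StatelessWorker: activations are always created on the calling silo
// and are not registered in the grain directory.
[StatelessWorker]
public class StatelessChild : Grain, IStatelessChild
{
    public Task DoWork() => Task.CompletedTask;
}
```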
Here’s what I observed:
- Random Placement: As expected, performance drops significantly beyond one silo; included only to establish a baseline.
- PreferLocalPlacement: Helps but does not improve performance nearly as much as expected.
- StatelessWorker: Performs much better than PreferLocalPlacement (since it guarantees local execution), but still suffers latency degradation as silo count increases.
- Custom Grain Directory: A naive dictionary-based directory showed similar performance to StatelessWorker.
Benchmark results
Small workload (10 child calls)
| Number of inner children | Number of Silos | Calls k/s (Random Placement) | Calls k/s (Prefer Local) | Calls k/s (Stateless Worker) | Calls k/s (Custom Grain Directory) |
|---|---|---|---|---|---|
| 10 | 1 | 4000 | 4200 | 3500 | 4000 |
| 10 | 2 | 670 | 880 | 1100 | 1400 |
| 10 | 3 | 515 | 710 | 1000 | 1000 |
| 10 | 6 | 390 | 550 | 800 | 780 |
Larger workload (50 child calls)
| Number of inner children | Number of Silos | Calls k/s (Random Placement) | Calls k/s (Prefer Local) | Calls k/s (Stateless Worker) | Calls k/s (Custom Grain Directory) |
|---|---|---|---|---|---|
| 50 | 1 | 1100 | 1140 | 1000 | 1250 |
| 50 | 2 | 140 | 205 | 290 | 330 |
| 50 | 3 | 110 | 160 | 230 | 240 |
| 50 | 6 | 80 | 120 | 190 | 190 |
Now, I know that I could model what my child grains do without Orleans to fix this, but my real-world case in which this is an issue is of course more complex. I need Orleans' concurrency model, and a few of my child grains are more complex (they use state, call other grains, etc.).
The benchmark is run with localhost clustering, but the behavior was first observed in our real-world application running on AKS. My question is whether this is expected behavior, and whether there is anything else that can be done to optimize latency for known-to-be-local grains?
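For context, the localhost cluster is started along these lines (a minimal sketch assuming Orleans' standard `UseLocalhostClustering` extension; the benchmark repo's actual wiring may differ, and `siloCount` is just an illustrative parameter):

```csharp
// Sketch of a multi-silo localhost cluster. Silo 0 acts as the primary;
// the others join it via the primary's silo port.
using System.Collections.Generic;
using System.Net;
using Microsoft.Extensions.Hosting;
using Orleans.Hosting;

const int siloCount = 6;
var hosts = new List<IHost>();

for (var i = 0; i < siloCount; i++)
{
    var instance = i;
    var host = Host.CreateDefaultBuilder()
        .UseOrleans(silo => silo.UseLocalhostClustering(
            siloPort: 11111 + instance,
            gatewayPort: 30000 + instance,
            primarySiloEndpoint: instance == 0
                ? null
                : new IPEndPoint(IPAddress.Loopback, 11111)))
        .Build();

    await host.StartAsync();
    hosts.Add(host);
}
```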
Have you considered managing everything in a single grain instead of splitting your workflow across child grains? This should be feasible since you have a neat hierarchy. You can still model your domain using multiple classes, but coordinate the instances in a single grain.
I doubt you can improve the performance much, since I'd indeed expect at least an order of magnitude of difference between in-memory processing and a remote procedure call to a different host.
I do not know your requirements, but beyond performance, encapsulating everything in a single grain would, in my opinion, be a more natural design that gives better control over the workflow (otherwise, invalid operations/state may be created by invoking the child grains directly). Performance would just be an additional argument in favor of this design.
I have, and the scenario pictured here could indeed easily be modelled using a single grain, but not my real-world scenario, which uses a lot of Orleans features that I would have to reimplement, which I'd love to avoid. The children also talk to each other and have associated state (though not in the benchmark).
But to be perfectly clear, I'm not talking about RPC latency to a different host; I'm strictly talking about latency between grains living on the same silo. All my child grains are guaranteed to be local to each other and have the stateless worker (or prefer local) attribute, so they should not incur any networking or clustering cost, but they somehow still do.
Is your test scenario code on github?
@ReubenBond I've just uploaded it here: https://github.com/jSharpWolf/orleans-local-grain-benchmark
Usage is described in the README. The code is structured a bit oddly so that I can host basically the same grain implementation with different placement strategies.
Thanks, @jSharpWolf. I'm taking a look.
Let me know if my interpretation of the code is correct: the number of "inner children" is the call chain depth; you make one call to each child and then discard the entire chain. Each iteration spawns an entirely new set of grains, meaning you incur the directory cost for each call/inner call.
The calls are sent from an HTTP server via an external client to the cluster. The parent grain is always randomly placed, so for each call, the client will pick a random gateway (determined per grain, parent grain ids are random in this case). The gateway silo will randomly place the parent, and the parent will use various strategies to place the children.
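In code terms, my reading is something like this (a hypothetical sketch of that interpretation, not your actual benchmark code):

```csharp
// Hypothetical sketch: the parent kicks off a chain of brand-new child
// grains; each child calls the next, then the whole chain is discarded.
using System;
using System.Threading.Tasks;
using Orleans;

public interface IParentGrain : IGrainWithGuidKey
{
    Task Run(int chainDepth);
}

public interface IChildGrain : IGrainWithGuidKey
{
    Task Continue(int remaining);
}

public class ParentGrain : Grain, IParentGrain
{
    public Task Run(int chainDepth)
        // Guid.NewGuid() => a fresh activation on every iteration, so each
        // call pays placement and (where applicable) directory registration.
        => GrainFactory.GetGrain<IChildGrain>(Guid.NewGuid()).Continue(chainDepth);
}

public class ChildGrain : Grain, IChildGrain
{
    public Task Continue(int remaining)
        => remaining <= 1
            ? Task.CompletedTask
            : GrainFactory.GetGrain<IChildGrain>(Guid.NewGuid()).Continue(remaining - 1);
}
```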
As the number of silos increases, the chance of 2 events occurring increases:
- The parent is placed on a different host from the gateway.
- The directory partition for the parent is hosted on a different node from the parent.
Similarly, as the call chain depth increases, the number of grains which need to be registered on a remote directory partition increases whenever there is more than one host (with N silos, each independent placement or registration lands on a remote host with probability (N-1)/N).
Your custom grain directory isn't distributed, so it never incurs RPC costs (and there's a separate instance of the directory per silo, which in a real system would result in duplicate activations).
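Sketched out, I'm picturing that directory as something like the following (a hedged sketch against Orleans' `IGrainDirectory` contract; exact interface members vary between Orleans versions, and this is not your actual code):

```csharp
// Naive per-silo, in-memory directory sketch. Because it is not
// distributed, lookups and registrations never leave the silo; however,
// each silo has its own view, so a real system could end up with
// duplicate activations of the same grain.
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;
using Orleans.GrainDirectory;
using Orleans.Runtime;

public sealed class LocalOnlyGrainDirectory : IGrainDirectory
{
    private readonly ConcurrentDictionary<GrainId, GrainAddress> _entries = new();

    public Task<GrainAddress> Register(GrainAddress address)
        => Task.FromResult(_entries.GetOrAdd(address.GrainId, address));

    public Task<GrainAddress> Lookup(GrainId grainId)
        => Task.FromResult(_entries.TryGetValue(grainId, out var found) ? found : null);

    public Task Unregister(GrainAddress address)
    {
        _entries.TryRemove(address.GrainId, out _);
        return Task.CompletedTask;
    }

    public Task UnregisterSilos(List<SiloAddress> siloAddresses)
    {
        // Drop all entries pointing at defunct silos.
        foreach (var pair in _entries)
        {
            if (siloAddresses.Contains(pair.Value.SiloAddress))
                _entries.TryRemove(pair.Key, out _);
        }
        return Task.CompletedTask;
    }
}
```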
Does that sound like an accurate description?
I ran a quick simulation (code here) to work out the impact if an RPC costs 25x as much as a local call, for a call-chain depth of 10 in the random parent + prefer-local children scenario. This is what it gives me, and it seems to be in line with your results:
| N (Host Count) | Relative Latency | Relative Throughput |
|---|---|---|
| 1 | 1.00x | 1.00x |
| 2 | 4.78x | 0.21x |
| 3 | 5.94x | 0.17x |
| 6 | 7.02x | 0.14x |
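The gist of the model is roughly the following (a back-of-the-envelope analytic sketch, not the linked simulation code; the 25x ratio and the (N-1)/N remote probability are the assumptions stated above):

```csharp
// Rough analytic version of the simulation: an RPC costs 25x a local
// call, the client-to-gateway hop is always an RPC, and with N silos
// each independent placement/registration lands on a remote host with
// probability (N - 1) / N. The parent is placed randomly and each
// prefer-local child pays an expected remote directory registration.
using System;

const double Rpc = 25.0, Local = 1.0;
const int ChainDepth = 10;

foreach (var n in new[] { 1, 2, 3, 6 })
{
    var pRemote = (n - 1) / (double)n;
    // Gateway hop + (parent + 10 children), each grain carrying an
    // expected remote penalty for placement/registration.
    var expected = Rpc + (1 + ChainDepth) * (Local + pRemote * (Rpc - Local));
    var baseline = Rpc + (1 + ChainDepth) * Local; // single-silo case
    Console.WriteLine($"N={n}: {expected / baseline:F2}x latency, " +
                      $"{baseline / expected:F2}x throughput");
}
```

At N=2 this analytic sketch gives about 4.67x, close to the Monte Carlo result in the table above.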
EDIT: Randomly placed parents + randomly placed children:
| N (Host Count) | Relative Latency | Relative Throughput |
|---|---|---|
| 1 | 1.00x | 1.00x |
| 2 | 6.54x | 0.15x |
| 3 | 8.19x | 0.12x |
| 6 | 9.78x | 0.10x |
This is also in line with what you saw.
Randomly placed parents + stateless worker children:
| N (Host Count) | Relative Latency | Relative Throughput |
|---|---|---|
| 1 | 1.00x | 1.00x |
| 2 | 1.64x | 0.61x |
| 3 | 1.95x | 0.51x |
| 6 | 2.04x | 0.49x |
This is not in line with what you saw; it's off by about a factor of 2.
Thanks @ReubenBond for helping me look into this. And yes, your summary above is accurate. One clarification: the parent grain is randomly placed, which is exactly what I intended for scaling purposes, but its children should always be on the same silo as the parent grain, as they are chatty by design. No child of parent X will ever need to call parent Y or any child of Y, so I can always assume they are local (which is why stateless workers fit my use case well). I tested random child placement as well, but only to establish a baseline, not because I care about it or expect it to perform better.
So, all in all, I had so far assumed at most two RPCs, with the rest being purely local calls, as you describe:
- One call going from the API to the "primary" silo it is connected to (= 1 RPC)
- As more silos join, the parent grain is randomly placed and the likelihood of it requiring an RPC to yet another silo increases (= 2 RPCs)
For the prefer-local case: to be honest, I hadn't thought about directory partitioning and the lookup it requires. It's a fair point, and it explains well why the two RPCs mentioned above might be more numerous and more expensive than I thought in that scenario.
What it doesn't really explain is why the model is off by a factor of two for stateless workers (which by itself wouldn't be a big deal), or why latency gets much worse as the number of calls between the purely local stateless-worker children increases from 10 to 50, even though that shouldn't add a single new RPC compared to when there were only 10.