orleans
Timeout error from silo when number of connections goes beyond 10k
Hi Team, we have the following services:
- Microsoft SignalR Hub service.
- Orleans as a backplane for SignalR (we created the Silo host and registered the SignalR hub as a silo client).
We have deployed these apps on EKS and assigned 6 CPU cores and 6 GB of memory to the Silo pod; the hub is provisioned more moderately, with 2 CPU cores and 2 GB of memory. We are running performance tests to figure out the optimum load a single silo can handle.
With this setup in place, we get time-out errors from the Silo side while making the ClientGrain connections once our SignalR connection count goes beyond 10k. On the Silo dashboard we observed that latency increases massively after a certain number of connections.
Can you help me come up with the optimum number of connections a single silo should support, and the optimum resources (memory and CPU) we should assign to the Silo pod?
Note: I have used two silo hosts to share the load with LOAD_SHEDDING_LIMIT set to 30.
Your help will be highly appreciated, as we are currently stuck at 10k connections.
Am I correctly assuming that you have 10K ClusterClients? Orleans clusters usually only have a small number of connections. You are supposed to create one ClusterClient per host and keep it around for the lifetime of your application host.
Hi @kishor-raskar We have added the "Needs: Author Feedback" label to this issue, which indicates that we have an open question for you before we can take further action. This issue will be closed automatically in 7 days if we do not hear back from you by then - please feel free to re-open it if you come back to this issue after that time.
Hi @ReubenBond, we do not actually have 10k ClusterClients. We have one Silo Server instance and one SignalR hub instance, which we register as a ClusterClient with the ClientBuilder class:
The thing is, when we try to connect around 10k SignalR clients to the hub, Orleans internally tries to create 10k client grains and registers each Client grain with an identifier like messageHub:connectionId, where messageHub is the hub instance name and connectionId is the SignalR client connection id, like this:
and we can see the Client grain on Dashboard:
After 10k such connections we get time-out errors from the Silo host side. Can you please help us with the proper configuration for this?
What is "ClientGrain"? If it's a part of Orleans, I'm not remembering it. Are you using a 3rd party SignalR integration package?
I guess it is this one: https://github.com/OrleansContrib/SignalR.Orleans/blob/2460d5d7340edd31cf5bf76e1bed170b5623f9eb/src/SignalR.Orleans/Clients/ClientGrain.cs
Thanks, @SebastianStehle
@kishor-raskar could you help us to understand the nature of these SignalR connections? Are they essentially idle connections, or are they active? What communication patterns are being used here, is every message being broadcast to 10K listeners?
cc @benjaminpetit - this may be another case where efficient broadcast would be helpful
Hi @ReubenBond, Thank you for your reply.
Basically, we have one or more publishers (which are themselves SignalR clients) that publish events or text messages to all other clients (we create Agent grains for those clients to persist their state). So one publisher sends messages to 10k agents at a configurable interval (e.g. each publisher sends a message to all 10k Agents every 30 seconds). The clients also send heartbeat messages to the SignalR Hub to keep their connections alive, again at a configurable interval (e.g. every 10 seconds).
Please let me know if you need more information from my side. If needed, I will share a small POC of our implementation, which might help you come up with better suggestions.
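To put rough numbers on the load described above, here is a back-of-the-envelope sketch. It assumes every broadcast results in one message per agent and every client sends exactly one heartbeat per interval; the function name and parameters are made up for illustration, and whether heartbeats actually reach the silo depends on the integration package.

```python
def messages_per_second(agents, publishers, publish_interval_s, heartbeat_interval_s):
    """Estimate steady-state message rates (hypothetical model, not measured data)."""
    # Each publisher sends one message to every agent once per publish interval.
    broadcast = publishers * agents / publish_interval_s
    # Each connected client sends one heartbeat per heartbeat interval.
    heartbeat = agents / heartbeat_interval_s
    return {"broadcast": broadcast, "heartbeat": heartbeat, "total": broadcast + heartbeat}


# Figures from the description: 10k agents, 1 publisher, 30 s publish, 10 s heartbeat.
rates = messages_per_second(agents=10_000, publishers=1,
                            publish_interval_s=30, heartbeat_interval_s=10)
print(rates)
```

Under these assumptions the heartbeats (about 1,000 per second) outweigh the broadcast traffic (about 333 per second for one publisher), so the heartbeat interval may be worth revisiting alongside the publish interval.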
Apologies for the delayed response. How many publishers are there here? I'm trying to understand the raw messaging magnification which is required. It sounds like you should not publish from grains directly to each of these observers but instead publish only once per observing server and then fan out to the clients. That will likely reduce the amount of messaging required significantly and allow you to scale much further.
We're thinking of baking this pattern ("efficient broadcast") into Orleans, likely in the BroadcastChannel package (formerly known as SMS).
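The difference between publishing to every observer and publishing once per server can be illustrated with a small model. This is only a sketch: `HubServer`, `publish_direct`, and `publish_per_server` are hypothetical names, not Orleans or SignalR types, and the two-server split is an assumption chosen for the example.

```python
class HubServer:
    """Stand-in for one SignalR hub host (hypothetical, not an Orleans type)."""

    def __init__(self, name):
        self.name = name
        self.connections = {}  # connection id -> list of received messages

    def connect(self, connection_id):
        self.connections[connection_id] = []

    def deliver_local(self, message):
        # Local fan-out: in-process delivery, no remote call per client.
        for inbox in self.connections.values():
            inbox.append(message)


def publish_direct(servers, message):
    """Model one remote message per connection; returns the remote message count."""
    remote = 0
    for server in servers:
        for inbox in server.connections.values():
            inbox.append(message)  # each append stands for a separate remote call
            remote += 1
    return remote


def publish_per_server(servers, message):
    """Model one remote message per server, followed by local fan-out."""
    remote = 0
    for server in servers:
        server.deliver_local(message)
        remote += 1
    return remote


# Assumed topology: 10k connections split evenly across two hub servers.
servers = [HubServer("hub-0"), HubServer("hub-1")]
for i in range(10_000):
    servers[i % 2].connect(f"conn-{i}")

direct_count = publish_direct(servers, "m1")
fanout_count = publish_per_server(servers, "m2")
print(direct_count, fanout_count)  # 10000 2
```

Every connection still receives both messages; only the number of remote messages changes, from one per connection to one per server.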
Thank you @ReubenBond for your reply. Ideally, we would like to have 1 publisher for around 10k subscribers (Agents in our case; the number of agents per publisher is what we want to optimize). However, we cannot reach the stage where the publisher sends messages to subscribers, because it fails while the connections to the SignalR hub are being formed (each of which in turn forms a connection with the Silo through ClientGrain). When the number of connections goes beyond 10k, we observe increased latency from the Silo: it doesn't respond within 10 sec, causing time-out errors that prevent the remaining connections from completing.
Is this BroadcastChannel package available with Orleans 4.0, or is it still in the pipeline?
We have a similar problem. The scenario is very simple: broadcast a message to all online player grains. When we use MemoryStreamProvider / SMSProvider to implement this, subscribing to a global stream when a player grain is activated and unsubscribing when it is deactivated, every subscribe / unsubscribe makes the PubSubRendezvous grain call WriteStateAsync, which writes all the producer and subscriber handles to the database. If there are 10k players online and 1k players leave within several seconds, the database load becomes very heavy. BroadcastChannel, on the other hand, only supports implicit subscriptions. Is there any workaround to make it broadcast to all online player grains?
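A rough count shows why this gets expensive. The sketch below assumes, as described above, that each unsubscribe triggers one write that re-serializes the full list of remaining subscriber handles; producer handles and any batching are ignored, and `handles_written` is a made-up helper, not an Orleans API.

```python
def handles_written(initial_subscribers, leavers):
    """Total subscriber handles serialized while `leavers` players unsubscribe one by one."""
    total = 0
    remaining = initial_subscribers
    for _ in range(leavers):
        remaining -= 1
        total += remaining  # the whole remaining list is written after each unsubscribe
    return total


total_handles = handles_written(10_000, 1_000)
print(total_handles)  # 9499500
```

Under these assumptions, 1k departures from a 10k-player stream cause 1,000 writes that together serialize roughly 9.5 million handles, which is consistent with the heavy database load reported.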