azure-signalr
Scale signalr service and app server causing clients disconnect
We first have 2 units of SignalR Service running, with all clients connected to our app server and SignalR Service. We then scale SignalR Service from 2 units to 5 units. During this process, all our clients disconnect. We expect that clients should not disconnect, especially when we scale up.
Something similar happens when we scale out, scale down, or restart our app server. Client connections are lost and cannot reconnect for a fairly long time (a few minutes). This seems to be due to an app-server-side error, "Client negotiate failed: Azure SignalR Service is not connected yet, please try again later.", or possibly some other error.
Our environment: app server: Microsoft.Azure.SignalR 1.0.8, running on 1 instance of an S1 App Service plan. SignalR Service: 1 Unit Standard.
We're seeing the same thing. Is this supposed to happen? When we have 50k users and we scale out, we get 50k disconnections followed by 50k clients reconnecting simultaneously, which amounts to a DoS attack on our own servers. This seems like a monumental flaw in the logic.
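One client-side way to soften a reconnect stampede like the one described above is to spread the retries out with randomized ("full jitter") exponential backoff, so that clients dropped at the same moment do not all come back at the same moment. The following is a minimal sketch, not something taken from this thread: the function name `createJitteredRetryPolicy` and the default values (1 s base, 60 s cap, 20 retries) are illustrative assumptions. The returned object has the same shape as the retry policy accepted by the JavaScript SignalR client's `withAutomaticReconnect`, for client versions that support it.

```typescript
// Sketch only: full-jitter exponential backoff for client reconnects.
// baseMs, capMs and maxRetries are illustrative values, not recommendations
// from the SignalR team.

interface RetryContextLike {
  // Number of reconnect attempts already made for the current outage.
  previousRetryCount: number;
}

function createJitteredRetryPolicy(
  baseMs: number = 1000,
  capMs: number = 60000,
  maxRetries: number = 20,
  random: () => number = Math.random, // injectable so the policy can be tested
) {
  return {
    // Returns the delay before the next attempt, or null to stop retrying.
    nextRetryDelayInMilliseconds(ctx: RetryContextLike): number | null {
      if (ctx.previousRetryCount >= maxRetries) {
        return null;
      }
      // The ceiling doubles with every attempt, up to capMs.
      const ceiling = Math.min(capMs, baseMs * Math.pow(2, ctx.previousRetryCount));
      // Full jitter: pick a uniformly random delay in [0, ceiling).
      return Math.floor(random() * ceiling);
    },
  };
}
```

Because each client picks its delay at random, 50k clients that are dropped together are spread over the backoff window instead of all hitting negotiate at once. This does not prevent the disconnect itself; it only reduces the load spike afterwards.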
Had this back from Microsoft:
I've checked your question: you have a connection dropout issue while scaling out. I'm very sorry, but this dropout is by design. It's because of the ASRS architecture. When scaling out, it creates a new instance with more resources to support your business, and the old resources are removed. When they are removed, all connections are lost. We're aware of this impact, but it is bound to the architecture design, so we don't have a plan to improve this functionality.
@ispysoftware Although connections are dropped during scaling, it doesn't happen all at once. The higher the unit count you're using, the longer the whole process takes. You can think of it as a rolling drop and reconnect. By the way, we're improving the architecture to make the whole process smoother, and you can expect no connection drop in some scaling cases. E.g. scaling from 1 -> 2 units has no connection drop, but 1 -> 100 will have a smooth reconnect process.
Any idea when that will be available? We're working on a serverless architecture with installed software clients, so we need to get this right, as updating the client code will be very difficult. I just tested here, and scaling 1 -> 2 on serverless kicks everyone off.
I have the same issue... How can you design something like that? When you have 50k users, you need to disconnect them all? I use this service and we have a highly variable number of users, so I created an Azure Function to adapt the number of units... I hope you will fix it very soon... because migrating from 50 to 100k takes a very long time...
@zackliu As I told you yesterday, I have very high peaks of users. I implemented an Azure Function to scale up. With the usage I had today (one hour ago), the service was totally out (error 500: azure signalr service is not connected yet, please try again later.). It's not really something I can call "smooth". I cannot know in advance how many users I will have, maybe 1k, maybe 50k... I need a way to auto scale!

@alexveya I'm working on that improvement now, but it can't be very fast, as there is a lot of work to do. As far as I know, scaling shouldn't close all the server connections and take the service totally out. Could you send me your SignalR Service's resourceId (by email, maybe) so I can look into the details?
@zackliu Today, the service is not working anymore... I get a 503 error and cannot use it at all. I moved back to 1 unit and it started working again. Then I moved again to 50k... more than one hour later, it is still updating. Can you help me very quickly? It's the most important day of the year for my company!
@alexveya Could you give me your ResourceId, so I can dig deeper for you? Waiting for your email.
This scaling issue with signalr is a massive, massive problem. If you could engineer it to actually scale to demand without disconnecting everyone you'd have a good product but as it stands from a software engineering perspective it's a total nightmare to use. A surge in demand is exactly when you don't want your service dropping out yet this is basically designed to do exactly that. How this model got through planning just boggles my brain.
@zackliu I wrote an email yesterday... to the email address on your profile
Just about to perform this scale out from 2 --> 5 units, and I really can't cause any issues for the connected users. What's the current status of this issue? Should I be worried? Any recent stories from the trenches?
Hello
I still have an issue switching from 2 to 5 units. I lose all my connections (more than 1600 clients) when I scale. It takes about 30 minutes for my server to be able to connect to SignalR again.
I really need a smooth way to change the SignalR scale when I need to. Scaling from 1 to 2 units works well, but not 2 -> 5!
My utilization of SignalR is near 0 for most of the year, but on maybe 50 days a year I need more. I really need something that scales automatically; I don't want to set 10 units and leave them unused most of the time (it's too expensive!).
How can I tell my management that I think I chose the right technology, but Microsoft did not implement it well?
Regards
(Sent by email in reply to Johan Eriksson's comment of Tuesday, 14 January 2020, above, on Azure/azure-signalr issue #486.)
They solved the disconnection issue when scaling, according to https://github.com/Azure/azure-signalr/issues/1096#issuecomment-878387639
As for the auto-scaling feature, they are working on it. In the meantime, here are 2 ways of doing it:
- Using Powershell function https://gist.github.com/mattbrailsford/84d23e03cd18c7b657e1ce755a36483d
- Using Logic App https://staffordwilliams.com/blog/2019/07/13/auto-scaling-signalr-service-with-logic-apps/
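Whichever of the two approaches is used, the scaler needs some rule for choosing a unit count from the current connection count. Below is a rough sketch of such a rule in TypeScript. It is not taken from either link: the function names are hypothetical, and both the 1,000 connections per unit figure and the list of selectable unit counts are assumptions that should be checked against the current pricing tier documentation. Since scaling may drop connections, as discussed in this thread, the sketch scales up promptly but scales down only when there is a wide margin, to avoid flapping between sizes.

```typescript
// Assumptions (not from this thread): each unit handles about 1,000
// concurrent connections, and only these unit counts can be selected.
const CONNECTIONS_PER_UNIT = 1000;
const ALLOWED_UNITS = [1, 2, 5, 10, 20, 50, 100];

// Smallest allowed unit count that fits the load plus some headroom.
function pickUnitCount(currentConnections: number, headroom: number = 0.3): number {
  const needed = Math.ceil((currentConnections * (1 + headroom)) / CONNECTIONS_PER_UNIT);
  for (const units of ALLOWED_UNITS) {
    if (units >= needed) {
      return units;
    }
  }
  // Load exceeds the largest size; return the maximum available.
  return ALLOWED_UNITS[ALLOWED_UNITS.length - 1];
}

// Decide the unit count to apply, given what is currently provisioned.
function decideUnits(currentUnits: number, connections: number): number {
  const target = pickUnitCount(connections);
  if (target > currentUnits) {
    return target; // scale up as soon as the load requires it
  }
  // Scale down only if even twice the current load would fit the smaller size.
  const conservative = pickUnitCount(connections * 2);
  return conservative < currentUnits ? conservative : currentUnits;
}
```

For example, `decideUnits(2, 1600)` asks for 5 units, while `decideUnits(5, 900)` stays at 5 rather than dropping back and risking another round of disconnects if the load doubles again.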