azure-signalr

Scale signalr service and app server causing clients disconnect

Open Jun2014 opened this issue 6 years ago • 14 comments

We started with 2 units of SignalR Service running, with all clients connected through our app server and SignalR Service. We then scaled SignalR Service from 2 units to 5 units. During this process, all of our clients disconnected. We expect clients not to disconnect, especially when we scale up.

Something similar happens when we scale out, scale down, or restart our app server. Client connections are lost and clients cannot reconnect for quite a long time (a few minutes). This seems to be due to an app-server-side error, "Client negotiate failed: Azure SignalR Service is not connected yet, please try again later.", or possibly some other error.

Our environment: app server: Microsoft.Azure.SignalR 1.0.8, running on 1 instance of an S1 App Service plan. SignalR Service: 1 unit, Standard.

Jun2014 avatar Apr 12 '19 19:04 Jun2014

We're seeing the same thing - is this supposed to happen? When we have 50k users and we scale out we then get 50k disconnections and 50k clients simultaneously reconnecting, causing a DOS attack on our servers. This seems like a monumental flaw in the logic.

ispysoftware avatar Sep 06 '19 07:09 ispysoftware
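
The reconnect stampede described above (tens of thousands of clients reconnecting at the same moment) is commonly softened on the client side with randomized exponential backoff, so that retries spread out over time instead of arriving together. Below is a minimal sketch of that idea, not something taken from this thread or from the SignalR SDK: the names `reconnect_delay` and `reconnect_schedule` and the `base` and `cap` values are hypothetical and chosen only for illustration.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)] seconds.

    `base` and `cap` are illustrative values, not SignalR defaults.
    """
    # Clamp the exponent so very large attempt numbers cannot overflow a float.
    ceiling = min(cap, base * 2 ** min(attempt, 32))
    return rng.uniform(0.0, ceiling)


def reconnect_schedule(max_attempts, base=1.0, cap=60.0, seed=None):
    """Return the delays one client would wait before each retry."""
    rng = random.Random(seed)
    return [reconnect_delay(n, base, cap, rng) for n in range(max_attempts)]


if __name__ == "__main__":
    # Each client draws its own random delays, so retries do not line up.
    for client_id in range(3):
        delays = reconnect_schedule(5)
        print(client_id, [round(d, 2) for d in delays])
```

A real client would feed these delays into whatever reconnect hook its SignalR client library exposes. The point of the jitter is that the retry times of different clients do not coincide, which is what turns one spike into a spread-out load.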

Had this back from Microsoft:

I've checked your question: you have a connection dropout issue while scaling out. I'm very sorry, but this dropout is by design. It's because of the ASRS architecture. When you scale out, it creates a new instance with more resources to support your business, and the old resources are removed. When they are removed, all connections are lost. We're aware of this impact, but it is bound to the architecture design, so we don't have a plan to improve this functionality.

ispysoftware avatar Sep 06 '19 07:09 ispysoftware

@ispysoftware Although connections are dropped during scaling, it doesn't happen all at once. The higher the unit count you're using, the longer the whole process takes. You can think of it as a rolling drop and reconnect. By the way, we're improving the architecture to make the whole process smoother, and you can expect no connection drops in some scaling cases. E.g., scaling from 1 -> 2 units has no connection drop, but scaling from 1 -> 100 will have a smooth reconnect process.

zackliu avatar Sep 06 '19 07:09 zackliu

Any idea when that will be available? We're working on a serverless architecture with installed software clients, so we need to get this right, as updating the client code will be very difficult. I just tested here, and scaling from 1 to 2 units on serverless kicks everyone off.

ispysoftware avatar Sep 06 '19 09:09 ispysoftware

I have the same issue... How can you design something like that? When you have 50k users, you need to disconnect them all? I use this service and our number of users changes a lot, so I created an Azure Function to adapt the number of units... I hope you will fix it very soon, because migrating from 50 to 100k takes a very long time...

alexveya avatar Sep 23 '19 14:09 alexveya

@zackliu As I told you yesterday, I have a very high peak of users. I implemented an Azure Function to scale up. When I used it today (one hour ago), the service was totally out (error 500: azure signalr service is not connected yet, please try again later.). That's not really something I can call "smooth". I cannot know in advance how many users I will have, maybe 1k, maybe 50k... I need a way to auto-scale!

alexveya avatar Sep 24 '19 10:09 alexveya

@alexveya I'm working on that improvement now, but it can't be done very quickly as there is a lot of work to do. As far as I know, scaling shouldn't close all the server connections and take the service totally out. Could you send me your SignalR Service's resourceId (by email, maybe)? Then I can look into the details.

zackliu avatar Sep 24 '19 13:09 zackliu

@zackliu Today the service is not working anymore... I get a 503 error and cannot use it again. I moved back to 1 unit and it started working again. Then I moved to 50k again... more than one hour later, it is still updating. Can you help me very quickly? It's the most important day of the year for my company!

alexveya avatar Sep 25 '19 10:09 alexveya

@alexveya Could you give me your ResourceId so that I can dig deeper for you? Waiting for your email.

zackliu avatar Sep 25 '19 12:09 zackliu

This scaling issue with signalr is a massive, massive problem. If you could engineer it to actually scale to demand without disconnecting everyone you'd have a good product but as it stands from a software engineering perspective it's a total nightmare to use. A surge in demand is exactly when you don't want your service dropping out yet this is basically designed to do exactly that. How this model got through planning just boggles my brain.

ispysoftware avatar Sep 25 '19 13:09 ispysoftware

@zackliu I wrote an email yesterday... to the address on your profile

alexveya avatar Sep 25 '19 14:09 alexveya

I'm just about to perform this scale-out from 2 --> 5 units, and I really can't cause any issues for the connected users. What's the current status of this issue? Should I be worried? Any recent stories from the trenches?

jedjohan avatar Jan 14 '20 11:01 jedjohan

Hello

I still have an issue switching from 2 to 5 units. I lose all my connections (more than 1600 clients) when I scale. It takes about 30 minutes for my server to be able to connect to SignalR again.

I really need a smooth way to change the SignalR scale when I need to. Scaling from 1 to 2 units works well, but not 2 -> 5!

My utilization of SignalR is near 0 most of the year, but on maybe 50 days a year I need more. I really need something that scales automatically; I don't want to set 10 units and not use them most of the time (it's too expensive!).

How can I tell my management that I think I chose the right technology, but that Microsoft did not implement it in a good way?

Regards

alexveya avatar Mar 02 '20 07:03 alexveya

They solved the disconnection issue when scaling, according to https://github.com/Azure/azure-signalr/issues/1096#issuecomment-878387639

As for the auto-scaling feature, they are working on it; in the meantime, here are 2 ways of doing it:

  • Using Powershell function https://gist.github.com/mattbrailsford/84d23e03cd18c7b657e1ce755a36483d
  • Using Logic App https://staffordwilliams.com/blog/2019/07/13/auto-scaling-signalr-service-with-logic-apps/

Meetsch avatar Oct 06 '21 07:10 Meetsch
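
For anyone wiring up their own auto-scaling along the lines of the two links above, the part that decides which unit count to request can be sketched roughly as follows. This is only an illustration under stated assumptions. The allowed unit counts (1, 2, 5, 10, 20, 50, 100) and the figure of 1,000 concurrent connections per unit are assumptions about the Standard tier, not something stated in this thread, so check them against the current service limits. The function names, `headroom`, and `scale_down_ratio` are hypothetical. The call that actually changes the unit count is left out on purpose.

```python
import math

# Assumptions (not from this thread): unit counts the Standard tier accepts
# and the number of concurrent connections one unit can hold.
ALLOWED_UNITS = (1, 2, 5, 10, 20, 50, 100)
CONNECTIONS_PER_UNIT = 1000


def pick_unit_count(connections, headroom=0.2):
    """Smallest allowed unit count that fits the load plus some headroom."""
    needed = math.ceil(connections * (1 + headroom) / CONNECTIONS_PER_UNIT)
    for units in ALLOWED_UNITS:
        if units >= needed:
            return units
    return ALLOWED_UNITS[-1]  # load exceeds the largest size; cap at the maximum


def next_unit_count(current_units, connections, headroom=0.2, scale_down_ratio=0.5):
    """Decide the unit count to request, with hysteresis on the way down.

    Scaling can interrupt clients, so scale up as soon as it is needed but
    scale down only when the load is well below the smaller size's capacity.
    """
    target = pick_unit_count(connections, headroom)
    if target > current_units:
        return target
    if target < current_units:
        if connections < scale_down_ratio * target * CONNECTIONS_PER_UNIT:
            return target
    return current_units


if __name__ == "__main__":
    print(next_unit_count(current_units=2, connections=1700))
    print(next_unit_count(current_units=5, connections=1600))
    print(next_unit_count(current_units=5, connections=400))
```

The hysteresis on scale-down is a deliberate choice here: given the disconnects reported in this thread, it seems safer to change the unit count as rarely as possible, even at the price of paying for a larger size a little longer.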