Use QueueChanSubscribe to support horizontal scaling on services using go-res

Open raphaelpereira opened this issue 5 years ago • 4 comments

Hi!

Today go-res subscribes to NATS using ChanSubscribe, and the documentation states that only one service can provide a resource.

My suggestion is to use QueueChanSubscribe with the service name as the queue name, so that any number of instances of a go-res service can be started and NATS will automatically load-balance between them.
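To illustrate the difference in delivery semantics without needing a NATS server, here is a minimal in-memory sketch. It is not the nats.go API: `broker`, `subscribe`, `queueSubscribe`, and `publish` are hypothetical names, and the subject and queue name are made up. A plain subscriber sees every message, while members of a queue group share them. NATS chooses the receiving member itself; round-robin is used here only to keep the result deterministic.

```go
package main

import "fmt"

// broker is a hypothetical in-memory stand-in for a message bus.
// It only models delivery semantics; it is not the nats.go API.
type broker struct {
	plain  map[string][]chan string            // subject -> plain subscribers
	queues map[string]map[string][]chan string // subject -> queue name -> members
	next   map[string]int                      // round-robin cursor per subject/queue
}

func newBroker() *broker {
	return &broker{
		plain:  map[string][]chan string{},
		queues: map[string]map[string][]chan string{},
		next:   map[string]int{},
	}
}

// subscribe registers a plain subscriber, which receives every message.
func (b *broker) subscribe(subject string, ch chan string) {
	b.plain[subject] = append(b.plain[subject], ch)
}

// queueSubscribe registers a member of a queue group. Each message on
// the subject goes to exactly one member of the group.
func (b *broker) queueSubscribe(subject, queue string, ch chan string) {
	if b.queues[subject] == nil {
		b.queues[subject] = map[string][]chan string{}
	}
	b.queues[subject][queue] = append(b.queues[subject][queue], ch)
}

// publish delivers msg to all plain subscribers and to one member of
// each queue group on the subject.
func (b *broker) publish(subject, msg string) {
	for _, ch := range b.plain[subject] {
		ch <- msg
	}
	for queue, members := range b.queues[subject] {
		key := subject + "|" + queue
		members[b.next[key]%len(members)] <- msg
		b.next[key]++
	}
}

func main() {
	b := newBroker()
	// Two instances of the same service join the queue group "example".
	inst1, inst2 := make(chan string, 4), make(chan string, 4)
	b.queueSubscribe("call.example.list.add", "example", inst1)
	b.queueSubscribe("call.example.list.add", "example", inst2)
	for _, m := range []string{"r1", "r2", "r3", "r4"} {
		b.publish("call.example.list.add", m)
	}
	// Each instance received half of the requests.
	fmt.Println(len(inst1), len(inst2))
}
```

With four requests and two group members, each instance ends up with two requests, which is the load balancing the suggestion is after.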

raphaelpereira, Feb 08 '19 11:02

Thanks for the input and the suggestion!

The sort of load balancing you suggest is doable for static resources, and the documentation should perhaps be updated to clarify that this is the case.

But for resources that might change state, load balancing over a single NATS (cluster) will cause problems.

Assume we have a Resgate, a NATS server, and 2 instances of a service, and the service has a collection, example.list, which looks like this:

[ "one", "two", "three" ]

Let's say two call requests are sent at almost the same time, to add "four" and "five" to this list, and assume these requests are balanced onto our two different instances. Without going into details, I guess you see the race issue. Even if these services have a way of synchronizing the updates, through a shared database or something similar, there is no guarantee that the add events will arrive at Resgate in the same order they were applied in the database, unless these events are sent out by a single instance. Thus, we risk ending up with a corrupt state in Resgate's cache, and subsequently in the clients.
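A small simulation of that ordering problem, as a sketch only. The `addEvent` type and the `applyAdd` and `replay` helpers are hypothetical and not part of Resgate or go-res; the sketch also assumes, for illustration, that both inserts target index 3.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// addEvent mirrors the shape of an indexed add event: a value and the
// index at which it was inserted. (Hypothetical type for this sketch.)
type addEvent struct {
	Value string
	Idx   int
}

// applyAdd inserts ev.Value at ev.Idx, the way a cache would when it
// receives an add event.
func applyAdd(list []string, ev addEvent) ([]string, error) {
	if ev.Idx < 0 || ev.Idx > len(list) {
		return list, errors.New("index out of range")
	}
	out := make([]string, 0, len(list)+1)
	out = append(out, list[:ev.Idx]...)
	out = append(out, ev.Value)
	out = append(out, list[ev.Idx:]...)
	return out, nil
}

// replay applies the events, in arrival order, to the initial list and
// returns the result as a comma-separated string.
func replay(events []addEvent) (string, error) {
	list := []string{"one", "two", "three"}
	for _, ev := range events {
		var err error
		if list, err = applyAdd(list, ev); err != nil {
			return strings.Join(list, ","), err
		}
	}
	return strings.Join(list, ","), nil
}

func main() {
	arrivals := [][]addEvent{
		// Same order as the shared store applied the inserts.
		{{"four", 3}, {"five", 3}},
		// Same two events, arriving in the opposite order.
		{{"five", 3}, {"four", 3}},
	}
	for _, events := range arrivals {
		list, err := replay(events)
		fmt.Println(list, err)
	}
}
```

The first replay ends with one,two,three,five,four, matching the store. The second ends with one,two,three,four,five: no error is raised, but the cache has silently diverged. If the second insert had targeted index 4 instead, the swapped order would fail with an out-of-range index.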

Instead, the type of horizontal scaling currently supported by Resgate and the RES protocol is to create multiple instances of the entire setup. So, you will have 2 x (Resgate+NATS+Service). Each Resgate would then be guaranteed to get its example.list events from a single service, and all that is left to do is to add a way for the services to synchronize.

I haven't had time or need to create it yet, or think it through completely, but I believe this synchronization would be possible to do in a rather generic/agnostic way, similar to the way Resgate is generic/agnostic.

jirenius, Feb 09 '19 22:02

As I understand it, the problem actually lies in the fact that Resgate keeps track of collections by means of indexed add/remove events.

When I first came into contact with Resgate, I found it difficult to understand why collections are tracked based on their order, as this raises much more complex issues, for example in query management. How can I infer the position of an object when it is removed using a query, if it is potentially part of another collection without a query or with a different one? Today I explicitly request a Reset on that resource (which is costly), but I could not think of another way without implementing a query order table.

Why does the Add event receive an index instead of a Resource ID? If you use a resource ID, there will be an ordering problem, but it could be solved with a specific ordering function applied to the collection on every event. This would also work for Remove events, and I presume it would allow the kind of scaling I am proposing without much impact.

I might be saying a lot of sh*t here too, sorry if that is the case.

raphaelpereira, Feb 10 '19 01:02

Race issue with Models

The race problem is unfortunately not limited to collections. A model (e.g. { "msg": "Hello, World!" }) that gets two set requests (e.g. first { "msg": "Hello, Sweden!" }, then { "msg": "Hello, Brazil!" }) might see the resulting events race on their way to Resgate. Even if "Hello, Brazil!" was the second to be applied in the database, and should be the last to be applied in Resgate, the change event resulting from the "Hello, Sweden!" call might be slower and reach Resgate last, causing corrupt state.

Using version sequencing

The only way I know to solve this is by introducing version sequencing. I did consider it at an early stage, but dismissed the idea as it would add additional complexity to building microservices.
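As a rough sketch of what version sequencing could look like on the receiving side: the `Version` field, `changeEvent`, and `cachedModel` below are all hypothetical and are not described as part of the RES protocol in this thread. The idea is that each change gets a version when it is committed to the shared store, and the cache ignores anything older than what it already holds.

```go
package main

import "fmt"

// changeEvent is a hypothetical change event carrying a version number
// assigned when the change was committed to the shared store.
type changeEvent struct {
	Version int
	Msg     string
}

// cachedModel is what a gateway-side cache might hold for one model.
type cachedModel struct {
	Version int
	Msg     string
}

// apply accepts an event only if it is newer than the cached state.
// It reports whether the event was applied.
func (m *cachedModel) apply(ev changeEvent) bool {
	if ev.Version <= m.Version {
		return false // stale or duplicate event: ignore it
	}
	m.Version, m.Msg = ev.Version, ev.Msg
	return true
}

func main() {
	m := &cachedModel{Version: 1, Msg: "Hello, World!"}
	// "Brazil" was committed last (version 3) but arrives first.
	fmt.Println(m.apply(changeEvent{Version: 3, Msg: "Hello, Brazil!"}))
	// "Sweden" was committed earlier (version 2) but arrives late.
	fmt.Println(m.apply(changeEvent{Version: 2, Msg: "Hello, Sweden!"}))
	fmt.Println(m.Msg)
}
```

Dropping stale events like this is only enough when every event carries the full new value. Partial changes, or indexed collection events, would presumably need buffering until the missing versions arrive, which is part of the added complexity.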

Instead, I saw that horizontal scaling could rather be done by letting different clients get their data served from different services, while each single client gets all its events from a single source for as long as the connection remains.

Collection reference by index

In an early version of Resgate, I did use id to reference items in a collection. But why did I move away from that?

With id, I could only use Collections for lists of models with unique items. This would limit their usage, and make them less like arrays. And I realized I wouldn't lose any benefits from using index instead within the protocol. Let me explain.

Almost always, you want to refer to items in a collection by id, not by index. This is why ResClient has an optional idCallback property for collections, so that you can do:

var mymodel = mycollection.get(42)

And when you make a call to remove an item, you should remove it by id and not by index, or else you might request to remove item nr 1, which by the time the request reaches the service might have changed since someone else also removed item nr 1. So, you end up with both 1 and 2 removed. But the RES protocol doesn't prevent you from using id as parameter in calls. It only requires you to use index when sending add and remove events.
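To make the difference concrete, here is a sketch of a service-side remove handled by index versus by id. The `item` type and the helper functions are hypothetical. `removeByID` returns the index it found, which is what the service would then put in the remove event.

```go
package main

import (
	"fmt"
	"strings"
)

// item is a hypothetical collection entry with a unique id.
type item struct {
	ID   int
	Name string
}

// removeAt returns a copy of list without the item at idx.
// An out-of-range idx leaves the list unchanged.
func removeAt(list []item, idx int) []item {
	if idx < 0 || idx >= len(list) {
		return list
	}
	return append(append([]item{}, list[:idx]...), list[idx+1:]...)
}

// removeByID looks up the current index of id and removes that item.
// The second return value is the index to send in the remove event,
// or -1 if no item has that id.
func removeByID(list []item, id int) ([]item, int) {
	for i, it := range list {
		if it.ID == id {
			return removeAt(list, i), i
		}
	}
	return list, -1
}

// names renders the list as a comma-separated string of names.
func names(list []item) string {
	s := make([]string, len(list))
	for i, it := range list {
		s[i] = it.Name
	}
	return strings.Join(s, ",")
}

func main() {
	list := []item{{10, "a"}, {11, "b"}, {12, "c"}}

	// Two clients both ask to remove "b" by its index, 1.
	byIdx := removeAt(removeAt(list, 1), 1)
	fmt.Println(names(byIdx))

	// The same two requests, sent with the id of "b" instead.
	byID, first := removeByID(list, 11)
	byID, second := removeByID(byID, 11)
	fmt.Println(names(byID), first, second)
}
```

By index, both "b" and "c" are gone and only "a" remains. By id, the second request finds nothing to remove (index -1), so the list stays at a,c and no second remove event needs to be sent.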

Using models as unordered lists

If you want unordered lists with id's, you can use a model to represent it.

Look at the Dynamic tab in the resgate-test-app for an example. The service is in dynamic.js.
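For a rough idea of why a key-based model behaves differently from an indexed collection, here is a sketch using a plain Go map. The `change` type and the helpers are hypothetical and are not taken from dynamic.js. Since each update names its key, updates to different keys give the same result in either arrival order.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// change is a hypothetical key-based update: set Key to Value, or
// delete Key when Delete is true.
type change struct {
	Key    string
	Value  string
	Delete bool
}

// applyChange applies one key-based update to the model.
func applyChange(model map[string]string, c change) {
	if c.Delete {
		delete(model, c.Key)
		return
	}
	model[c.Key] = c.Value
}

// render lists the entries sorted by key so that the output is stable.
func render(model map[string]string) string {
	keys := make([]string, 0, len(model))
	for k := range model {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, len(keys))
	for i, k := range keys {
		parts[i] = k + "=" + model[k]
	}
	return strings.Join(parts, ",")
}

// replay applies the changes, in arrival order, to a fresh model.
func replay(changes []change) string {
	m := map[string]string{"1": "one", "2": "two", "3": "three"}
	for _, c := range changes {
		applyChange(m, c)
	}
	return render(m)
}

func main() {
	four := change{Key: "4", Value: "four"}
	five := change{Key: "5", Value: "five"}
	fmt.Println(replay([]change{four, five}))
	fmt.Println(replay([]change{five, four}))
}
```

Both lines print the same five entries. Two updates to the same key would still race, as described for models above.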

Queries

Are you using query resources? Respect! 😃 I am actually quite impressed by how well you know the workings of the protocol.

Queries are the trickiest part of the RES protocol. You can either solve them the easy but more costly way, by using system.reset, or you can use query events. There is an example of the latter as well, in the resgate-test-app's notes.js. But query events need a big guide section of their own on the coming website.

But I didn't fully understand your question on how to "infer the position of an object". Could you give an example?

jirenius, Feb 10 '19 08:02

After consideration, there might be times when you actually need to use ChanQueueSubscribe to allow multiple instances of the same service on a single NATS.

However, when running multiple instances of the same service, you may no longer use any event that describes mutations (change, add, remove), but must rely entirely upon the reset (or system.reset) event.

Changing to ChanQueueSubscribe, though, might cause issues for those using go-res with a sharded solution.

I would rather see it being possible to, in the handler, configure if it should be using the ChanQueueSubscribe or just ChanSubscribe.
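Something along these lines, as a sketch only. The `conn` interface, the `options` struct, and its `QueueGroup` field are hypothetical and are not existing go-res API. The two method names follow nats.go, but the parameter and return types are reduced here so the example runs without a NATS connection, and the subject is just illustrative.

```go
package main

import "fmt"

// conn is a hypothetical, simplified view of a message bus connection.
// The method names follow nats.go, but the types are reduced for this
// sketch.
type conn interface {
	ChanSubscribe(subject string, ch chan []byte) error
	ChanQueueSubscribe(subject, queue string, ch chan []byte) error
}

// options is a hypothetical setting on the service or handler.
// An empty QueueGroup means a plain subscription.
type options struct {
	QueueGroup string
}

// subscribe picks the subscription style based on the options.
func subscribe(c conn, subject string, opt options, ch chan []byte) error {
	if opt.QueueGroup == "" {
		return c.ChanSubscribe(subject, ch)
	}
	return c.ChanQueueSubscribe(subject, opt.QueueGroup, ch)
}

// recorder is a fake conn that records which call was made.
type recorder struct{ calls []string }

func (r *recorder) ChanSubscribe(subject string, ch chan []byte) error {
	r.calls = append(r.calls, "plain:"+subject)
	return nil
}

func (r *recorder) ChanQueueSubscribe(subject, queue string, ch chan []byte) error {
	r.calls = append(r.calls, "queue:"+subject+":"+queue)
	return nil
}

func main() {
	r := &recorder{}
	ch := make(chan []byte, 1)
	// Default: plain subscription, safe for sharded setups.
	_ = subscribe(r, "call.example.>", options{}, ch)
	// Opt-in: several instances share requests through a queue group.
	_ = subscribe(r, "call.example.>", options{QueueGroup: "example"}, ch)
	fmt.Println(r.calls)
}
```

Keeping the plain subscription as the default would leave existing sharded deployments untouched, while services that only emit reset events could opt in to the queue group.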

jirenius, Sep 06 '19 08:09