lighthouse Broadcasting same event for a lot of subscribers doesn't scale

Broadcasting same event for a lot of subscribers doesn't scale

Open palypster opened this issue 3 years ago • 25 comments

Broadcasting the same event to a higher number of subscribers (say 100+) takes a lot of time (couple of seconds). This is due to the Lighthouse Subscriptions design - each Subscriber has its own separate channel.

In some cases this is probably not ideal. Imagine implementing a chat platform with "chat rooms" feature. If someone sends a new message, others should received it ASAP and not wait until it's delivered to every-single-one over its own channel. One channel (per room) should be used instead.

In comparison, broadcasting an event directly using Pusher console, the event is delivered to all 100 subscribers in an instant (=it doesn't depend on the number of subscribers in the channel).

Not sure if it's a bug or a matter of design decision, but I would really appreciate your view on this topic.

Thank you very much!

Dec 30 '20 22:12 palypster

Combining channels in more flexible ways seems like a great enhancement for the use case you described and others. I am not looking to implement such a feature myself, but am willing to assist and work with you on a PR.

Dec 30 '20 23:12 spawnia

We need this very much and can allocate some resources on it. Unfortunately, we're lost where to even start, so we appreciate your help. Do you have an idea, how this should be even used from the developer perspective? Can you sketch something? Is it a global config, or custom directive (something like a @channel) or just an option inside the broadcast method?

Jan 02 '21 14:01 palypster

Everything related to subscriptions is located within https://github.com/nuwave/lighthouse/tree/master/src/Subscriptions

At a glance, the Subscriber class seems to assume the 1-1 correlation between channels and subscribers: https://github.com/nuwave/lighthouse/blob/3082329767f605f6b9270be8638fdea1ff965664/src/Subscriptions/Subscriber.php#L16-L21

This attribute is accessed in multiple places. I recommend you check those and figure out what needs to change in order to support other cardinalities between subscribers and channels. We would have to separate those concepts somehow and allow a mapping between them.

Jan 04 '21 15:01 spawnia

Okay, we've been digging a bit deeper. It looks what we want actually goes against how GraphQL is designed. Each subscriber defines the graphql query specifing what he would like to receive from subscription event. Therefore server (lighthouse) HAS TO prepare the response per each subscriber separately, it cannot be just send once. We could of course somehow group subscribers with the same query or something like that, but that's not the only issue. There are other features like authorization and filtering.

So to sum it up, I think this is no go.

But, the big delay when sending the very same event for a number of identical subscribers still bothers us. Maybe we can think about other way of how to improve the speed. Maybe we can instead of synchronously iterating over the collection of Subscribers and sending the event for each of them synchronously, we can rather spawn a new job and add a bunch of workers to work in parallel to speed things up. @spawnia what do you think?

Lighthouse already supports queues, but only in a "one broadcast, one job" scenario. If we could chunk this process into smaller jobs, maybe it could have a positive impact on speed. Let us know what do you think, if it's doable or if you already see any obstacles in our way. Thank you very much for you support!

Jan 06 '21 17:01 palypster

Good analysis.

I now remember a discussion about this, resulting in the following issue: https://github.com/webonyx/graphql-php/issues/674. If we could use that to determine equivalence of queries, we might optimize that part.

Processing individual subscribers in parallel could definitely help. You could just implement \Nuwave\Lighthouse\Subscriptions\Contracts\BroadcastsSubscriptions and bind that in your service provider and swap out queueBroadcast with a more fine grained split. If the results are good, I am open to including an implementation within Lighthouse.

Jan 06 '21 23:01 spawnia

Hey, thank you. I've been doing some testing. I have extracted the pusher calls into jobs and run them in parallel - when non-chunked, the improvement was not great (I guess the overhead of the initialising the job is comparable to the pusher call itself). But when run in chunks (size 10), the improvement is dramatic (almost order of magnitude).

But it's perfect that you have came with this related issue. I'll post a comment to that thread as well. We also try to test of optimising (just very simple) the duplicate queries and see how big impact that would have and let you know.

Jan 07 '21 09:01 palypster

Chunking seems good. That can be a configurable setting. Can you create a PR?

Jan 07 '21 11:01 spawnia

I have tested the query optimisation (using proposed "naive" approach in webonyx/graphql-php#674) (I accidentaly posted a comment there, but it was not related to webonyx/graphql-php so I rather deleted it). The naive approach using dumb string == string comparison of a query and args provides dramatic improvement. Maybe even more notable then chunking the pusher calls.

We will work on a PR (maybe two separate: chunked-broadcasting and query-caching).

Jan 07 '21 16:01 palypster

Hi, we have an update. We've been using both chunked-broadcasting and query-caching to improve performance dramatically, but still, Lighthouse was sending (chunked, in parallel) an event (from cache) to every single subscriber within his own channel.

It was bothering us, so we came with a pretty simple solution we would like to discuss with you. This solution even outperform the previous ones, so they are not in the game anymore. What we did was:

each subscriber with the same query and args registers to the exact same channel, we call it a room

protected function channelName(): string
{
    return 'private-lighthouse-room-' . md5(json_encode([(string) $this->query, $this->args]));
}

when broadcasting message we will send only one message to the same channel

$queryCache = [];

$this->iterator->process(
    $subscribers,
    function (Subscriber $subscriber) use ($root, &$queryCache): void {

        $key = $subscriber->channel;

        if (array_key_exists($key, $queryCache)) {
            // we skip sending, because we have already sent this message to this shared channel
        } else {
            $data = $this->graphQL->executeQuery(
                $subscriber->query,
                $subscriber->context,
                $subscriber->args,
                $subscriber->setRoot($root),
                $subscriber->operationName
            );

            $queryCache[$key] = true;

            $this->broadcastManager->broadcast(
                $subscriber,
                $data->jsonSerialize()
            );
        }
    }
);

(Note: the code is just a prototype, for the testing purposes, few other changes were needed to actually get it working)

So now, when new GQL clients subscribe for a subscription, they will be grouped into one channel if they listen for the same query and the same args.

Early performance tests preview, that this is so fast that it can handle soo many subscribers, that we couldn't even find the number high enough to notice ANY delay (actually we reached Pusher limits on our account). So it's like thousands times faster, literally. I mean, it's not surprising of course, sending only 1 event instead of N.

Quick sidenote: funny thing is that this solution was actually hinted in one of my comments in this discussion:

It looks what we want actually goes against how GraphQL is designed. Each subscriber defines the graphql query specifing what he would like to receive from subscription event. Therefore server (lighthouse) HAS TO prepare the response per each subscriber separately, it cannot be just send once. We could of course somehow group subscribers with the same query or something like that

but I immediately cut this path as way too complicated. 😜

Anyway, what is to discuss with you is, where do we continue from here. Lighthouse could provide a way for the developer to specify whether he wants to use:

either current Subscriber - 1:1 mapped to the channel - very handy for the situation when subscription streams the events specific to the User (so every user can get different data in the event)
this new RoomSubscriber - handy for the "room"-like scenarios, where every single subscriber (listening for same query & args) gets the same data

I think it would be best, if one could use both approaches in the same app, for different subscriptions, so I think we need some directive or something to differentiate. Do you have an idea how this could work? This is where we are loosing it. Good advice could navigate us.

Thanks a lot!

Jan 22 '21 21:01 palypster

protected function channelName(): string
{
    return 'private-lighthouse-room-' . md5(json_encode([(string) $this->query, $this->args]));
}

That looks like a great start. We can improve upon this in two ways:

use a normalized query string for comparison, we should be able to grab a hold of the query AST and use the printer to get a canonical formatting
normalize the args by sorting, as per the spec the order of given args is not significant

We do differentiate between public and private fields in the @cache directive, which can be specified with the private argument. We would have to do something similar for subscriptions and differentiate on a field-by-field basis.

Consider that a subsciption can query nested data deep into the schema, which might or might not contain private fields:

subscription Public {
  news {
    title
    author {
      name
    }
  }
}

subscription Private {
  news {
    title
    author {
      name
      isFriend
    }
  }
}

The safest route would probably be to assume fields are private by default and add a directive (@public?) to mark fields which are safe to share in rooms.

To further complicate things, arguments might also play a role. They might switch a field from public to private or the other way round.

subscription Private {
  news(excludeTopics: DISLIKED_BY_ME) { ... }
}

Another option would be to predefine field selections which are safe to share per subscription. So basically the largest/deepest possible query that contains only public data. Then, we can check if the clients subscriptions are a subset thereof.

Jan 23 '21 11:01 spawnia

For what it's worth, I wouldn't get too lost in the weeds if that leads to no progress on this. Most often identical queries are going to be coming from a client implementation with the query formatted in a fixed way, and it'll keep being sent with the same formatting nuances like whitespaces or whatever.

Apollo has a concept "Automatic persisted queries" which sort of has similar nuance concerns, but they basically leave that as a client implementation concern, which I think is correct. https://www.apollographql.com/docs/apollo-server/performance/apq/

Broadening the concern to account for all possible permutations with equivalent outputs would just add implementation complexity as well as increase execution time. So basically I'd take the query string, and an object container with all of the arguments, and hash them together and use that. Don't sweat the details.

In short I think the implementation proposed by @palypster is correct. Just generate all of the channel names this way and make grouping an inherent feature of how subscriptions are implemented, I see no reason for this to not be the normal/presumed operation mode. This also avoids clients accidentally creating a whole bunch of equivalent broadcast channels through connection drops and stuff like that. There will be cases where the responses should be unique based on context (example a subscription called "me" which has no inputs and returns changes to the user's own profile info) which could just be tagged as such using a directive or something, or allow the designer to contribute custom elements to the hash generation step, though I think such a non-deterministic design for an API query is bad form.

Apr 13 '21 19:04 GregPeden

I actually think both of these approaches are valid:

current approach is valid for scenario like subscription for new notifications - you cannot use "rooms" because every user should receive his own personalised notifications
new room subscriptions are valid for a group-chat (like i.e. channels in Slack)

Not sure which one is more common and which one should be the default. I supposed the first one, because otherwise it could be a breaking change for some use cases.

We've been using the "rooms" approach in the production for couple of months and it looks very promising, no issues so far.

Apr 14 '21 07:04 palypster

I agree with @SirLamer on skipping out on the query/args normalization for now. It is not required for this feature to work, and perhaps not even optimal for performance.

As @palypster noted, both private and public subscriptions are valid use cases. It should be possible to use both in the same application.

Because of the dynamic nature of queries, I think the correct approach for deciding if a subscription is private or public is to check the queried fields. Since leaking private results in a public room would be a big security problem, my instinct is to treat fields as private by default. Only subscriptions that contain exclusively fields explitly marked as public would be eligable for rooms. How about we introduce a new directive @public to serve as a marker?

Apr 14 '21 08:04 spawnia

BTW I just changed my username from SirLamer to @gregpeden, sick of the user name from when I was 10 years old. ;)

Apr 14 '21 09:04 GregPeden

Your point on presuming it to be private is apt, or perhaps allow a global setting which defaults to private.

I don't think "public" is the correct label, for example there's nothing which prevents grouping up queries behind an auth check. Also it's possible for two people with completely no interest in each other having subscribed to the same data stream, example two people just happen to be looking at the same data chart which is subscribed to updates. I would use another label like @group or @aggregate or something. Labeling it "public" implies it bypasses auth checks which I don't think is the intention. But perhaps I am over-thinking this.

If you like the idea of a global setting then there would also be a need to label subsriptions with @private or something like this.

Apr 14 '21 09:04 GregPeden

I am against introducing such a global setting, let's stick with security by default. One thing we can do to simplify the schema definition is to allow the directive to be set on a type, propagating it to all fields within that type.

How about @shared?

Apr 14 '21 11:04 spawnia

@shared sounds right.

When I was thinking about that directive, I think the most common scenarios would be I want to place the directive on the subscription so something like this:

type Subscription {
  newNotification: Notification
  newPrivateMessage: Message
  myStockPortfolio: StockUpdate
  newGroupMessage(channel: ID): Message @shared
  stockUpdate(ticker: Ticker): StockUpdate @shared
}

Correct? Note that the types Message and StockUpdate can be both private and shared for different subscriptions. Does it makes sense to you guys?

Apr 14 '21 13:04 palypster

@palypster assuming Message and StockUpdate or not scalar values, that would not suffice. Let me illustrate the rules I had in mind, given the following schema:

type Subscription {
  neverShared: ID
  alwaysShared: ID @shared
  maybeShared: Foo @shared
}

type Foo {
  shared: ID @shared
  notShared: ID
}

Whereas the case is clear for neverShared and alwaysShared, in the case of maybeShared it depends on the field selection:

subscription SharedBecauseOnlyUsingSharedFields {
  maybeShared {
    shared
  }
}

subscription NotSharedBecauseUsingUnsharedFields {
  maybeShared {
    notShared
  }
}

Apr 14 '21 13:04 spawnia

I see. This looks more complex and robust. Can you please provide a use case for this, I mean, when you would need this behaviour?

I'm thinking that in the real world scenarios, if you want to have a shared subscription you must mark ALL fields in the whole hierarchy as shared. If you i.e. introduce new field deep down within the type tree and forget to add @shared, whole subscription will switch to private mode, without you even notice, which could break your production.

Wouldn't be easier - and I'm not talking only about the complexity of the Lighthouse implementation, but also easier to comprehend by the dev team - to make the directive available only to the root field within Subscription type? I'm not sure if it is even possible, but it would definitely simplify things, it would be much more straightforward. Simple and clean.

Don't get me wrong, I'm just questioning, so we can find the best solution.

Apr 14 '21 13:04 palypster

If you i.e. introduce new field deep down within the type tree and forget to add @shared, whole subscription will switch to private mode, without you even notice, which could break your production.

Just adding a new field would not have any effect. Only once the field is queried would the subscription switch to private mode. That would not break anything, but possibly lead to performance degradation.

Let's suppose we only required marking the root subscription and the following change happens:

type Subscription {
  newPost: Post @shared
}

type Post {
  title: String
+ comments: [Comment]
}

type Comment {
  content: String
  author: User
}

type User {
  socialSecurityNumber: ID
}

The change seems harmless and it is not obvious at all it is going to cause your application to leak private user data. That is why I strongly believe that we must make @shared opt-in per field.

Apr 14 '21 13:04 spawnia

Wait a sec, this is not a good example. In your case even if the socialSecurityNumber is not @shared field, it still is exposed to EVERYONE. The only difference is that it will come trough separate pusher channels instead of one shared. So this is not a good example.

If all your fields are built in a REST-like manner, so that it always returns the same data, regardless on who's making the request (apart from authorization), then all these fields can be @shared.

The only fields that could not be @shared are fields whose response is different for a different user. Think about all those fields like me or myProfile or myNotifications, these are the only ones, I think. Or am I missing something?

Apr 14 '21 13:04 palypster

There are some things omitted from this schema.

In an actual application, socialSecurityNumber would only be accessible to authenticated users, and users would only be able to query the field for themselves. Even with those rules in place, in a subscription there might be leaked information. If the first user to resolve the subscription is coincidentally also a comment author, they would be able to access the field. Others would then receive exactly their data, without requiring authorization.

Apr 14 '21 14:04 spawnia

You're right. Okay, I'm with you.

Apr 14 '21 14:04 palypster

After thinking this through for a bit, I do get your point about marking entire subscriptions with @shared. We would have to put a big disclaimer on it, explaining that the nested fields must never hold private data. A benefit of this would be that there are no unexpected performance cliffs from clients adding a field. A potential downside is that this forces that part of the schema to be isolated and not tap into the entire data graph.

There is definitely a tradeoff to be had here. Perhaps there are even solutions orthogonal to the ideas discussed so far that circumvent this problem entirely or solve it more elegantly.

Apr 14 '21 16:04 spawnia

The problem you demonstrated actually is present also in the regular Queries, not only for Subscriptions (altough you're proposing to solve it by authorizing the field itself - I guess that on an accessor level of an Eloquent model, but I find it too complicated).

Maybe better approach is to separate GQL types of "User", so there is no risk of this kind of a leak:

type Subscription {
  newPost: Post @shared
}

type Post {
  title: String
 comments: [Comment]
}

type Comment {
  content: String
  author: User
}

type User {
   # note, that only shared fields are here
   first_name: String
   last_name: String
}

type MyProfile { # this can be the same Eloquent model as a User type, but it exposes some private fields
   first_name: String
   last_name: String
   socialSecurityNumber: ID
}


type Query {
  myProfile: MyProfile  # this query returns data based on current user
  post(id: ID): Post       # this one doesn't
}

We use this kind of approach a lot. I mean, we're not afraid of creating different GQL types for the same data, for different purposes.

But I understand that a developer can choose your approach, but yes, like you said in your last comment, maybe if we just put disclaimer on it, it would be enough and we can keep it more simple and clean.

Apr 15 '21 10:04 palypster

lighthouse lighthouse copied to clipboard

Broadcasting same event for a lot of subscribers doesn't scale

lighthouse
lighthouse copied to clipboard