matrix-spec-proposals icon indicating copy to clipboard operation
matrix-spec-proposals copied to clipboard

MSC2190: Allow appservice bots to use /sync

Open tulir opened this issue 4 years ago • 28 comments

Rendered

Signed-off-by: Tulir Asokan <[email protected]>

tulir avatar Jul 24 '19 19:07 tulir

afaik it was disabled only because nobody used it and it bitrotted, so rather than fix something nobody used, it was removed, especially as it made /sync even more complicated. i am well up for restoring it though if someone will use it (it was my idea in the first place, iirc)

ara4n avatar Jul 24 '19 22:07 ara4n

There's lots of green:

@mscbot fcp merge

turt2live avatar Aug 12 '19 15:08 turt2live

This FCP proposal has been cancelled by https://github.com/matrix-org/matrix-doc/pull/2190#issuecomment-806955883.

Team member @mscbot has proposed to merge this. The next step is review by the rest of the tagged people:

  • [ ] @dbkr
  • [ ] @uhoreg
  • [x] @turt2live
  • [ ] @ara4n
  • [x] @anoadragon453
  • [x] @richvdh
  • [x] @erikjohnston
  • [x] @KitsuneRal

Concerns:

  • whether this is something we should be encouraging ─ https://github.com/matrix-org/matrix-doc/pull/2190#discussion_r595444852

Once at least 75% of reviewers approve (and there are no outstanding concerns), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for information about what commands tagged team members can give me.

mscbot avatar Aug 12 '19 15:08 mscbot

I think this was disabled because it had a tendency to take out synapse, especially when the freenode bridge randomly did it. It may be less of an issue if we only fetch events for the AS virtual user, but even then I would imagine that if e.g. the freenode bridge did this then it'd take out one of our synchrotrons. This originally manifested as the fact that bridges would occasionally and randomly decide to call /sync and take out the homeserver, (which was doubly annoying as we didn't know that those calls were so heavy weight until they happened).

To flip this on its head: why do we want this? Would this be rendered useless if sometime in the future we mandated limits to sync (e.g. forcing lazy loading, or only returning the most recent ten rooms, etc)? Should we extend the AS APIs instead? Do we need to better document how to use the existing helper APIs?

In general I'd prefer when we created new APIs that they were designed to be "limited" from the get go, i.e. that they can be implemented such that they will always return at most N bytes of data, thus allowing us to avoid edge case "killer requests" that easily takes out the service. That clearly hasn't happened with /sync, but I'd rather not have ASes relying on being able to get all data at once.

That's not to say this shouldn't happen, but if we do implement /sync then I'd either a) want good reason to or b) make it clear that AS authors shouldn't necessarily rely on being able to get all their state at once via /sync

erikjohnston avatar Aug 12 '19 15:08 erikjohnston

AIUI, the proposal boils down to /sync and /events resolving each bot's as_token to the respective "@sender_localpart:example.org" by default instead of forcing to pass user_id=@sender_localpart:example.org. Unless I'm mistaking the freenode bridge was taking homeserver out because of timeline calculation for a huge number of users matching the namespace. The proposal explicitly rules this out, proposing to keep the same /sync semantics for users (authenticated by access_token) and bots (authenticated by as_token).

KitsuneRal avatar Aug 13 '19 06:08 KitsuneRal

@KitsuneRal tbh even if you only matched sender_localpart, the freenode bridge would match something like 10k rooms. Not that I think it's a huge concern, there is no reason for bridges to call /sync during normal operation. The only purpose I can see behind it is for debugging (and EDUs, perhaps. Though that's being solved separately)

Half-Shot avatar Aug 13 '19 09:08 Half-Shot

It's worth mentioning that even though Synapse's old functionality was to return /sync data of all appservice users when the appservice bot requested /sync, the Freenode bot is still in every single bridged Freenode room ever created, which would definitely be a lot of data.

While it would be nice to just extend the AS API to help here, we have to be careful to not forget about it as we continue to put more and more features down /sync and nowhere else.

anoadragon453 avatar Aug 13 '19 09:08 anoadragon453

honestly if the freenode bridge hits /sync, we've done something terribly wrong. Can we stop optimizing the proposal for a bridge that won't use sync?

turt2live avatar Aug 13 '19 14:08 turt2live

honestly if the freenode bridge hits /sync, we've done something terribly wrong. Can we stop optimizing the proposal for a bridge that won't use sync?

There's nothing in this MSC that says that bridges shalt not use this endpoint? If bridges are meant to use it then we need to worry about big bridges, if bridges aren't meant to use it then we should figure out what it is meant to be used for.

This feature was removed because it was causing major service outages, I think for this to be re-added we need to have some guarantees that it won't cause similar issues.

erikjohnston avatar Aug 16 '19 09:08 erikjohnston

There shouldn't be anything in the MSC/spec that says things should(n't) use a particular endpoint. I don't think it's realistic for us to solve operational problems in the spec. From an ops perspective, I wouldn't recommend that a server owner deploy a popular bridge which uses /sync for performance reasons (unless their server implementation is optimized for it, somehow). Similarly, I wouldn't expect that a bridge even touch /sync if it knows it is the wrong choice for what they're doing.

It's a strange restriction on appservices: the sender_localpart can't sync, but a user in an equal number of rooms which lives in the users namespace can? Do we just need another MSC for an error code on /sync to say "nah, your account is too large"?

turt2live avatar Aug 16 '19 14:08 turt2live

Operational issues are caused by the fact the spec says we must support unbounded data fetching requests, and so the spec can fix those operational issues by not mandating we support those APIs. I don't think it is unreasonable to take into account implementation and operational concerns when considering what to put in the spec?

Again, I would quite like to hear use cases for /sync. It seems unhelpful to have APIs that only work if the appservice are "small" enough? Does that mean if a bridge/bot you're hosting becomes popular it suddenly stops working because it was relying on /sync?

erikjohnston avatar Aug 16 '19 15:08 erikjohnston

One use case would be go-neb: it doesn't use appservice transactions but fits really nicely for reserving a namespace of users. Because the spec says that the sender_localpart is required and should not be in the namespace, the registration should be reserving some arbitrary ID and some namespace of users for it to use. Although it doesn't make use of it now, it doesn't seem unreasonable that the sender_localpart user become a management interface for the bots running in the namespace. For ease of development, that would be a great time to use sync as it is currently used for everything else (the management bot becomes a plug-and-play bot like the rest of them).

Admittedly, this is not a strong use case. If we're adamant that appservices should not be syncing then we should disable sync for all the users in the namespaces, not just the sender_localpart. It's a silly restriction to block one user but not the others which can easily be in just as many rooms.

turt2live avatar Aug 16 '19 15:08 turt2live

One use case would be go-neb: it doesn't use appservice transactions but fits really nicely for reserving a namespace of users. Because the spec says that the sender_localpart is required and should not be in the namespace, the registration should be reserving some arbitrary ID and some namespace of users for it to use. Although it doesn't make use of it now, it doesn't seem unreasonable that the sender_localpart user become a management interface for the bots running in the namespace. For ease of development, that would be a great time to use sync as it is currently used for everything else (the management bot becomes a plug-and-play bot like the rest of them).

I don't really follow why the management user ID would want to call sync in that case?

Admittedly, this is not a strong use case. If we're adamant that appservices should not be syncing then we should disable sync for all the users in the namespaces, not just the sender_localpart. It's a silly restriction to block one user but not the others which can easily be in just as many rooms.

To a certain extent I agree, though I think its more likely that the sender_localpart will be in a lot of rooms compared with their ghosts.

erikjohnston avatar Aug 28 '19 12:08 erikjohnston

If bridges are meant to use it then we need to worry about big bridges, if bridges aren't meant to use it then we should figure out what it is meant to be used for.

Bridges don't need to be big. This is meant as a first step for user-hostable appservices

tulir avatar Aug 28 '19 12:08 tulir

Bridges don't need to be big. This is meant as a first step for user-hostable appservices

Absolutely true, but they do often have a tendency to grow big after a while.

What do you mean by user-hostable appservices?

erikjohnston avatar Aug 28 '19 12:08 erikjohnston

What do you mean by user-hostable appservices?

Appservices hosted on the user's machine instead of on the server. When this is combined with future things like e2ee and some kind of dynamic appservice registration, users could have their own bridges without trusting the server. Even without e2ee, it could be used to avoid being banned from a hostile platform due to many users using the same bridge.

tulir avatar Aug 28 '19 14:08 tulir

any chance we could get this merged @ara4n?

ericmigi avatar Feb 16 '20 19:02 ericmigi

@ericmigi sorry - missed this. as per our chat, it sounds like you're not blocked on it currently, but it's still on our radar as part of upcoming catchups on our MSC backlog.

ara4n avatar Mar 29 '20 01:03 ara4n

The proxy I made for dynamic appservice allocation also removes the immediate need for this change, but it would still be nice to have so other people could use end-to-bridge encryption without having to use the proxy or other hacks.

tulir avatar Mar 29 '20 17:03 tulir

I see two separate issues to be resolved here:

  1. There's an odd discrepancy between allowing virtual users to call /sync and disallowing the bot user from calling it. We should be consistent on allowing all or none to call /sync.

  2. The benefits for allowing appservice users to call /sync are not clear. The biggest usecase I see is sending EDUs such as bridging typing notifications and read receipts over the bridge. But this can also be achieved by sending that information in appservice transactions. Are there any other use-cases?

It seems that having a single stream of transactions from HS->AS would be simpler to reason able and use less resources than separate /sync calls for each individual appservice user.

anoadragon453 avatar Jul 25 '20 19:07 anoadragon453

We should be consistent on allowing all or none to call /sync.

Arbitrary limitations are bad, so all should be allowed.

Are there any other use-cases?

I guess to_device events aren't really counted as EDUs.

Sending EDUs and to-device events down the AS transaction stream would be nicer of course, but it's also much harder to implement server-side. The proposal to send EDUs to appservices (#2409) is even more stuck than this one and nobody has even attempted to implement it.

tulir avatar Jul 25 '20 20:07 tulir

Sending EDUs and to-device events down the AS transaction stream would be nicer of course, but it's also much harder to implement server-side. The proposal to send EDUs to appservices (#2409) is even more stuck than this one and nobody has even attempted to implement it.

Note that matrix-org/matrix-spec-proposals#2409 is much less stuck these days (thanks to server and bridge side implementations!). to-device messages are also planned to be sent to appservices through a similar mechanism (though matrix-org/matrix-spec-proposals#2409 does not mention them).

Which all seems like we're trending away from appservices using /sync and instead passively receiving events through the Application Service API instead?

anoadragon453 avatar Jan 26 '21 17:01 anoadragon453

@mscbot concern whether this is something we should be encouraging ─ https://github.com/matrix-org/matrix-doc/pull/2190#discussion_r595444852

erikjohnston avatar Mar 23 '21 17:03 erikjohnston

so we've discussed this as a spec core team and have come to the conclusion that Matrix isn't quite ready for this change just yet. Appservices tend to have extremely large accounts for the sender_localpart user, and regardless of server implementation that means /sync is potentially harmful to a server's survival.

While the spec doesn't typically concern itself with the performance of endpoints to drive whether or not they are allowed, in this case it's clear that the appservice spec is trying to give a different route for appservices to use. We think that the immediate use case would be solved by other MSCs (https://github.com/matrix-org/matrix-doc/pull/2409 for example), and that /sync needs more work before we can consider large appservice accounts being able to sync. There are workarounds to this spec limitation currently possible, which allows appservices to sync if they really want to.

For sync improvements, we're looking at something like paginated sync (https://github.com/matrix-org/matrix-doc/issues/3076 - I thought we had an issue for this already, but apparently not) where we can reduce the impact of large accounts syncing. We're less concerned with users in an appservice namespace syncing because they are typically in fewer rooms, though we acknowledge the inconsistency here.

Process-wise, this means we're changing tracks to postpone this MSC until after "sync improvements" are made, where we can re-evaluate this MSC's purpose in that world.

@mscbot fcp cancel

turt2live avatar Mar 25 '21 15:03 turt2live

@mscbot fcp postpone

turt2live avatar Mar 25 '21 15:03 turt2live

Team member @turt2live has proposed to postpone this. The next step is review by the rest of the tagged people:

  • [ ] @dbkr
  • [x] @uhoreg
  • [x] @turt2live
  • [x] @ara4n
  • [x] @anoadragon453
  • [x] @richvdh
  • [x] @erikjohnston
  • [ ] @KitsuneRal

Once at least 75% of reviewers approve (and there are no outstanding concerns), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for information about what commands tagged team members can give me.

mscbot avatar Mar 25 '21 15:03 mscbot

:bell: This is now entering its final comment period, as per the review above. :bell:

mscbot avatar Apr 06 '21 15:04 mscbot

The final comment period, with a disposition to postpone, as per the review above, is now complete.

mscbot avatar Apr 11 '21 15:04 mscbot