matrix-spec-proposals
MSC2716: Incrementally importing history into existing rooms
A proposal for letting ASes specify event parents and timestamps when submitting events, letting them much more effectively incrementally insert past conversation history. This is getting increasingly topical given the need to bridge existing conversation archives from existing chat systems into Matrix. Fixes most of https://github.com/matrix-org/matrix-doc/issues/698 hopefully.
Homeserver implementations:
- Synapse: https://github.com/matrix-org/synapse/pull/9247 (and the many, many other PRs)
- More generally, see the RoomBatchHandler
Client implementations:
- Element: https://github.com/matrix-org/matrix-react-sdk/pull/8354
cc @tulir for feedback, as the main consumer of the ?ts= API today...
If I understand this correctly, this requires the application service to insert all the historical data before the user requests it.
Isn't it an idea to create a new querying API and request backlog from the AS, like homeservers currently do when they ask other federated servers for historical events? This would block the homeserver while the AS prepares the events, provides them to the homeserver using the APIs outlined in this MSC, and then the AS could return something like an array containing all the event IDs that were created during the request.
This would lead to an even cleaner SS-like integration with ASes, without creating a buttload of work for every AS. Otherwise, how much would they need to import up front, especially if the room has events going back ten years or more?
If I understand this correctly, this requires the application service to insert all the historical data before the user requests it.
yup, as per the Potential Issues section:
This doesn't provide a way for a HS to tell an AS that a client has tried to call /messages beyond the beginning of a room, and that the AS should try to lazy-insert some more messages (as per https://github.com/matrix-org/matrix-doc/issues/698). For this MSC to be properly useful, we might want to flesh that out.
What's currently holding this change back, if anything?
@Avamander Currently, I think it's mostly just waiting for a homeserver implementation to validate the MSC. I have one going in Synapse that's waiting for some review:
- https://github.com/matrix-org/synapse/pull/9247
- https://github.com/matrix-org/complement/pull/68
I wonder, will this also help to retrieve the room history when upgrading to a new room version?
@MightyCreak this might help, but such a feature is marginally more complicated than it might look on its surface (who fetches the history of the previous room?, etc.)
(Please add these kinds of comments into a thread in the future though, or ask/discuss them in #matrix-spec:matrix.org)
@MadLittleMods I believe this MSC has received the sanity review it was after and therefore am removing it from the SCT's backlog board. If this is false, please raise it in the SCT office for reconsideration.
What does this comment from the linked PR mean? Could you give more information on what happened?
Abandoning PR as I don't see MSC2716 going further now that Gitter has fully migrated to Matrix
https://github.com/matrix-org/matrix-react-sdk/pull/8354#issuecomment-1480588494
@moritzdietz Based on your comment in #matrix-spec, I think you have a misunderstanding on how this relates to Gitter.
We were able to import all 141M messages from Gitter to Matrix without MSC2716. We used the single /send endpoint with the timestamp massaging ?ts=xxx query parameter split between a "historical" and live room.
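As a rough illustration of the timestamp massaging mentioned above, here is a minimal sketch of how an application service might build the /send request URL with a ?ts= value. The homeserver URL, room ID, and transaction ID are made up for the example, and it assumes the caller is an application service authenticating separately with its own access token; this is not the actual Gitter import code.

```python
from urllib.parse import quote, urlencode


def build_send_url(homeserver, room_id, event_type, txn_id, ts_ms):
    """Build a /send URL asking the homeserver to use ts_ms (milliseconds) as the event timestamp."""
    path = "/_matrix/client/v3/rooms/{}/send/{}/{}".format(
        quote(room_id, safe=""),      # room IDs contain "!" and ":" which must be escaped
        quote(event_type, safe=""),
        quote(txn_id, safe=""),
    )
    return homeserver + path + "?" + urlencode({"ts": ts_ms})


# Hypothetical values, for illustration only.
url = build_send_url(
    "https://hs.example.org",
    "!historical:example.org",
    "m.room.message",
    "import-0001",
    1400000000000,
)
print(url)
```

The request itself would be a PUT with the event content as the JSON body. Note that ?ts= only changes the timestamp stored on the event; it does not change where the event sits in the room's ordering.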
The big drive to put effort into MSC2716 was the Gitter case, but we were able to accomplish the Gitter migration without it in the end and there is no reliance on it now. Historical import within the DAG is still a very useful concept to have in Matrix, but there are some roadblocks in the MSC that need resolving before it is viable:
- Event ordering over federation
  - Currently, events in Synapse are sorted by (topological_ordering, stream_ordering) where topological_ordering is just depth and is baked into the event when it goes over federation. This means when we try to import between depth 1 and 2, we can only rely on stream_ordering to sort between 1 and 2. Since stream_ordering is just dependent on when the server receives the event, the historical messages can easily get out of order. (some more info in the MSC)
  - To totally fix this problem, it would require a different graph linearization strategy. Perhaps we would do some online topological ordering (Katriel–Bodlaender algorithm) where depth/topological_ordering is dynamically updated whenever new events are inserted into the DAG. This is something extremely sci-fi and a big task though.
- Self-referential batches: There are some ideas in this open discussion but none stand out as great to use.
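To make the event ordering roadblock concrete, here is a small sketch of sorting by (topological_ordering, stream_ordering) on two servers. The Event class, the event names, and the choice to give every historical event depth 1 are simplifications invented for this example; it is not Synapse's actual code or schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str
    depth: int            # topological_ordering: fixed in the event, same on every server
    stream_ordering: int  # assigned locally, in the order this server received the event


def linearize(events):
    """Order events the way the thread describes: by (depth, stream_ordering)."""
    return [e.name for e in sorted(events, key=lambda e: (e.depth, e.stream_ordering))]


# Live events A (depth 1) and B (depth 2), then three historical events
# H1 < H2 < H3 imported between them. They all share depth 1 (assumption).
origin = [
    Event("A", 1, 1), Event("B", 2, 2),
    Event("H1", 1, 3), Event("H2", 1, 4), Event("H3", 1, 5),
]
# A remote server happens to receive the historical events in a different order.
remote = [
    Event("A", 1, 1), Event("B", 2, 2),
    Event("H3", 1, 3), Event("H1", 1, 4), Event("H2", 1, 5),
]

print(linearize(origin))  # ['A', 'H1', 'H2', 'H3', 'B']
print(linearize(remote))  # ['A', 'H3', 'H1', 'H2', 'B']
```

Because depth cannot distinguish the historical events from each other, the only tie-breaker is arrival order, which differs per server.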
There have been lots of good learnings here, but these shortcomings don't instill confidence to keep driving this forward without an underlying reason to do so. Hopefully we can come at this with some fresh ideas to solve these shortcomings when we need this sort of thing again.
Instead of letting these experimental implementations languish in Element and Synapse, I aim to remove them. For the Element case, the PR was never merged, so I could easily just close it.
@MadLittleMods Thank you Eric for clarifying. As you said, I did misunderstand that. I guess the missing link was this bit of information you just shared above, which I haven't seen elsewhere.
I have to add that a bunch of people would really like this functionality in order to migrate to Matrix from other platforms, without losing years if not decades of message history.
Understandable that it's a difficult thing to implement, but it'd be very useful to a lot of users.
It would be really unfortunate to hear that importing history won't be possible in the foreseeable future. This, like Avamander already mentioned, is for me the only thing which keeps me from adopting Matrix fully and convincing others to do so. Right now it is unavoidable to keep every other client with valuable information installed as well if one wants to look up something historical. Most people I come across don't want to start anew but want to take all their data with them, though this might not be the case in general.
@MadLittleMods I got a bit confused by your last comment - did I understand correctly that, since the event ordering only depends on topological_ordering and stream_ordering, the usage of the ?ts=xxx query parameter was only to achieve correct metadata for the messages, but doesn't influence ordering at all?
As a thought, is topological_ordering unsigned or signed? If it is signed, this might then make the retrospective insertion of non-interweaving history possible, i.e. if a user switches messengers without a period of using both, by simply giving the historical messages negative values to enqueue them before the messages already in the room.
In the case of unsigned, at least an import right at the beginning (before any more recent messages are sent in the room) would be possible without a change of how event ordering is handled, right?
In any case, can you point me to some PRs or discussions on how the current implementation of the event ordering came to be? When I first subscribed to this PR, the description had me thinking that event ordering would simply be done by timestamp, which would seem to solve any problems of inserting historical messages. But I'm sure that there are other issues which were circumvented by using the current implementation, and I'd like to understand this process to not make any pointless suggestions.
I got a bit confused by your last comment - did I understand correctly that, since the event ordering only depends on topological_ordering and stream_ordering, the usage of the ?ts=xxx query parameter was only to achieve correct metadata for the messages, but doesn't influence ordering at all?
In the Gitter case, we started with a fresh room for the historical messages and imported them one by one so the topological_ordering was correct. We also used /send?ts=xxx to make the timestamps correct. Then we connected the historical and "live" rooms together with an m.room.tombstone and an MSC3946 predecessor event. This functionality is completely separate from MSC2716 and works fine today.
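For readers unfamiliar with the tombstone half of that link, here is a minimal sketch of what an m.room.tombstone state event looks like. The room ID and body text are invented, and it assumes the tombstone lives in the historical room and points at the live one. The MSC3946 predecessor event is left out because its fields are not given in this thread.

```python
def tombstone_event(replacement_room_id, body="This room has been replaced"):
    """Return the pieces of an m.room.tombstone state event pointing at another room."""
    return {
        "type": "m.room.tombstone",
        "state_key": "",  # tombstones use an empty state key
        "content": {
            "body": body,
            "replacement_room": replacement_room_id,
        },
    }


# Hypothetical live room ID, for illustration only.
event = tombstone_event("!live:example.org", body="History continues in the live room")
print(event)
```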
As a thought, is topological_ordering unsigned or signed?
See my previous comment: "topological_ordering is just depth and is baked into the event when it goes over federation"
You can see depth as part of the PDU (persistent data unit) in the spec: https://spec.matrix.org/v1.5/rooms/v10/#event-format-1
by simply giving the historical messages negative values to enqueue them before the messages already in the room.
Importing messages at the beginning of a room is only one use case. We also want to be able to import between any two events and even between already imported messages. One example is if you're importing a mail or newsgroup archive and you stumble across a lost mbox years later with a few more messages, you want to fill in that history.
If your use case is just one import blast at the beginning of a room, the way Gitter accomplished this works now and is a lot simpler (do that instead).
In any case, can you point me to some PRs or discussions on how the current implementation of the event ordering came to be? When I first subscribed to this PR, the description had me thinking that event ordering would simply be done by timestamp, which would seem to solve any problems of inserting historical messages. But I'm sure that there are other issues which were circumvented by using the current implementation, and I'd like to understand this process to not make any pointless suggestions.
Matrix is a DAG (directed acyclic graph) of events. depth being baked in is kinda a "get out of jail free" card on how to linearize the DAG.
This design decision is before my time and I don't know of any good references. Maybe someone in #matrix-spec:matrix.org has some context
- https://github.com/matrix-org/gomatrixserverlib/issues/187 is the best reference I know of for graph linearization (how to go from a DAG to a list of events in order) in general though
- Related event ordering issue: https://github.com/matrix-org/matrix-spec/issues/852
- Synapse docs on depth and stream ordering: https://github.com/matrix-org/synapse/blob/66ad1b8984eb536608e0915722c6a0b4493bb9df/docs/development/room-dag-concepts.md#depth-and-stream-ordering
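As a toy illustration of graph linearization in general, here is a sketch that uses Python's standard graphlib module on a made-up five-event DAG, alongside a depth computed as one more than the deepest parent. The event names and the prev_events mapping are hypothetical, and this is a generic topological sort, not how Synapse or any other homeserver orders events.

```python
from graphlib import TopologicalSorter

# Hypothetical room DAG: each event maps to its prev_events (its parents).
prev_events = {
    "create": [],
    "A": ["create"],
    "B": ["A"],
    "C": ["A"],       # B and C are concurrent forks from A
    "D": ["B", "C"],  # D merges the fork
}

# One valid linearization: every event appears after all of its parents.
order = list(TopologicalSorter(prev_events).static_order())
print(order)


def depth(event_id):
    """Depth as one more than the deepest parent; an event with no parents gets 1."""
    parents = prev_events[event_id]
    return 1 + max((depth(p) for p in parents), default=0)


print({e: depth(e) for e in prev_events})
```

B and C end up with the same depth, so a depth-based ordering still needs a tie-breaker between them, while the topological sort is free to emit them in either order. That ambiguity is why a linearization strategy has to be chosen explicitly.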