element-x-android icon indicating copy to clipboard operation
element-x-android copied to clipboard

History missing in EXA public room

Open ara4n opened this issue 2 years ago • 31 comments

Steps to reproduce

the history was there, but when I left and came back to the app I watched it vanish

https://matrix.to/#/!aCvOWKNDeXXpPgquKq:matrix.org/$m-bhJJ48WIdjFMnEdDB50_hP5hpsqf5T3uTv3V-wP58?via=matrix.org&via=envs.net&via=element.io

Outcome

What did you expect?

History should never spuriously vanish

Your phone model

No response

Operating system version

No response

Application version and app store

playstore 0.1.2 (40001020)

Homeserver

pintobyte.com

Will you send logs?

Yes

Are you willing to provide a PR?

No

ara4n avatar Aug 17 '23 21:08 ara4n

Removing the needs rust label until we know enough to create an upstream issue.

jonnyandrew avatar Aug 25 '23 09:08 jonnyandrew

Assigned to me to check rageshakes

manuroe avatar Sep 13 '23 08:09 manuroe

I checked rageshakes we received. We have 2 persons experiencing problems on their own homeserver & SS proxy. No-one is complaining on matrix.org since we fixed problems initially reported on iOS in https://github.com/vector-im/element-x-ios/issues/1197. This issue might be a server version problem.

We are aware that the workaround we made in SDK can generate false negative limited but we have not seen problem yet. The true now fix is to implement limited on the proxy.

I am deprioritising the issue. We can really work on such issue by checking both backend and client logs.

manuroe avatar Sep 15 '23 15:09 manuroe

Ran into the issue on 0.2.0, custom homeserver too. I sent a bug report from EXA, hopefully you can connect it here. Faced the same on the version in Play Store at the time of writing (0.1.6, I think)

I opened the app right around 18:55 in the timezone on the log file: ~~ss-proxy.log~~

Unfortunately it's only INFO level, I'll raise the verbosity and see if I get the issue again and would be able to get the SS proxy logs. Proxy version 0.99.10

The last time I faced the issue on the older version, I think I also had posted something and someone replied, but didn't react to it at that time.

I had notifications turned off for the room at this time, if that makes a difference.

Edit: sorry for the just about unreadable log file with ANSI color escapes. Here's a fixed one:

ss-proxy-fixed.log

0ranki avatar Sep 19 '23 19:09 0ranki

I've seen multiple versions of this bug when I was testing on Conduit.

  1. Some messages of the history are missing, probably a single batch, but it continues normally after that
  2. I can only see a few most recent events and the history before that doesn't load at all, it's just white.

I can reproduce number 2 on both conduit and matrix.org by

  • fully closing the app
  • send two messages to my element x account in a DM
  • clicking on the notification to start the app again

timokoesters avatar Sep 22 '23 11:09 timokoesters

An idea raised when discussing this problem on matrix with @bnjbvr :

Can it be a late keys decryption issue? Like in this flow:

  • on app resume, /sync returns some messages but not the corresponding new keys
  • EAX decides to not display them as UTDs for an unknown reason
  • keys arrive
  • Messages are decrypted in the SDK
  • BUT EAX does not update the timeline with decrypted content. Maybe it updates the timeline items but they stay hidden

We start thinking this is an application problem because iOS do not have this issue.

manuroe avatar Sep 22 '23 11:09 manuroe

[edit: In my specific instance of this problem, I am 100% positive this is] Not a late decryption: there were messages that I had seen in the room before. I background the app. I receive a notification from the app for new messages in this room. I click the notification, thus enter the room. After entering the room, some messages I had seen before backgrounding are now missing.

If I get back to the room list, then into the room again, I can see those messages perfectly fine.

bnjbvr avatar Sep 22 '23 12:09 bnjbvr

Here's how I can reproduce the bug number 2 I mentioned. It does not work every time:

https://github.com/vector-im/element-x-android/assets/25297359/4df0cc1e-49bb-4c73-9ce9-0fe8ec58a3de

timokoesters avatar Sep 22 '23 12:09 timokoesters

@timokoesters Thanks for the repro. Is that with a Synapse + sliding sync proxy server?

bnjbvr avatar Sep 22 '23 13:09 bnjbvr

@bnjbvr Yes, the recording is from a matrix.org account

timokoesters avatar Sep 22 '23 13:09 timokoesters

Cool, thanks for confirming. Also concurrently I did reproduce the issue with my Synapse + sliding sync proxy instances, using your STRs. Thanks a bunch :pray:

bnjbvr avatar Sep 22 '23 13:09 bnjbvr

From watching lots of logs, I suspect the following happens:

  • a back-pagination may start in the room we're looking at
  • sliding sync starts, or was running in the background
  • server returns 1 event aka the latest message) that the SDK wasn't aware of; the sliding sync proxy is right, here, since we knew about all the previous messages
  • client-side limited computation concludes it may be a limited timeline: there's only one event in the response that it doesn't know about
  • that in turn clears all the timeline items while the timline is being back-paginated :boom:

With respect to the progression of the back pagination, the clearing of the timeline may happen before/after/in the middle of pagination, likely resulting in missing messages.

cc @jplatte

bnjbvr avatar Sep 22 '23 16:09 bnjbvr

That sounds realistic. So I think we have to cancel the ongoing back-pagination, if any, when the timeline is reset, right?

jplatte avatar Sep 22 '23 16:09 jplatte

If you click on a notification, you probably want to see the message context and not cancel backpropagation @jplatte

Would it be possible to recover from the limited timeline instead of reconstructing it? Or alternatively to reconstruct it in the background so the user doesn't notice.

timokoesters avatar Sep 22 '23 21:09 timokoesters

@timokoesters eventually, we will be able to handle limited server responses more gracefully than resetting the timeline. There are plans, but they're not simple because we're dedicated to doing things right.

jplatte avatar Sep 23 '23 06:09 jplatte

This was supposed to be fixed by https://github.com/matrix-org/matrix-rust-sdk/pull/2638, but apparently the issue still exists. Do we still think this is a Rust only issue? Is there anything we can do to help investigating this issue?

jmartinesp avatar Oct 06 '23 10:10 jmartinesp

A new rageshake from a room where the problem happened, one that I'm in or that is public, might be helpful. Also wouldn't hurt to add matrix_sdk_ui::timeline=trace to the tracing filter. Currently having a little bit of trouble finding the event IDs for events involved in previous rageshakes.

jplatte avatar Oct 06 '23 10:10 jplatte

Is this related to https://github.com/vector-im/element-x-android/issues/1281? I posted a sliding sync server log in there if helpful.

anoadragon453 avatar Oct 06 '23 10:10 anoadragon453

I initially thought so, but https://github.com/matrix-org/matrix-rust-sdk/pull/2638 should have fixed this if that was the case.

jplatte avatar Oct 06 '23 10:10 jplatte

Fixed in EXA 0.2.4

bmarty avatar Oct 12 '23 12:10 bmarty

Unfortunately I've experienced this or something very similar a few times today on https://github.com/vector-im/element-x-android/commit/4a7b40fe175eaabbea8cc020de842a79c06f5fc8 (debug build from GHA), which was pushed after releasing 0.2.4. I'll start submitting rageshakes for this again; by now I'm used to force stopping Element and restarting it to see missing messages. 😓

rom4nik avatar Oct 12 '23 16:10 rom4nik

Seems like multiple people are still seeing timeline gaps 🙁

jplatte avatar Oct 24 '23 07:10 jplatte

If anyone is seeing this issue with the latest nightly, please rageshake as we need new logs to investigate further

kittykat avatar Oct 26 '23 12:10 kittykat

Additionally, it would be helpful if bug reports included the event IDs of the last event before the gap + first event after the gap. (for public rooms, a screenshot also works)

jplatte avatar Oct 26 '23 16:10 jplatte

I just experienced this issue again and submitted a rageshake with event IDs

davidmehren avatar Oct 30 '23 12:10 davidmehren

This should not happen anymore. Reopen if it is not the case.

manuroe avatar Mar 27 '24 15:03 manuroe

Still happens to me in non-public rooms on 0.4.7.

Xiretza avatar Mar 27 '24 16:03 Xiretza

Unless this was supposedly fixed between 0.4.7 and latest develop, can somebody reopen?

Xiretza avatar Mar 28 '24 19:03 Xiretza

@Xiretza Would you have any steps to reproduce the issue?

bnjbvr avatar Mar 29 '24 10:03 bnjbvr

Not really, unfortunately, it just happens seemingly randomly every once in a while. Might have something to do with my slow-as-molasses sliding sync proxy.

When it happens, the very latest message is (edit: sometimes) visible, then a whole chunk is missing, then history continues as normal.

Xiretza avatar Mar 29 '24 10:03 Xiretza