WordPress-iOS Reader: Crash previewing site when post siteID is missing

Investigation

Based on longs, and testing the issue is reproducible when tapping on author header after blocking user. In this case, post.siteID is returned as nil, resulting in a crash.

Steps to reproduce

https://github.com/wordpress-mobile/WordPress-iOS/assets/4062343/7e3da114-ab28-40bf-860d-79da2b088034

Open Reader
Discover
Select an article
Open context menu on the header
Block user
Click on header
Crash

Sentry Issue: JETPACK-IOS-15VK

EXC_BREAKPOINT: siteID
  File "<compiler-generated>", in value
  File "ReaderDetailCoordinator.swift", line 400, in ReaderDetailCoordinator.previewSite
  File "ReaderDetailCoordinator.swift", line 686, in ReaderDetailCoordinator.didTapBlogName
  File "<compiler-generated>", line 685, in ReaderDetailCoordinator
  File "ReaderDetailNewHeaderView.swift", line 175, in ReaderDetailHeaderViewModel.didTapAuthorSection
...
(31 additional frame(s) were not displayed)

Dec 06 '23 12:12 sentry[bot]

This issue still occurs 51 times, affecting 42 users in version 24.7. Perhaps there are different paths that lead to siteID being nil? I'll reopen this issue, but feel free to close this again if this has been addressed somewhere else.

I've taken a look at the crash logs, and looks like it might involve some circular navigation, roughly like so:

Opening a Reader post (e.g., post id: 1).
Like the post.
Tap the blog header to open the site's stream.
Tap the same post (post id: 1).
Tap the blog header to go to the site stream again.
Tap the same post with id: 1 again.
Unlike the post.
Go back.

I've noticed some weird behavior where the loading never ends, or if you unliked from the site stream and go back, unliking from the Reader Detail does nothing.

May 10 '24 19:05 dvdchr

I reproduced this on 25.0 beta and confirmed that the crash event appeared in this Sentry issue.

Fresh install of Jetpack app
Open Reader and tap a post
Tap its heading
Repeat the above two steps a couple of times until the app appears to freeze on a ghost-loading screen
Tap back button
App crashed

I could reproduce this on 2 out of 2 attempts, although the second time needed more tapping.

May 27 '24 22:05 guarani

Here's a video of the 2nd attempt, which took 1m to crash:

https://github.com/wordpress-mobile/WordPress-iOS/assets/1898325/0d5200b1-6205-4891-a22d-59bf1eb3c246

May 27 '24 22:05 guarani

@dvdchr, @guarani

You're right. It's a more global issue across the Reader and I was able to reproduce it.

I haven't worked much with anything around Reader but as far as I understood the behavior, we fetch posts every time the header is tapped and the site post list is opened. After the fetch, we persist posts in the database, invalidating previously saved post objects.

When we tap back, ReaderDetailCoordinator and ReaderDetailToolbar end up holding invalid ReaderPost objects, and we get a crash when we try to access object properties.

May 29 '24 10:05 staskus

After the fetch, we persist posts in the database, invalidating previously saved post objects.

When we tap back, ReaderDetailCoordinator and ReaderDetailToolbar end up holding invalid ReaderPost objects, and we get a crash when we try to access object properties.

@staskus Yes, I believe you're right on the spot 👍 . Most of the Reader's fetching logic uses a drop-and-replace approach, in which previously stored posts are discarded for a given readerTopic. This is why going into the same site's feed multiple times invalidates the previous screen's data.

https://github.com/wordpress-mobile/WordPress-iOS/blob/fab138e42051867503fbb90308e3ac51a733b984/WordPress/Classes/Services/ReaderPostStreamService.swift#L32

I think the root cause is that our approach to syncing Reader content does not match the current UX architecture. There is nothing wrong with a drop-and-replace approach—but if we want to go that route, the UX hierarchy needs to be kept as flat as possible to prevent stale/dangling content. Meanwhile, in our case, the current Reader feed can interlink and be stacked onto each other (as described by the repro steps), and it could also be displayed in different places; for example, the Reader Search.

Some solutions that came to mind:

Before pushing content or on viewWillDisappear, we mark the content on the current screen as "stale". When the user comes back to this screen, we'll show a loading state and re-fetch the content again. Perhaps we could also conditionally mark it as stale based on the managed object's validity when the user returns to the screen.
Move to a DTO structure instead of hooking directly into an NSManagedObject or an NSFetchedResultsController. This way, we are sure that loaded contents will remain the same even if the data no longer exists; but this will need a larger refactor effort.
Update the Reader fetching logic to perform upserts instead of drop-and-replace. It is quite risky, though, as we need to watch out for duplicated content and stale data.

May 29 '24 10:05 dvdchr

Thanks for the reply, @dvdchr!

Some solutions that came to mind

Agree. I was thinking and experimenting locally with very similar solutions myself this morning. I wondered what could be achieved in the scope of a bug fix without introducing more complexity or expanding the scope too much.

I looked at a third option first, and I agree that this is risky. We would need to understand current logic well and why certain things were done the way they were done so we wouldn't introduce some more issues.

I considered the first option but I was concerned that on its own it may still leave some chances for corner cases to pop up. As you mentioned, it may require marking the object's validity.

Coupling DTO with viewWillAppear re-fetch of content could be a safer option but as you correctly pointed out may require a larger refactor effort.

Could we rely on postId rather than post object to access it from the db instead of passing the whole object to a coordinator? I assume it doesn't change in between post-fetch? I concede that it could still leave some corner cases where the post is deleted from the remote and is not found anymore when we come back in the flow.

May 29 '24 11:05 staskus

Could we rely on postId rather than post object to access it from the db instead of passing the whole object to a coordinator? I assume it doesn't change in between post-fetch?

Yes, the post ID should remain the same, and we'll need the site ID too. But yes, relying on these IDs should be safe enough. 👍

I concede that it could still leave some corner cases where the post is deleted from the remote and is not found anymore when we come back in the flow.

That's true, but I think this is an "intended" error state—much like network failures or no results state. We can show an error state when a post is no longer available (perhaps due to the author deleting it or setting it to private). For comparison, on the web, here's what it looks like when trying to access a "dead link" where the site has been deleted:

In the meantime, I'll add this issue to our team's board and take a stab at it on my next rotation (that is, if you don't beat me to it! 😄). Thanks for investigating this @staskus, appreciate it!

May 29 '24 12:05 dvdchr

In the meantime, I'll add this issue to our team's board and take a stab at it on my next rotation

Thanks, @dvdchr, sounds good!

May 30 '24 05:05 staskus

FYI I re-assigned the Sentry issue to you @dvdchr. Feel free to change it again if needed 👍

Jun 03 '24 17:06 guarani

Sounds good to me. Thank you @guarani!

Jun 03 '24 17:06 dvdchr

Hey @guarani / @staskus 👋 , I've created #23317 to address this issue (and then some). I'd appreciate some extra pair of eyes to verify that the fix works well! Thanks before. 🙂

Jun 06 '24 20:06 dvdchr

FYI, based on @staskus' recommendation, I've closed #23317 and opened #23330 instead.

Jun 07 '24 17:06 dvdchr