wordpress-seo
Investigate how to verify indexable content after re-activating plugin and remedy obsolete/stale content
We have a couple of situations now where we clean up wrong indexables via the cleanup routine and when the plugin is activated. We also have a background indexing process that indexes pages created while the plugin was not active. BUT we don't have a system (as far as we can see) that verifies whether the indexables you already have are still correct. Say I have an indexable with permalink /foo for page 100. I disable the plugin, change the permalink from /foo to /bar, then reactivate the plugin. The sitemap will show /foo, which does not exist anymore.
This task is about finding the best possible solution to deal with that.
Steps to reproduce: This inconsistency is not only gonna affect indexable sitemaps in the future, but is affecting us in our production versions as well. An area where we're currently affected that I have confirmed:
- Create a post with the /foo permalink while Yoast is enabled
- Deactivate Yoast
- Change that permalink to /bar
- Re-activate Yoast
- The canonical tag in the frontend of that post is still /foo, although it is now obsolete
- Same for the og:url tag
- There's stuff in the schema as well that still points to /foo, such as the URL of the WebPage (and others, but most of them are just IDs, which makes it less troubling?)
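The reproduction above boils down to a cached value that is no longer compared against its source. As a minimal, language-agnostic sketch (Python is used here purely for illustration; `Post`, `Indexable` and `is_stale` are hypothetical names, not Yoast classes), the missing verification step could look like this:

```python
from dataclasses import dataclass


@dataclass
class Post:
    id: int
    permalink: str


@dataclass
class Indexable:
    """Hypothetical stand-in for a stored indexable row."""
    object_id: int
    permalink: str


def is_stale(indexable: Indexable, post: Post) -> bool:
    """Return True when the cached permalink no longer matches the live one."""
    return indexable.permalink != post.permalink


# Indexable built while the plugin is active.
post = Post(id=100, permalink="/foo")
indexable = Indexable(object_id=post.id, permalink=post.permalink)

# Plugin deactivated: the post changes, the indexable does not.
post.permalink = "/bar"

# After re-activation nothing rebuilds the indexable, so it still says /foo.
stale = is_stale(indexable, post)
```

Here `stale` ends up true, which is the state in which the canonical and og:url output would keep using the old permalink.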
Possible solutions Copying from @thijsoo 's comment below:
- Maybe save an option when deactivating the plugin to indicate from when to start checking the items, and then verify against the indexable whether something needs to be changed. If so, we rebuild that indexable and we are done. Probably in a cron or a background job.
- Detect a likely out-of-sync state in some way and suggest people perform a complete reindex only in that case?
- a combination of the above, or any other route
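One way the "detect a likely out-of-sync state" option could work is a cheap spot check: compare a small random sample of stored permalinks against the live ones and only suggest a reindex when a mismatch shows up. The sketch below is an assumption, not something proposed in the thread; `looks_out_of_sync`, the dictionaries and the sample size are all hypothetical.

```python
import random


def looks_out_of_sync(stored, live, sample_size, rng):
    """Spot-check a random sample of stored permalinks against live ones.

    stored: {object_id: permalink cached in the indexable}
    live:   {object_id: permalink the site currently reports}
    Returns True as soon as any sampled entry disagrees.
    """
    if not stored:
        return False
    ids = rng.sample(sorted(stored), min(sample_size, len(stored)))
    return any(stored[object_id] != live.get(object_id) for object_id in ids)


stored = {100: "/foo", 101: "/about", 102: "/contact"}
live = {100: "/bar", 101: "/about", 102: "/contact"}

# Sample size equals the number of entries here, so the mismatch is always seen.
suggest_reindex = looks_out_of_sync(stored, live, sample_size=3, rng=random.Random(0))
```

With a sample smaller than the full set this is only a heuristic: a clean sample does not prove the indexables are in sync.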
More context here.
A proposed partial solution: maybe save an option when deactivating the plugin to indicate from when to start checking the items, and then verify against the indexable whether something needs to be changed. If so, we rebuild that indexable and we are done. Probably in a cron or a background job.
Another option: what about just detecting a likely out-of-sync state in some way and suggesting people perform a complete reindex only in that case? A little bit of context: https://yoast.slack.com/archives/C03KU0EHCNQ/p1681897001789349
Current proposed plan:
The Plan
- Upon deactivation, we're going to set a deactivation timestamp in the db
- Upon re-activation, we're going to set a re-activation timestamp in the db and then we're going to schedule a twofold scheduled action (or 2 separate ones) that will:
  - detect and correct stale indexables for objects whose modification dates we can see (posts, pages, etc.). We're going to go through only the objects that were modified after our deactivation and before our re-activation.
  - detect and correct stale indexables for objects whose modification dates we cannot see (terms, probably users, etc.). Unfortunately, we're going to have to go through every object in this group (and we'll have to establish a chunked process for doing so)
- keep a flag in the db that indicates we're under-stale-data-detection that will be cleared when we're done. This flag will be useful when indexable sitemaps are rolled out, so that we revert back to non-indexable sitemaps when the flag is active (that's something to be verified with Jono, if it's good enough).
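The plan above can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not plugin code: `PluginState`, `modified_in_window` and `chunked` are hypothetical names, timestamps are plain integers, and the real implementation would store these values as options and run the work in scheduled actions.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional


@dataclass
class PluginState:
    """Hypothetical stand-in for the values the plan stores in the db."""
    deactivated_at: Optional[int] = None
    reactivated_at: Optional[int] = None
    detecting_stale_data: bool = False


def on_deactivate(state: PluginState, now: int) -> None:
    state.deactivated_at = now


def on_reactivate(state: PluginState, now: int) -> None:
    state.reactivated_at = now
    state.detecting_stale_data = True  # cleared once both passes are done


def modified_in_window(modified: Dict[int, int], state: PluginState) -> List[int]:
    """IDs of objects modified between deactivation and re-activation."""
    if state.deactivated_at is None or state.reactivated_at is None:
        return sorted(modified)  # no window known: every object must be checked
    return sorted(
        object_id
        for object_id, timestamp in modified.items()
        if state.deactivated_at <= timestamp <= state.reactivated_at
    )


def chunked(ids: List[int], size: int) -> Iterator[List[int]]:
    """Yield fixed-size batches for objects without a modification date."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


state = PluginState()
on_deactivate(state, now=1000)
on_reactivate(state, now=2000)

post_modified = {1: 500, 2: 1500, 3: 1999, 4: 2500}  # post ID -> modified time
posts_to_check = modified_in_window(post_modified, state)

term_ids = [10, 11, 12, 13, 14]  # no modification dates, so check all of them
term_batches = list(chunked(term_ids, size=2))

state.detecting_stale_data = False  # both passes finished
```

In this example only posts 2 and 3 fall inside the deactivation window, while all five terms are visited in batches of two.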
The Drawbacks
- Until those scheduled actions are finished, we are not sure if we have stale data or not. That (combined with the fact that deactivation/re-activation is a common part of support processes, like conflict tests) is why we can't bluntly purge stuff/prompt users to re-index
- Until those scheduled actions are finished, we do run the risk of showing wrong SEO data in the frontend, much like the replication steps of the issue describe. But it sure is an improvement on our current state. And we make sure not to extend these kinds of issues to the indexable sitemaps, by using the flag mentioned above.
- What happens with current users? Is it worth scheduling those detection/correction actions for ALL users when we release this? There might already be users with stale data because of deactivations/reactivations in the past. Do we try to rectify past mistakes (with the caveat of scheduling 13M heavy scheduled actions across the internet for a niche case)?
Whilst that flag is active, we'll need to fall back to native WP functions for everything. We can't afford to, e.g., have conflicting or incorrect info in the XML sitemap, vs canonical, vs schema, etc.
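A minimal sketch of that fallback, assuming a single gate in front of every indexable read. The function and parameter names are hypothetical, and `native_permalink` stands in for whatever native WP function applies (for a post's URL, something like WordPress's `get_permalink()`).

```python
from typing import Callable, Dict


def resolve_url(
    post_id: int,
    flag_active: bool,
    indexable_permalinks: Dict[int, str],
    native_permalink: Callable[[int], str],
) -> str:
    """Return the URL to output for canonical, og:url, schema and sitemaps.

    While stale-data detection is running (flag_active), or when no indexable
    exists, the native lookup is used so all outputs agree with each other.
    """
    if flag_active or post_id not in indexable_permalinks:
        return native_permalink(post_id)
    return indexable_permalinks[post_id]


live_permalinks = {100: "/bar"}
indexable_permalinks = {100: "/foo"}  # stale entry from before deactivation

during_detection = resolve_url(100, True, indexable_permalinks, live_permalinks.__getitem__)
after_detection = resolve_url(100, False, indexable_permalinks, live_permalinks.__getitem__)
```

Routing every consumer through one resolver like this is what keeps the sitemap, canonical and schema from disagreeing while the flag is set. The second call still returns the stale value only because this example never rebuilds the indexable.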
I agree. I think it is good to mention that conflicting results are how this currently works, and as far as we are aware we don't have support tickets about this.
For now parking this for a longer period since we are not actively working on adding more things to the indexables. When we pick this back up we should revisit if the solution is at all what we want or just restart this.