HTTP Header last_modified BUG
Hi. A page cached by WP Rocket does not report the real last-modified date in its Last-Modified HTTP header; it reports the time at which the page cache was generated. For example: I write a blog post with Christmas wishes and publish it, and the Last-Modified HTTP header returns a 24 December date. If I clear the whole WP Rocket cache and regenerate it, the same URL returns a date after 24 December, reflecting when the cache was generated (or regenerated). This is a very serious problem and an SEO-destructive bad practice. I'm waiting for a fix soon.
Hello @DarkNight97boss, we were discussing this approach internally lately. Could you elaborate on SEO destructive bad practices?
Hi, a blog article by my system administrator colleague illustrates the problem in detail. You can read it here: https://managedserver.it/perche-non-dovresti-usare-wp-rocket-a-causa-del-bug-last-modified/
@piotrbak
Any news about this SEO issue?
@DarkNight97boss This header has been in our plugin since the beginning, for around 10 years, and using it is standard in caching. Your colleague contacted our support team and we discussed this issue. We contacted Google about this, and I'll check whether we received any feedback saying that it's a real problem.
Greetings! Found this open issue, wanted to check in if there have been any updates?
I believe the potentially "SEO destructive bad practice" @DarkNight97boss alluded to here is that an always-recent Last-Modified header may result in crawlers spending time re-crawling pages that have not changed, and that might impact the "crawl budget" of very large sites. Most sites probably have bigger issues to handle for SEO before this makes or breaks anything, but it's one of many technical considerations to prioritize for optimization.
I would be interested to know if anyone from Google responded to your inquiry, though I suspect they would say something along the lines of "this won't impact you unless you have a VERY large site." So few users would be significantly impacted by this issue.
Thanks so much!
Pinging @DahmaniAdame, do you recall the final summary of the talks?
@geofflambeth yes, indeed. If `last-modified` proves to be an issue, it will only impact medium to large websites.
@piotrbak @geofflambeth No straight answer from Google. They pointed us to their documentation, but there is nothing meaningful there :sweat_smile:
The only details I could get are that crawlers will:
- Rely on `lastmod` on sitemaps and not on the actual `last-modified` response headers.
- Fingerprint content by extracting modification dates from the page using structured data.
- Use the `etag` response header as a signal as well.
There is no mention of the `last-modified` header being involved.
To confirm that, here is a Q&A with Martin Splitt (part of Google's team involved with Google bots, who usually steps in to answer SEO questions): https://youtu.be/am4g0hXAA8Q?t=164 (ironically called SEO mythbusting 😅)
@DarkNight97boss @geofflambeth do you have any documentation from search engines to share about `last-modified` being used by crawlers?
We are happy to consider a switch to `etag`, but it still needs to solve an actual issue.
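For reference, the `etag` option mentioned above could look roughly like the following sketch. It is not WP Rocket code: `content_etag` and `respond` are hypothetical helpers, and the sketch assumes the cached HTML is byte-for-byte identical after a cache regeneration (any generation timestamp embedded in the markup would change the hash and defeat the purpose).

```python
import hashlib
from typing import Optional


def content_etag(html: str) -> str:
    """Build a quoted ETag from the page body only (hypothetical helper)."""
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    return f'"{digest[:16]}"'


def respond(html: str, if_none_match: Optional[str]) -> int:
    """Return 304 if the client's If-None-Match matches the body, else 200."""
    if if_none_match is not None:
        # If-None-Match may carry several comma-separated validators;
        # it uses weak comparison, so a leading W/ is ignored.
        candidates = [
            tag.strip().removeprefix("W/") for tag in if_none_match.split(",")
        ]
        if "*" in candidates or content_etag(html) in candidates:
            return 304
    return 200


page = "<html><body>Christmas wishes</body></html>"
tag = content_etag(page)

# Same body after a cache rebuild: the validator still matches.
status_unchanged = respond(page, tag)
# Edited body: the validator no longer matches.
status_edited = respond(page + "<!-- edited -->", tag)
```

Because the validator depends only on the content, regenerating the cache would not by itself turn a 304 into a 200.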
https://developers.google.com/search/docs/crawling-indexing/large-site-managing-crawl-budget?hl=it#if-modified-since
I think some of the challenge here may come down to variable definitions of "SEO." As John Mueller uses it in the twitter post above, he explicitly says that the Last-Modified header is not a ranking signal ("your site won't rank lower") so it's "not bad for SEO." So if SEO is only about ranking (the SEOs I know seem to agree that it's not only ranking), that's where the conversation would end.
However, Mueller clarifies that it's "good to use last-modification date headers appropriately, as this helps with crawling efficiently." This lines up with Google's documentation regarding crawling of large sites, which states that "Google's crawling is limited by bandwidth, time, and availability of Googlebot instances. If your server responds to requests quicker, we might be able to crawl more pages on your site." The docs go on to describe that some Google crawlers may send the If-Modified-Since header on a request and then act on a 304 (Not Modified) response. That saves load on our servers and lets Google spend its bandwidth, time, and availability on pages that have actually changed.
I've done some testing this morning on a couple of clients' sites with WP Rocket installed. If I supply an If-Modified-Since request header (I'm using Postman, but you could test with anything) identical to or newer than the date in WP Rocket's Last-Modified header, I get that 304 response. However, if I pick a date older than the Last-Modified header, but still more recent than the last true update to the page, I get a normal 200 response status. Following the logic from Google's docs, this could lead to unnecessary use of the limited "bandwidth, time, and availability of Googlebot instances."
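The behaviour I saw can be modelled with a small sketch. This is an illustration of the If-Modified-Since comparison, not WP Rocket's implementation; `conditional_status` is a hypothetical helper and all three dates are made up for the example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def http_date(moment: datetime) -> str:
    """Format an aware UTC datetime as an HTTP date (ends in 'GMT')."""
    return format_datetime(moment, usegmt=True)


def conditional_status(last_modified: str, if_modified_since: str) -> int:
    """Return 304 if the resource is not newer than the client's date, else 200."""
    resource_time = parsedate_to_datetime(last_modified)
    client_time = parsedate_to_datetime(if_modified_since)
    return 304 if resource_time <= client_time else 200


# Hypothetical timeline: real edit, then a crawler visit, then a cache rebuild.
true_edit = datetime(2024, 12, 24, 10, 0, tzinfo=timezone.utc)
crawler_visit = datetime(2025, 1, 2, 8, 0, tzinfo=timezone.utc)
cache_rebuilt = datetime(2025, 1, 5, 3, 0, tzinfo=timezone.utc)

# Header carries the cache rebuild time: the page looks changed.
status_with_cache_time = conditional_status(
    http_date(cache_rebuilt), http_date(crawler_visit)
)
# Header carries the real edit time: the page is reported as unchanged.
status_with_edit_time = conditional_status(
    http_date(true_edit), http_date(crawler_visit)
)
```

With the cache rebuild time in the header the crawler gets a full 200 response, while the real edit time would have produced a 304 for the same request.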
When it would matter
Say I have thousands and thousands of pages, and they all are giving a Last-Modified header within the last day or so (presumably the last time the cache refreshed). Most pages haven't been edited for months or years, but I've got a couple hundred that were updated recently or are updated very frequently. Assuming use of the If-Modified-Since request header, Google will need to use some bandwidth to crawl, index, and ultimately rank the updated pages—but will end up using bandwidth to crawl and consider thousands and thousands of pages that have not changed. In addition to being a somewhat inconsiderate web citizen—signaling that content has changed when it has not—these pages now may be slower to index or update than I might wish. As discussed above, only relevant on very large sites and even then probably not a huge deal. But I believe worth considering to be good citizens of the web and to better accommodate the needs of very large sites.
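To put rough numbers on that scenario, here is a back-of-the-envelope sketch. Every figure in it is hypothetical (10,000 pages, 200 of them recently edited, a crawler that last visited a week ago, a cache rebuilt yesterday), and it assumes the crawler sends If-Modified-Since on every request, which Google's docs only say some crawlers may do.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


def count_full_fetches(
    last_modified_by_url: Dict[str, datetime], crawler_last_visit: datetime
) -> int:
    """Count URLs that would answer 200 (full body) instead of 304."""
    return sum(
        1
        for modified in last_modified_by_url.values()
        if modified > crawler_last_visit
    )


now = datetime(2025, 1, 6, tzinfo=timezone.utc)
crawler_last_visit = now - timedelta(days=7)
cache_rebuilt = now - timedelta(days=1)

# Hypothetical site: 200 pages edited two days ago, the rest a year ago.
true_times = {}
for index in range(10_000):
    age = timedelta(days=2) if index < 200 else timedelta(days=365)
    true_times[f"/page-{index}/"] = now - age

# With a cache-time header, every page reports the rebuild time instead.
cache_times = {url: cache_rebuilt for url in true_times}

needed_fetches = count_full_fetches(true_times, crawler_last_visit)
actual_fetches = count_full_fetches(cache_times, crawler_last_visit)
wasted_fetches = actual_fetches - needed_fetches
```

Under these assumptions 9,800 of the 10,000 full responses are for pages that did not change since the crawler's last visit.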