RSS fetcher doesn't fetch conditionally

Open michaelnordmeyer opened this issue 9 months ago • 8 comments

The RSS fetcher should be doing conditional requests, according to https://github.com/MarginaliaSearch/MarginaliaSearch/issues/136#issuecomment-2563756729, but apparently it doesn't:

$ ll /var/www/michaelnordmeyer.com/feed.xml*
-rw-r--r-- 1 user group 136066 Apr  5 11:10 /var/www/michaelnordmeyer.com/feed.xml
-rw-r--r-- 1 user group  21350 Apr  5 11:10 /var/www/michaelnordmeyer.com/feed.xml.gz
$ grep marginalia michaelnordmeyer.com.log 
[04/Apr/2025:15:57:10.133 +0000] 200 193.183.0.165 "GET /feed.xml HTTP/2.0" "-" "search.marginalia.nu" gzip/21225
[05/Apr/2025:18:32:57.705 +0000] 200 193.183.0.165 "GET /feed.xml HTTP/2.0" "-" "search.marginalia.nu" gzip/21350
[06/Apr/2025:21:08:11.011 +0000] 200 193.183.0.165 "GET /feed.xml HTTP/2.0" "-" "search.marginalia.nu" gzip/21350

The last request at 06/Apr/2025:21:08:11.011 should have been conditional and should have resulted in a 304. The server's timezone is set to UTC.

I run nginx mainline:

$ nginx -v
nginx version: nginx/1.27.5
  • Files are pre-gzipped and used by nginx with gzip_static on.
  • Only default ETags are being used.

michaelnordmeyer avatar Apr 07 '25 20:04 michaelnordmeyer

Short version: I think this is nginx jank in how it deals with If-Modified-Since (which is to say, poorly). In the short term I'm altering the logic to only send the If-None-Match header if it is available, and omit If-Modified-Since unless that's the only option, as that seems to solve the problem.
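A minimal sketch of the header selection described above, written in Python for illustration. It is not the code from the actual commit; the function name `conditional_headers` and its parameters are hypothetical.

```python
from typing import Optional


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> dict[str, str]:
    """Choose validators for a conditional GET (hypothetical helper).

    Prefer the ETag; send If-Modified-Since only when no ETag is known.
    """
    if etag:
        return {"If-None-Match": etag}
    if last_modified:
        return {"If-Modified-Since": last_modified}
    # Nothing cached yet: make a plain, unconditional request.
    return {}


# Both validators known: only If-None-Match is sent.
print(conditional_headers('"680a29d3-21a6d"', "Thu, 24 Apr 2025 12:08:51 GMT"))
# No ETag known: fall back to If-Modified-Since.
print(conditional_headers(None, "Thu, 24 Apr 2025 12:08:51 GMT"))
```

Sending a single validator sidesteps the server behavior discussed in this thread, at the cost of dropping the fallback for older intermediaries that only understand If-Modified-Since.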

Long version: I've done some digging. I can't seem to make your server respect If-Modified-Since. Using curl, I made a request and the server returned these headers:

last-modified: Thu, 24 Apr 2025 12:08:51 GMT
etag: "680a29d3-21a6d"

So I ran

  curl \
    -H"If-Modified-Since: Fri, 25 Apr 2025 10:49:13 GMT"\
    -H"If-None-Match: \"680a29d3-21a6d\""\
    -H"User-Agent: search.marginalia.nu"\
    https://michaelnordmeyer.com/feed.xml

and this gives me a 200.

Though this gives me a 304:

  curl \
    -H"If-None-Match: \"680a29d3-21a6d\""\
    -H"User-Agent: search.marginalia.nu"\
    https://michaelnordmeyer.com/feed.xml

I can't get it to 304 on just If-Modified-Since.

At least according to MDN, the server should ignore If-Modified-Since in the presence of If-None-Match.

(I'm also curiously getting different E-Tags for the same endpoint when I run the search engine's feed fetcher, though I'm seeing the same behavior there, it works with just the etag, but not with both fields populated.)

As mentioned, I think this is nginx not dealing well with the If-Modified-Since header. I'm seeing the same weird behavior when testing with my blog.

vlofgren avatar Apr 25 '25 11:04 vlofgren

Committed fix 77f727a5babff0cd6445af6f9270f410b2bc98dd

vlofgren avatar Apr 25 '25 11:04 vlofgren

Thank you for looking into it.

As you can see from my output above, I host static files which have been pre-gzipped by me to lower request processing even more. I have updated my issue above with nginx settings.

I will investigate on my side and use your findings as well.

michaelnordmeyer avatar Apr 25 '25 11:04 michaelnordmeyer

I even dug up the spec on this. It seems really clear on how the server should act in this scenario:

A recipient MUST ignore If-Modified-Since if the request contains an If-None-Match header field; the condition in If-None-Match is considered to be a more accurate replacement for the condition in If-Modified-Since, and the two are only combined for the sake of interoperating with older intermediaries that might not implement If-None-Match.
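The precedence rule quoted above can be sketched as follows. This is a simplified Python illustration of how a server could evaluate the two headers for a GET request, not nginx's implementation; the function names are hypothetical, and the comma split assumes ETags that contain no commas.

```python
from email.utils import parsedate_to_datetime


def strip_weak(tag: str) -> str:
    """Drop a leading W/ so tags can be compared weakly."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def status_for_get(request_headers: dict, etag: str, last_modified: str) -> int:
    """Return 304 or 200 for a conditional GET (simplified sketch)."""
    inm = request_headers.get("If-None-Match")
    if inm is not None:
        # If-None-Match is present, so If-Modified-Since is ignored entirely.
        candidates = [strip_weak(t) for t in inm.split(",")]
        if "*" in candidates or strip_weak(etag) in candidates:
            return 304
        return 200
    ims = request_headers.get("If-Modified-Since")
    if ims is not None:
        try:
            if parsedate_to_datetime(last_modified) <= parsedate_to_datetime(ims):
                return 304
        except (TypeError, ValueError):
            pass  # an invalid date means the header is ignored
    return 200


headers = {
    "If-None-Match": '"680a29d3-21a6d"',
    "If-Modified-Since": "Fri, 25 Apr 2025 10:49:13 GMT",
}
print(status_for_get(headers, '"680a29d3-21a6d"', "Thu, 24 Apr 2025 12:08:51 GMT"))
```

Under this reading, the request that sent both headers with a matching ETag would be answered with a 304 regardless of the date in If-Modified-Since.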

Though I'm too used to specs and reality being two wildly different things when dealing with web servers :P I'll write an issue on the nginx issue tracker and see what they have to say about it.

vlofgren avatar Apr 25 '25 11:04 vlofgren

Testing with your example from above, for

curl \
  -H"If-Modified-Since: Thu, 24 Apr 2025 12:08:51 GMT"\
  -H"User-Agent: search.marginalia.nu"\
  https://michaelnordmeyer.com/feed.xml

nginx returns a 304, which is correct. But if the date in If-Modified-Since is not the exact date from last-modified, nginx returns a 200.

This raises the question of whether a request with an arbitrary If-Modified-Since date is valid as a conditional request, because the relevant date is the resource's last-modified date, not the date of the request.

But you are quite right with nginx not ignoring the If-Modified-Since in the presence of If-None-Match.

By the way, because the site is built by Jekyll, the feed.xml will be regenerated every time I push a change. And even if the feed's content didn't change, the <updated> tag of the feed always will, because it's the build date, resulting in a new ETag and last-modified response header.

And thank you for opening an issue with nginx.

michaelnordmeyer avatar Apr 25 '25 12:04 michaelnordmeyer

Regarding the relevant date for If-Modified-Since, RFC 9110 states:

When used for cache updates, a cache will typically use the value of the cached message's Last-Modified header field to generate the field value of If-Modified-Since. This behavior is most interoperable for cases where clocks are poorly synchronized or when the server has chosen to only honor exact timestamp matches (due to a problem with Last-Modified dates that appear to go "back in time" when the origin server's clock is corrected or a representation is restored from an archived backup).
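To make the difference between exact timestamp matching and a "not modified since" comparison concrete, here is a small Python sketch using the dates from this thread. The `exact` and `before` mode names are borrowed from nginx's `if_modified_since` directive, whose default is understood to be `exact`; the function itself is hypothetical and only models the comparison.

```python
from email.utils import parsedate_to_datetime


def not_modified(last_modified: str, if_modified_since: str, mode: str = "exact") -> bool:
    """Decide whether a 304 would be returned under the given matching mode."""
    resource = parsedate_to_datetime(last_modified)
    since = parsedate_to_datetime(if_modified_since)
    if mode == "exact":
        # Only an identical timestamp counts as "not modified".
        return resource == since
    if mode == "before":
        # Any resource time at or before the request date counts.
        return resource <= since
    raise ValueError(f"unknown mode: {mode}")


last_modified = "Thu, 24 Apr 2025 12:08:51 GMT"
synthesized = "Fri, 25 Apr 2025 10:49:13 GMT"  # date taken from a local clock

print(not_modified(last_modified, last_modified, "exact"))   # echoed validator
print(not_modified(last_modified, synthesized, "exact"))     # synthesized date
print(not_modified(last_modified, synthesized, "before"))
```

A client that echoes the stored Last-Modified value verbatim gets a match in both modes, which is why the RFC calls that behavior the most interoperable.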

michaelnordmeyer avatar Apr 25 '25 12:04 michaelnordmeyer

Oh, sorry, I forgot to comment on this:

(I'm also curiously getting different E-Tags for the same endpoint when I run the search engine's feed fetcher, though I'm seeing the same behavior there, it works with just the etag, but not with both fields populated.)

Well, at least the curl requests you did above don't have gzip turned on. They're missing the --compressed parameter.
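As a hedged illustration of why the ETags can differ: nginx's default ETag is understood to be the file's modification time and size, both in hexadecimal. The Python sketch below decodes the ETag from this thread under that assumption; `decode_nginx_etag` is a hypothetical helper, not part of any library.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def decode_nginx_etag(etag: str) -> tuple[datetime, int]:
    """Split an ETag of the assumed form "<hex mtime>-<hex size>"."""
    mtime_hex, size_hex = etag.strip('"').split("-")
    mtime = datetime.fromtimestamp(int(mtime_hex, 16), tz=timezone.utc)
    return mtime, int(size_hex, 16)


mtime, size = decode_nginx_etag('"680a29d3-21a6d"')
print(format_datetime(mtime, usegmt=True))  # Thu, 24 Apr 2025 12:08:51 GMT
print(size)                                 # 137837
```

The decoded time equals the last-modified header shown earlier, which supports the assumed format. If the pre-gzipped file is served instead, its different size would yield a different ETag, and every rebuild that changes the file's mtime would change the ETag as well.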

michaelnordmeyer avatar Apr 25 '25 12:04 michaelnordmeyer

Hm, yeah, the feed fetcher is synthesizing the date based on the local clock, with the idea that since it won't revisit until at least a day later, clock skews shouldn't matter. That might cause problems for servers that do an exact match against the mtime of the file itself (as seems to be the nginx default behavior), though it shouldn't matter with an etag present.

vlofgren avatar Apr 25 '25 12:04 vlofgren

[..] By the way, because the site is built by Jekyll, the feed.xml will be regenerated every time I push a change. And even if the feed's content didn't change, the <updated> tag of the feed always will, because it's the build date, [..]

I have a fix for this at https://github.com/jekyll/jekyll-feed/pull/368. If anyone knows someone in the Jekyll community that can land this, that'd be great!

Krinkle avatar Aug 28 '25 00:08 Krinkle