varnish-cache condfetch: Skip revalidation of an invalidated stale

If we can catch a dying stale_oc before initiating a condfetch, we can fall back to a regular fetch instead. If we catch it too late, when we are ready to merge headers, we transition to vcl_backend_error and get a chance to retry.

This avoids a race between a condfetch and the invalidation of its stale objcore, covered with a purge in test cases.

Mar 12 '24 18:03 dridi

@nigoroll @bsdphk: I think the first commit should be merged before the release.

Mar 13 '24 09:03 dridi

@hermunn brought an interesting point regarding bans. We check the stale object against newer bans during lookup, but in the absence of another lookup or if the ban lurker doesn't get a chance to evaluate post-lookup bans (and the default ban_lurker_age almost guarantees that) the stale object will not be invalidated.

To get the discussion started I pushed what could look like evaluating bans before and after 304 revalidation. I will turn this pull request into a draft.

Mar 15 '24 17:03 dridi

I understand that Dridi prepared these patches in a rush, so once the dust has settled, I would appreciate better explanations in the commit messages.

Regarding the actual issue here, I am not convinced yet that failing with a fetch error for the race is the best option (it might be, but I would like to think about it). There will probably be many sites for which the current behavior is just fine, and who would start to see errors unless they added the return(retry), so we should at least add this to the builtin vcl and docs.

If someone issues a PURGE, they expect that the respective object go away and not survive indirectly through a revalidation. The current suggestion to fail to v_b_e {} and then retry under VCL control has the advantage of replacing the busy object and thus not delivering the corrupted stale content. But we could also just accept the revalidated object and deliver it (once), also implicitly purging it. Or we could do the retry internally without VCL involvement.

Again, not sure yet.

Mar 18 '24 10:03 nigoroll

bugwash:

go back to the original scope of the pull request
- leave post-lookup ban checks out for now
add 304 retry in built-in vcl_backend_error
docs

Mar 18 '24 14:03 dridi

Updated as per bugwash consensus.

Mar 19 '24 11:03 dridi

@AlveElde brought up an interesting point in an offline discussion. An implicit retry from the built-in VCL is not desirable since there is no guarantee that vcl_backend_fetch is idempotent.

I added at the beginning of this patch series two commits that:

split the startfetch step in two (prepfetch, then what remains for startfetch)
introduce return (retry(fetch)) and return (retry(task)) transitions

The return (retry(task)) one is present for completeness and is otherwise equivalent to return (retry). In all cases a retry increments the retries counter and is limited by the max_retries parameter.

This allows an effective retry at the actual step where we initiate the backend request, skipping vcl_backend_fetch and staying in the same transaction. The commit introducing this variation on retries is minimalist, lacking coverage and documentation.

On the other hand, the original commits from this pull request were updated with their documentation and coverage adjusted.

Jun 18 '24 16:06 dridi

I agree with the points you raise. I still think that an explicit retry(fetch) action would be very useful outside of the built-in VCL to enable retries without tearing down privs for the task, and without going through a potentially-non-idempotent vcl_backend_fetch.

It might be better to have two variables you can check in VCL, one which includes these short false starts, and one which does not.

Or maybe make this opt-in, without a detour via vcl_backend_error?

sub vcl_backend_fetch {
        set bereq.retry_dying_304 = true;
}

No rug pull, no PRIV_TASK loss, no required attention to vcl_backend_fetch idempotency.

Jul 05 '24 16:07 dridi

I think the new VCL variable retry_dying_304 is a good idea. We might also consider documenting that from September, this will/might be default true, so that everyone should make sure to set this in their own configuration (VCL) if they care.

Also, I see your point with return (retry(fetch)) in general, as many users simply want to set a new backend and retry without any sub vcl_backend_fetch logic being relevant. However, if we go with the retry_dying_304 here, the question about return (retry(fetch)) might be better discussed in a different PR.

Jul 08 '24 07:07 hermunn

I'm wondering to which extent we could actually tie this to vcl_backend_refresh instead of vcl_backend_error, having this return (retry(fetch)) in the built-in VCL once we move some of the core code handling 304s to VCL.

See https://github.com/varnishcache/varnish-cache/pull/3994 for more details.

That would make the bereq.retry_dying_304 idea irrelevant.

edit: and this could not qualify as a breaking change in a new subroutine for which there is no preexisting VCL.

Jul 08 '24 08:07 dridi

Overtaken by https://github.com/varnishcache/varnish-cache/pull/3994

Sep 02 '25 14:09 walid-git

condfetch: Skip revalidation of an invalidated stale_oc