
RFC: backend fetch scheduling questions

Open dridi opened this issue 5 years ago • 9 comments

Based on recurring observations from production setups, I would like to bring up a few discussion points and see where we'd like to go with them. I'll take the liberty of expressing my personal opinion on each individual point.

Prevent waiting list serialization

Setting a zero TTL on a cacheable beresp is a common source of significant performance degradation, usually leading to a pile-up of requests landing in a waiting list. One common mistake is to set the TTL to zero to express the intent of not caching the object (when this lack of TTL does not come from the backend) and to return early from vcl_backend_response for some unclear reason (often copied from a code snippet found online).
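A minimal sketch of that anti-pattern, with a hypothetical URL condition standing in for whatever triggers the "don't cache" decision:

    sub vcl_backend_response {
        if (bereq.url ~ "^/private/") {  # hypothetical trigger
            set beresp.ttl = 0s;         # meant as "don't cache"
            return (deliver);            # but the object stays cacheable with a
                                         # 0s TTL, so concurrent requests for it
                                         # serialize on the waiting list
        }
    }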

The idea would be to force a last-resort hit-for-miss outcome for a hard-coded duration when control goes back to core code.
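For reference, an explicit hit-for-miss in VCL today produces the outcome this proposal would force (the 120s duration is illustrative):

    sub vcl_backend_response {
        if (bereq.url ~ "^/private/") {     # hypothetical trigger
            set beresp.uncacheable = true;  # hit-for-miss: bypass the waiting list
            set beresp.ttl = 120s;          # remember the decision for 2 minutes
            return (deliver);
        }
    }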

I'm :+1:

Cap the waiting list size

Either a global count or something per object head; that hasn't been discussed in great detail. Once the waiting list is full, we stop queuing requests and they simply fail.

I'm having a hard time figuring out which side of the fence I'm on: on one hand :+1: I think we should avoid unbounded queues, and on the other :-1: I'm worried that traffic spikes would needlessly suffer from such a cap.

Waiting list timeout

It should be self-explanatory: don't stay longer than a certain limit in the waiting list. Once a request times out, should it fail or reenter lookup with hash_always_miss set? That hasn't been discussed in great detail yet.
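For context, hash_always_miss is an existing request flag; a timed-out waiter reentering lookup with it set would behave like this refresh pattern (the trigger header is hypothetical):

    sub vcl_recv {
        if (req.http.X-Force-Refresh) {       # hypothetical trigger header
            set req.hash_always_miss = true;  # force a miss at lookup
        }
    }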

I'm :-1: based on a feeling that this would create a major performance headache, having recently worked in that area.

Waiting list rush limit

When a waited fetch turns out not to be cacheable, we gradually wake up request tasks, but sometimes this can lead to a slew of traffic and bring a backend down. If it is even possible, it would be nice to have a rush_limit parameter.

I'm generally :+1: but unsure regarding feasibility.

Connection pool timeout

When we set a non-zero max_connections on a backend, the outcome becomes an instant error the moment this threshold is reached, making it an awkward feature. It might help to conditionally wait for backend_pool_timeout seconds when max_connections is in effect. It doesn't have to be a mechanism like the waiting list; why not simply block the backend task?
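The current cliff, for reference (host and limit are placeholders):

    backend be {
        .host = "192.0.2.10";
        .port = "8080";
        .max_connections = 50;  # today the 51st concurrent fetch fails
                                # immediately instead of waiting briefly
                                # for a connection to free up
    }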

I'm :+1:

dridi avatar Nov 16 '20 16:11 dridi

I think this is tricky because there are a handful of situations which will end up with 0s + uncacheable, some done in error and others possibly done on purpose (maybe without understanding the consequences).

If we consider a scenario where your backend randomly throws 500 errors on occasion, then 0s + uncacheable might actually make sense. This delivers the error to a single client, limiting how many clients see it. It also limits backend traffic. If we do option #1, it's almost like we are saying, "give me an error and now I send a lot of traffic back to you, possibly making your problems worse".

If the problem is that request serialization can cause Varnish to build up a ton of clients, then I feel the solution is for Varnish to shed those clients in a responsible way. Waiting list capping and timeouts allow for that while following the intent of the VCL.

rezan avatar Nov 16 '20 17:11 rezan

First reaction (I have a strong feeling that I might be overlooking something and might need to add more later):

I have also seen the ttl=0 case in production and it actually is a major PITA. In my mind, one major contributor to the current situation is that before HFM, ttl=0 was the only way to avoid HFP.

Prevent waiting list serialization

I think we need to add another complication: we also run into this case when the object arriving from the backend is already expired, even with ttl > 0s (e.g. Cache-Control: max-age=30 with Age: 31).
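In that example the effective TTL is already negative on arrival (30s of max-age minus 31s of age), so the backend declared the object cacheable yet it expires the moment it enters the cache. A hypothetical guard illustrating the case:

    sub vcl_backend_response {
        if (beresp.ttl <= 0s && beresp.http.Cache-Control ~ "max-age") {
            # stale on arrival: Age already exceeds max-age
            set beresp.http.X-Stale-On-Arrival = "1";  # hypothetical marker
        }
    }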

The idea would be to force a last-resort hit-for-miss outcome for a hard-coded duration when control goes back to core code.

When we briefly touched on this topic on IRC recently, my immediate reaction was to pursue this option. But I now think we actually need to differentiate the two cases: would the object be cacheable with "no ttl", or would it actually be uncacheable? In the above case, I think we should rather deliver that one stale copy to all waiting requests.

So would the following be a better option?

  • turn ttl == 0s into ttl = 1m HFM (1m being symbolic for the time to be coined; see the sketch after this list)
  • upon waiting list lookup, check whether the new object is HFM/HFP and, if not, just deliver it no matter how stale?
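A rough VCL approximation of the first bullet (the second needs core-code support, since VCL cannot see the waiting list); the 1m value is the symbolic duration from above:

    sub vcl_backend_response {
        if (beresp.ttl == 0s) {
            set beresp.ttl = 1m;            # symbolic duration, to be coined
            set beresp.uncacheable = true;  # hit-for-miss instead of
                                            # waiting list serialization
        }
    }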

Also, we should think about the internal ttl code again: we use ttl < 0 internally to signify "not cacheable", but never expose negative TTLs to VCL.

Cap the waiting list size

I heavily tend towards :-1: on this. I think we should address the root cause rather than spitting out errors when we fail to properly handle requests.

Waiting list timeout

:-1: for the same reason.

Waiting list rush limit

I can see a point here if the waited-for object actually is uncacheable. rush_exponent makes a lot of sense to achieve exponential client request dequeuing with staggering, but it is wrong for the uncacheable case.

So :+1: on the overall idea of having a different rate limitation for the uncacheable case, but I am not sure yet whether the rush limit is exactly the best option (see below).

Connection pool timeout

I think if we were to have something like this, we should combine it with a queue. We might even be able to unload the threads. That said, for the timeout I favor the generic idea of a total (be)req timeout, which @Dridi and I already wrote up together with a larger proposal for a timeouts overhaul.

nigoroll avatar Nov 16 '20 17:11 nigoroll

sorry, I had hit the wrong button

nigoroll avatar Nov 16 '20 17:11 nigoroll

OK, I think I may have figured out our problem here. beresp.uncacheable is being used for two very different scenarios:

  • Private objects
  • Error responses

Looking back at some of the issues we have seen, problems arise when people use uncacheable with errors. This kind of makes sense: people either skip the TTL, return early, and get an implicit 0s TTL, or they don't understand that errors need to be cached and explicitly set a 0s TTL.

So the solution is to split these use cases apart by introducing something like:

beresp.is_error

This would almost be a do-nothing variable, in that its main purpose is to keep people from reaching for uncacheable. Semantically, is_error would be the same as any cacheable object. It could mirror the TTL logic used in vcl_backend_error: it defaults to a 0s TTL unless there is a waiting list, in which case it gets a small TTL. This could be set before vcl_backend_response by checking whether beresp.status >= 500, and VCL would have the final say.
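A hypothetical usage sketch; beresp.is_error does not exist in Varnish today, and the values are illustrative:

    sub vcl_backend_response {
        if (beresp.status >= 500) {
            set beresp.is_error = true;  # proposed variable, not existing VCL:
                                         # cacheable-error semantics
            set beresp.ttl = 1s;         # short TTL to absorb a waiting list rush
            return (deliver);
        }
    }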

Anyway, there is a slight problem here: error responses are sometimes private objects, in that they contain IP addresses, stack traces, etc. So the whole idea that errors can be cached is not universal. It might be worth investigating whether error responses normally come with Cache-Control.

rezan avatar Nov 16 '20 19:11 rezan

You can cache synthetic responses from vcl_backend_error given a positive TTL, your assumption is broken.
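For example, this works today (the status and TTL values are illustrative):

    sub vcl_backend_error {
        set beresp.status = 503;
        set beresp.ttl = 5s;  # a positive TTL caches this synthetic error
        synthetic("backend is unavailable");
        return (deliver);
    }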

dridi avatar Nov 16 '20 20:11 dridi

You can cache synthetic responses from vcl_backend_error given a positive TTL, your assumption is broken.

Not sure I understand. The context here is error responses from a backend in vcl_backend_response...

rezan avatar Nov 16 '20 20:11 rezan

@Dridi, @rezan, I think there's a confusion between the HTTP "error" sent by the backend and the one generated by Varnish when it can't connect to the backend. If I'm correct, beresp.is_error is wrong because errors and private objects don't form a partition of the response space (you can have private errors).

Just a note on the "Waiting list timeout" point, purely as a way to error out of a transaction: I don't care much about enforcing an HfM once the timeout triggers, but I do like the ability to send a synthetic response, notably to avoid losing the connection on H/1 requests because of a client timeout.

gquintard avatar Nov 16 '20 23:11 gquintard

100% with @gquintard here, beresp.is_error is a no-go.

One could argue that being on the waiting list should be subject to the first_byte_timeout.

Problem is, we don't know the number, precisely because we were shunted onto the waiting list and never reached vcl_backend_fetch{}.
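For reference, the backend-level default exists, but the effective per-fetch value can be overridden in vcl_backend_fetch, which a waiting-list request never reaches (the values are placeholders):

    backend be {
        .host = "192.0.2.10";
        .first_byte_timeout = 15s;          # per-backend default
    }

    sub vcl_backend_fetch {
        set bereq.first_byte_timeout = 5s;  # per-fetch override that a
                                            # waiter never gets to set
    }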

bsdphk avatar Nov 30 '20 09:11 bsdphk

Ya, my original proposal was to add timeout_waitinglist and, when a request hits it, move it to vcl_synth with an error response status. We could also add max_waitinglist and, when that's hit, move directly to vcl_synth. I believe we had a framework in place to signal all the different internal errors; not sure what happened with that?
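A hypothetical sketch of the client-facing side of that proposal; timeout_waitinglist and max_waitinglist do not exist today, and the 503 mapping is an assumption:

    sub vcl_synth {
        # assumes a waiting-list timeout or overflow lands here with a 503;
        # none of this is current Varnish behavior
        if (resp.status == 503) {
            set resp.http.Retry-After = "5";  # ask the client to back off briefly
            return (deliver);
        }
    }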

rezan avatar Nov 30 '20 14:11 rezan