varnish-cache POC: New beresp.error VCL variable

Following several VDD discussions, this patch series shows how getting a backend error could look like in VCL, and how it could be articulated in the core code.

This should check the boxes from the last consensus:

a plain text error message
a new VCL variable to get the error (beresp.error)
VCL constants for safe comparisons (error.*)

The first two commits are unrelated, I found a bug in libvcc while I was working on this. I will submit a pull request once I'm done with that bug (only partly fixed here) so the relevant commits to review are the ones starting with "POC".

If we still have consensus on this approach, I will find someone™ to submit something comprehensive.

Apr 29 '24 12:04 dridi

Bugwash summary:

Turn errors into a struct vrt_error { const char *type, *msg };
- @nigoroll wanted to categorize errors (example: "backend timeout")
- I suggested an (ERROR).type VCL attribute
- It coud also be a list of tags (not brought up during bugwash)
Turn beresp.error into a proper ERROR
Teach $Error to vmods (add vmod_debug coverage)
Error messages from the documentation should be logged (I meant to initially and forgot)

Then we reevaluate the error reporting.

May 06 '24 15:05 dridi

I wanted to revisit this ticket this morning and noticed that I forgot about the libvcc bug. I pushed 0c8448e0ef...b77b709510 to fix it.

Jul 09 '24 16:07 dridi

bugwash label removed because this ticket needs an update

Sep 23 '24 13:09 nigoroll

New proposal:

struct vrt_error {
        const char *msg;
        const char *tags; /* comma-separated list of VCL identifiers */
};

Can be produced by a VMOD and put in context, to be consumed from vcl_backend_error (and probably vcl_synth too). The error can be null, like a header.

VCL defines new variables:

bereq.error

	Type: ERROR

	Readable from: vcl_backend_error

	TODO: description (has a tostring yielding `msg` or `NULL`).

bereq.error.*

	Type: BOOL

	Readable from: vcl_backend_error

	TODO: description, tells whether the error has an arbitrary tag.

We gradually start defining ERROR constants in VCL, as we increase our bereq.error coverage for at least faults happening in core code:

error.connection_refused

	Type: ERROR

	Error message: Connection refused by the backend

	Tags: backend, connect

	TODO: description (maybe mention ECONNREFUSED etc)

The error.* symbols are implicitly read-only and unrestricted in scope, in other words global constants.

In VCL you could do something like this:

sub vcl_backend_error {
        if (!bereq.error.backend || bereq.error == error.connection_refused) {
                return (retry);
        }
}

Likewise, VMODs learn to define errors in their VCC file:

$Error name, "msg", tag, tags... # XXX: implicit tag with the name of the VMOD?

And you can use them also:

import foo;

sub vcl_backend_error {
        if (bereq.error.foo && bereq.error != foo.bar) {
                return (abandon);
        }
}

edit: changed all beresp references to bereq to bind the error to the task (and only [be]req is available for the whole task, so it could be extended beyond backend_error/synth).

Oct 14 '24 09:10 dridi

New proposal:

That's nice! Can you please illustrate the use of tags?

Nov 07 '24 18:11 nigoroll

That's nice! Can you please illustrate the use of tags?

The idea of tags is to classify errors and allow operating on classes of errors instead of individual ones. We should be conservative about tags, as we can easily add them, but removing is a breaking change.

In VCL you could do something like this:

sub vcl_backend_error {
        if (!bereq.error.backend || bereq.error == error.connection_refused) {
                return (retry);
        }
}

You can read this example as "I'm retrying the backend fetch if the error was not related to a backend, unless it was specifically a connection refused".

I suppose it would be more accurate like this for the return (error) case:

sub vcl_backend_error {
        if (bereq.error && (!bereq.error.backend ||
            bereq.error == error.connection_refused)) {
                return (retry);
        }
}

In summary, multiple errors may share one or more tags.

Tag ideas:

backend
timeout
proto
workspace

Something we can figure out as we start populating errors coming from core code.

Nov 08 '24 07:11 dridi

@dridi I still do not get what is the tag in the vcl example. Can you please illustrate?

Feb 03 '25 14:02 nigoroll

VDD:

General agreement that we should improve:
- clarity of the log
- visibility from vcl
categories (is "tag" a good name?) should be available from vcl and also be reflected in the log
should counters reflect the same errors (might be too many)? going back to VSL for details should be ok
add error codes with cross-reference to documentation?

please think about the details and make a concrete proposal

which categories do we want? (with operational experience)
How will it look like in vcl and vsl? Keep strings or move to IDs?
- VSL: gate behind a feature for backwards compatibility?
check if we need something similar for the client side (vcl_synth)

May 27 '25 08:05 nigoroll

categories (is "tag" a good name?) should be available from vcl and also be reflected in the log

I meant "tag" in the taxonomy sense. IMO tags shouldn't appear in logs, the whole idea in my proposal is to have the following moving parts:

a list of errors available in the vcl-var manual
- read-only symbols such as error.connection_refused
- an error message
- a list of tags
- a description
the possibility for VMODs to expose their own errors
- vmod_foo could expose its own foo.connection_refused
the error message used in logs
- cross-reference error message in manual
- akin to SLT_Notice, but for errors

I don't think we should log tags.

General agreement that we should improve:
* clarity of the log

* visibility from vcl

With the moving parts I'm suggesting, the result should be something like "FetchError Connection refused by the backend" and looking for "Connection refused by the backend" I should be able to find error.connection_refused and its associated tags.

In general, once we start populating error messages, we should be better off than we currently are with the generic error messages we get in some places (like "FetchError HTC idle (3)") and improve log clarity.

To add to the moving parts from above, there are more VCL symbols:

new req.error and bereq.error variables
- pointing for example to foo.connection_refused
- [be]req.error evaluates to the error message
- [be]req.error.* is a wildcard BOOL symbol evaluating tags

should counters reflect the same errors (might be too many)? going back to VSL for details should be ok

Ideally, we should map VCL errors to existing errors whenever applicable. The new error.* read-only symbols declared in the vcl-var manual offers a natural anchor to mention related counters.

add error codes with cross-reference to documentation?

In the model described above, we have error names instead of error codes. I think we should avoid having both.

That being said, maybe errors could define an HTTP status when applicable. For example, connect_timeout could map to 504 (Gateway Timeout). I didn't think of that before now to be honest.

which categories do we want? (with operational experience)

I think we should be very conservative about tags/categories. We can add more without breaking existing VCL, but we can't remove them. Once possibility could be to start with no or next to no tags (but validate the wildcard model with vmod_debug) and migrate as many core errors as possible to VCL-defined errors before starting a taxonomy.

How will it look like in vcl and vsl? Keep strings or move to IDs?

I think we should log the verbatim error messages, as they will be more helpful in a troubleshooting situation.

We leave the documentation indirection to "learning more about this error" as a second step. For example to learn more about its precise meaning, monitoring options, and potential remediation suggestions.

VSL: gate behind a feature for backwards compatibility?

I'd rather not.

check if we need something similar for the client side (vcl_synth)

Yes we do, my suggestion is to anchor errors to the task, so either req.error or bereq.error. It's what I suggested in https://github.com/varnishcache/varnish-cache/pull/4097#issuecomment-2410520172.

In a private thread, I went a bit further and suggested that an error can occur any time, and shouldn't be limited to vcl_synth or vcl_backend_error:

import geoloc; # defines an error with a bad_ip tag

sub vcl_recv {
        if (!geoloc.ate(client.ip) && req.error.bad_ip) {
                # do something
        }
}

That should be the last bit to clarify so far.

Jun 02 '25 17:06 dridi