varnish-cache icon indicating copy to clipboard operation
varnish-cache copied to clipboard

rfc: A storage engine for synth responses

Open nigoroll opened this issue 9 months ago • 3 comments

Context: This proposal partly replaces #4341, but has been written to stand on its own.

The basic steps for generating synthetic response bodies from vcl_synth{} currently are (as of 033d1d45ef72b806e6bd5ba73cacb1ea9835ec8f):

Via this code path, we end up making two copies of the arguments to VRT_l_xyz_body(). There is a similar code path to handle vcl_backend_error {}, which differs in that the second copy is to a stevedore object which potentially can be longer lived (represent a "synthetic cache object"), and is likely to outlive the scope of the backend request.

So there is optimization potential here to save two (for vcl_synth{}) and one (for vcl_backend_error{}) copies of the body data, respectively.

To optimize both cases, I propose to do the following:

  • collect the VRT_l_xyz_body() arguments in arrays of io vectors. The required data structure has already been proposed as VSCARAB in #4209
  • For vcl_backend_error{}, copy the io vectors to the storage object, basically identical to now, but saving one copy
  • For vcl_synth{}, implement a trivial stv_synth stevedore, which takes the arrays of io vectors via a private interface and later "outputs" them via the object iterator callback.

The proposed stv_synth is safe because, all arguments ever passed to VRT_l_xyz_body() have at least PRIV_TASK lifetime (see also #4269 for a similar optimization based on the same insight).

The net effect of this proposal, if implemented, is to save two memory copy operations for body data created in vcl_synth{} and save one memory copy operation for body data created in vcl_backend_error{}.

nigoroll avatar Jun 03 '25 07:06 nigoroll

FYI; I ran some trivial benchmark with perf profiling and the SMA locks are the main contender when we remove the private_oh bottleneck.

This is while running wrk -c 1000 -d 60 -t 100 http://127.0.0.1:8080 against

sub vcl_synth {
	set resp.body = "42";
	return (deliver);
}

with default storage:

$ sudo perf report --stdio -g --call-graph=folded,1 --symbol-filter=CNT_Request | sed 's:;__.*::'
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 25K of event 'cycles:P'
# Event count (approx.): 821420111031
#
# Children      Self  Command       Shared Object  Symbol         
# ........  ........  ............  .............  ...............
#
    92.63%     0.15%  cache-worker  varnishd       [.] CNT_Request
10.92% CNT_Request;cnt_synth;STV_NewObject;SML_allocobj;sma_alloc;Lck__Lock
10.71% CNT_Request;cnt_synth;ObjGetSpace;sml_getspace;objallocwithnuke;sml_stv_alloc;sma_alloc;Lck__Lock
10.58% CNT_Request;cnt_transmit;V1D_Deliver;VDP_DeliverObj;ObjIterate;sml_iterator;sml_stv_free;sma_free;Lck__Lock
10.35% CNT_Request;cnt_finish;HSH_DerefObjCore;ObjFreeObj;sml_objfree;sml_stv_free;sma_free;Lck__Lock
7.94% CNT_Request;cnt_synth;ObjGetSpace;sml_getspace;objallocwithnuke;sml_stv_alloc;sma_alloc;Lck__Unlock
7.94% CNT_Request;cnt_synth;STV_NewObject;SML_allocobj;sma_alloc;Lck__Unlock
7.79% CNT_Request;cnt_transmit;V1D_Deliver;VDP_DeliverObj;ObjIterate;sml_iterator;sml_stv_free;sma_free;Lck__Unlock
7.49% CNT_Request;cnt_finish;HSH_DerefObjCore;ObjFreeObj;sml_objfree;sml_stv_free;sma_free;Lck__Unlock

So I am going to go ahead with the implementation for the main reason that the "workspace stv" is going to be lock free, and I think that the benefit of not copying will actually be less relevant.

If anyone has more feedback, you can still stop or encourage me ;) .

nigoroll avatar Jun 23 '25 14:06 nigoroll

The required data structure has already been proposed as VSCARAB in Prepare Varnish-Cache for the age of AI #4209

You may want to rename this one, my first read led to an unfortunate segmentation fault.

If anyone has more feedback, you can still stop or encourage me ;) .

The thing I don't understand from prior discussions is why we can't use storage for synthetic responses just like we do with regular fetches. Sharing the same infrastructure between "natural" and synthetic objects would make many things much simpler, like filter support for example.

We can already retry a fetch on the same objcore. Granted, this retry can only happen before we start storing bytes, but I don't see why we couldn't pay the cost of undoing a partial pseudo-fetch if our VCL contains this:

set beresp.body = "initial value";
set beresp.body = "changed my mind";

We can even use SLT_Notice to inform users paying attention to logs that they run into this VCL anti-pattern (in addition to mentioning the pitfall in regular docs).

For the general case, this should be fine:

set resp.body = "initial value";
set resp.body += ", and additional data";

In vcl_synth, we would populate storage directly, and then pass that through a regular delivery pipeline (with the ability to gzip for example).

In vcl_backend_error, we do have one kink for which I can only see a VRT failure, and an additional challenge. As we are populating beresp.body, the bytes should go through a fetch pipeline (why not synthetic ESI?) before landing in storage. This means that changing beresp.filter when beresp.body is not empty should result in an error. The other challenge is the need for re-entrance for the fetch pipeline (which may already be possible, but I don't think so off the top of my head).

I think the first thing to do would be an attempt to turn synthetic objects into regular ones, and assess what we get from a unified storage and pipeline infrastructure for all objects. I'd rather try that first before attempting an even more specialized implementation.

dridi avatar Jun 23 '25 14:06 dridi

The thing I don't understand from prior discussions is why we can't use storage for synthetic responses just like we do with regular fetches.

The current proposal is to use storage for synthetic responses, just that for the client side (vcl_synth {}) use case, the object would not get created by copying data, but rather by just referencing it.

Setting attributes is easy, because ObjSetAttr() passes the source pointer. But writing to the Body requires copying: ObjGetSpace() provides a memory region to write to, hence a memcpy() is inevitable.

So for the client case only, I would like to create a special-purpose storage engine with a private interface which allows to pass in the extents / iovecs to the body data.

Sharing the same infrastructure between "natural" and synthetic objects would make many things much simpler, like filter support for example.

All other aspects of the storage API would remain untouched, in particular ObjGetAttr() and ObjIterate() (or the new VAI variant for that matter). So yes, running a filter for synth bodies remains possible as much as it is possible today.

NB, all this was regarding the client side.

Regarding the backend side, I think you have a point. I thought it would pay off to use the same "synth api" as for the client side, but this does not make sense if we ever want to "push the body through filters". And for that matter, on the backend side the performance gains are less relevant anyway.

We can already retry a fetch on the same objcore. Granted, this retry can only happen before we start storing bytes, but I don't see why we couldn't pay the cost of undoing a partial pseudo-fetch if our VCL contains this:

set beresp.body = "initial value"; set beresp.body = "changed my mind";

Sure, we can also give up on the busy object and start again. This will at least save one copy.

In vcl_backend_error, we do have one kink for which I can only see a VRT failure, and an additional challenge. As we are populating beresp.body, the bytes should go through a fetch pipeline (why not synthetic ESI?) before landing in storage. This means that changing beresp.filter when beresp.body is not empty should result in an error. The other challenge is the need for re-entrance for the fetch pipeline (which may already be possible, but I don't think so off the top of my head).

Yes. Maybe we should actually do the filter stuff before anything else.

I think the first thing to do would be an attempt to turn synthetic objects into regular ones, and assess what we get from a unified storage and pipeline infrastructure for all objects. I'd rather try that first before attempting an even more specialized implementation.

I would not want to fully commit to exactly this, but in general, I tend to agree.

nigoroll avatar Jul 01 '25 17:07 nigoroll