varnish-cache icon indicating copy to clipboard operation
varnish-cache copied to clipboard

Stash objcore references until the end of the task to avoid copies

Open nigoroll opened this issue 1 year ago • 22 comments

This proposal was motivated by #3768, which is about avoiding to make copies of constant strings by special casing. This PR does not yet include one additional detail from #3768 [^1], but it solves the underlying root cause.

Context

Any reference handled in VCL needs to have at least PRIV_TASK lifetime. We notoriously shied away from formalizing this definition, but it is a factual consequence from how workspaces work. Rollbacks reset the respective task's workspace and thus finalize all PRIV_TASKs.

One way of ensuring PRIV_TASK lifetime is by copying the referenced value (usually a string) to the task's workspace, and we do this today.

Yet this is wasteful, because static strings from VCL and pointers to memory on the heap already outlive the task lifetime.

The only objects which did not already have PRIV_TASK lifetime were attributes from objects, because object references got returned before restarts. b92.vtc illustrates this case.

For this reason and this reason only do we currently copy all strings to the respective workspace.

Avoid copies by giving object references task lifetime

This proposal gives object references the same lifetime as PRIV_TASK and removes the then unnecessary workspace copies. As a side effect, it also solves the case for #3768 [^1], because this also avoids most copies of static strings to workspace.

In the past, an argument had been made (IIRC by @mbgrydeland) that keeping object references until the end of the task would increase their lifetime by too much, but restarts in VCL really should be done within milliseconds in most cases - and if keeping references is an actual problem in specific situations, those can be avoided by either not restarting or rolling back also. In general, until now we have charged all Varnish-Cache users with the cost for specific use cases, but we should rather only charge the specific use cases instead.


[^1]: When we set an existing header to a new header with the same name as in set req.http.name = resp.http.name, we currently create a new HEADER on the workspace. This could be avoided by allowing a HEADER argument to some new SetHdr() variant similarly to #3768.

nigoroll avatar Feb 11 '25 12:02 nigoroll

A restart occurring after vcl_miss can take orders of magnitude longer than milliseconds.

For this scenario to be relevant, it would need to be a return(restart) from sub vcl_deliver {}, which then ends up being a miss. But yes, I am aware and I was specifically referring to this argument in the last paragraph of the initial comment.

I think objcores are resources we should always try to hold onto sparingly

What would be the convincing argument for that?

I brought my case, so if you disagree, what is yours?

nigoroll avatar Feb 11 '25 13:02 nigoroll

But yes, I am aware and I was specifically referring to this argument in the last paragraph of the initial comment.

And I quoted the last paragraph, except the last sentence I don't understand.

The sentence is:

In general, until now we have charged all Varnish-Cache users with the cost for specific use cases, but we should rather only charge the specific use cases instead.

My point is: This patch makes the common case cheaper.

This proposal gives object references the same lifetime as PRIV_TASK and removes the then unnecessary workspace copies. As a side effect, it also solves the case for #3768, because this also avoids most copies of static strings to workspace.

It was a quick glance, but I don't see how ensuring an at-least-task lifetime to objcores solves anything for static strings.

Without this patch, static strings need to be copied always. Without this patch, the set HEADER = HEADER optimization is not possible in the general case.

In #3768 I added a new function to the runtime that sets a header with a string guaranteed to outlive the task (user's responsibility) and taught libvcc to better keep track of constant expressions to favor the new function when possible.

I think I understand what you are doing, and I think we should first have a more general improvement. We should first make sure that every reference in VCL at least has the same lifetime as PRIV_TASK, then we can avoid copying no matter the source. Your set HEADER = HEADER optimization makes sense (see footnote of the initial comment), but we do not need to mark vcl statics constant, because they are just a special case of "at least PRIV_TASK lifetime".

I would argue that with objcores guaranteed to outlive VCL execution, we could teach libvcc to recognize obj.stuff as a constant expression:

Now you seem confused to me. Also with #3768, set req.http.foo = obj.http.foo can not save a workspace copy unless we also have this PR. And yes, again, a set HEADER = HEADER optimization can save copies for the special case that the header names match, and for statics from VCL we can make it so that they do.

Your change is only affecting core code and it would easily compose with #3768 to extend obj.* "constness" to VCL code

It is not just affecting core code, because vmods suffer from the same copying[^1]. And I have said all the time that it would compose with the set HEADER = HEADER optimization from #3768. The main point is that we do not avoid copying by marking an additional special case, the main point is that we avoid it by making sure that "all things referenced" have PRIV_TASK equivalent lifetime.

(Oh man, I am repeating myself for the fifth time now or so.)

My main concern is that object storage is a more critical resource than workspaces. If we need a lot of workspace, we can lower task concurrency. When storage is saturated and in constant churn (when you eventually reach full capacity) it becomes crucial that dying objects actually go away. Increasing latency here means that your churn throughput will increase, reducing your cache/storage efficiency.

I understand what an object reference implies. Also we should note that for most objects on a typical Varnish-Cache installation, the bulk of object references will be held because of body delivery.

here the main point is that all the scenarios where the reference would be kept longer than before the patch are special cases, which can be avoided by not restarting, or rolling back in addition to the restart. None of these cases is typical, and the argument that for some special case some object would be held for somewhat longer than before the patch really seems not significant.

I had a closer look [...]

So was the first half of your comment from "before you looked closer" and the last paragraph from after?

in the normal case we stash nothing and drop the objcore reference in cnt_finish(), right?

Yes

But I really don't like how the stash is put together. Please also note that dropping the ref in cnt_finish() makes the normal case "not quite task-scoped" but is it good enough? If the answer is yes, then the stash should be cleared alongside.

The stash is only allocated if needed. The call to ocstash_fini() is a noop if stash_oc() was not called. It only gets called for the "exception path", of which restart is most relevant.

[^1]: Actually vmod_objvar probably benefits the most, because taskvar.string replacing use cases of HEADER will then not use any copies for STRING inputs and could be extended to also support STRANDS to avoid even more copying.

nigoroll avatar Feb 11 '25 17:02 nigoroll

The sentence is:

In general, until now we have charged all Varnish-Cache users with the cost for specific use cases, but we should rather only charge the specific use cases instead.

My point is: This patch makes the common case cheaper.

I can parse it now, got it. I generally agree with tradeoffs in favor of the common cases.

Without this patch, static strings need to be copied always. Without this patch, the set HEADER = HEADER optimization is not possible in the general case.

Agreed, with the understanding that "this patch" refers to #3768. But I think you are reading too much into my patch series, see below.

I think I understand what you are doing, [...] Your set HEADER = HEADER optimization makes sense

I don't think I can take credit for the set HEADER = HEADER optimization, or I should look at #3768 to make sure. My optimization was set HEADER = "literal string". Then I suggested marking obj header as constants to treat them like literal strings in libvcc, and then you came up with a generalization (assuming guaranteed task lifetime of objcores).

I would argue that with objcores guaranteed to outlive VCL execution, we could teach libvcc to recognize obj.stuff as a constant expression:

Now you seem confused to me. Also with #3768, set req.http.foo = obj.http.foo can not save a workspace copy unless we also have this PR.

Not confused, right? The idea was assuming "objcores guaranteed to outlive VCL execution", but your set HEADER = HEADER idea would encompass obj headers.

Your change is only affecting core code and it would easily compose with #3768 to extend obj.* "constness" to VCL code

It is not just affecting core code, because vmods suffer from the same copying. And I have said all the time that it would compose with the set HEADER = HEADER optimization from #3768.

We actually agree, this is poor wording on my end. I think both changes would complement each other well and open the door to a new optimization.

So was the first half of your comment from "before you looked closer" and the last paragraph from after?

Hours before the beginning and the end of my review...

The stash is only allocated if needed. The call to ocstash_fini() is a noop if stash_oc() was not called. It only gets called for the "exception path", of which restart is most relevant.

Then considering the workspace gains we can expect, we should avoid the complications of a just-in-time allocation, and certainly not fall back to a heap allocation. I think we should give it the reqtop treatment and simply always make room for the stash.

dridi avatar Feb 12 '25 10:02 dridi

As the rest of the discussion looks like it might be resolved, I will only respond to the last paragraph:

we should avoid the complications of a just-in-time allocation

Again, I wanted to keep impact on the common case minimal.

not fall back to a heap allocation.

This is unlikely and simplifies the calling code.

nigoroll avatar Feb 12 '25 14:02 nigoroll

Again, I wanted to keep impact on the common case minimal.

The impact is already a negative workspace footprint.

This is unlikely and simplifies the calling code.

So does a small systematic allocation (56B by default) during the struct req setup.

dridi avatar Feb 12 '25 14:02 dridi

So does a small systematic allocation (56B by default) during the struct req setup.

It is not like I had not considered this option. My worry is that it will impact users with high max_restarts values substantially, and likely needlessly.

On the malloc fallback, I think it is generally a good idea to not fail in our supporting facilities when we could be in an exception code path.

nigoroll avatar Feb 12 '25 16:02 nigoroll

On the malloc fallback, I think it is generally a good idea to not fail in out supporting facilities.

And I disputed this in a previous comment (https://github.com/varnishcache/varnish-cache/pull/4269#discussion_r1951147079). If we have dedicated allocators in the form of workspaces for tasks, then task allocations should be performed there and respect the configured limits.

It's doable, I already offered to submit such a change.

And then there are the cases where a workspace allocation may not be appropriate (for example gzip_buffer, h2 stream window etc). A VMOD author is still free to allocate PRIV_TASK data from the heap.

dridi avatar Feb 12 '25 16:02 dridi

Sure is the workspace allocation failure handling doable.

The point here is that if we run into the ws overflow at this point, a subsequent request will already fail, and we will induce massive overhead basically everywhere, from handling the error, possibly running into a restart etc. etc. The adminstrator will hopefully notice the ws overflow and do something about it.

But at any rate, doing a heap allocation for this exception path is, I think, offset by the gain in simplicity: Because we might already be on an exception path when we stash an oc, we save ourselves from complicating it further.

Do not get me wrong: Yes, we should use the workspace whenever feasible.

FTR on the other topic: The allocation of the struct vrt_priv is not to be confused with the user controlled (struct vmod_priv).priv member.

nigoroll avatar Feb 12 '25 17:02 nigoroll

But at any rate, doing a heap allocation for this exception path is, I think, offset by the gain in simplicity: Because we might already be on an exception path when we stash an oc, we save ourselves from complicating it further.

The simplest approach is a preemptive allocation of the stash that completely removes the need for just-in-time allocations and gets rid of all allocation-related branches.

Ignoring a failed workspace allocation here is just delaying the actual workspace failure somewhere in vcl_synth for the synth/fail cases (reminder, the built-in vcl_synth makes allocations, but we have a candidate fix for that). In the restart case, there is no point processing an entire task again with an overflowed workspace.

This is actually increasing the distance between the root cause and the symptoms.

Again, reserving a stash upfront:

  • gets rid of branches
  • avoids the worst case scenario a heap allocation
    • and some malloc overhead to perform, track and free the allocation
    • and a potential delayed failure caused by the workspace overflow
  • does not introduce error handling at the call site

I do agree in the dynamic priv case that it brings error handling to the caller. In fact VRT_priv_task() used to be fallible so it would only bring error handling back. It would also bring dynamic privs back to respecting the task's memory budget.

FTR on the other topic: The allocation of the struct vrt_priv is not to be confused with the user controlled (struct vmod_priv).priv member.

That's exactly the distinction I was making. The struct should be allocated from the relevant workspace, the member is up to the VMOD author.

Quoting myself:

A VMOD author is still free to allocate PRIV_TASK data from the heap.

Replace my "data" with your more accurate "(struct vmod_priv).priv member".

:-1: for not noticing the pun :]

I think the simplest actually is:

  • manage stash in cache_req.c
  • allocate stash upfront from mempool (like reqtop)
  • export Req_StashObjcore()
  • clear stash in Req_Cleanup()

Pros and cons:

  • less moving parts, better encapsulation
  • no new error paths
  • fixed tiny workspace overhead
    • paying dividends as soon as you set up a resp for vcl_deliver

And if the stash turns out to have a prohibitive cost[^1] then we can complicate the picture and refactor the allocation to make it just-in-time. The extra care you added is premature optimization at this point.

[^1]: unlike a reqtop systematically allocated upfront even in the absence of sub-requests

dridi avatar Feb 12 '25 18:02 dridi

I disagree that it is the better option, because it charges all users with a workspace allocation which will not be needed in most cases. The struct reqtop case is different, because that struct is of fixed size and we actually do need it always to not add special casing all over the place for vcl0, VCL req.top. and PRIV_TOP. Yet you are right that we allocate it also for ESI subrequests, which would not be needed. I have commented on the Req_Cleanup and cache_req.c suggestions further up.

nigoroll avatar Feb 12 '25 18:02 nigoroll

fixed tiny workspace overhead

It is not fixed. It depends on max_restarts.

paying dividends as soon as you set up a resp for vcl_deliver

No. The straight path does not need it. The common cases where it is needed are return(restart) and return(synth) from vcl_deliver{} and vcl_hit{}.

nigoroll avatar Feb 12 '25 19:02 nigoroll

It is not fixed. It depends on max_restarts.

Sorry, I meant fixed for the duration of the task.

paying dividends as soon as you set up a resp for vcl_deliver

No. The straight path does not need it. The common cases where it is needed are return(restart) and return(synth) from vcl_deliver{} and vcl_hit{}.

I meant that your change will pay dividends with the workspace savings outweighing the upfront allocation even for the common cases.

I have commented on the Req_Cleanup and cache_req.c suggestions further up.

I submitted #4271 to illustrate what I meant about better encapsulation. What leaks out of cache_req.c is minimal.

dridi avatar Feb 13 '25 08:02 dridi

@dridi On the ReqFiniObjcoreStash() / ocstash_fini() / ocstash_clear() call site (called "the cleanup" in what follows to avoid the bikeshed):

I agree that calling it in the middle of Req_Rollback() after the VCL_TaskLeave() is better. I now also agree that calling it in CNT_Request() for the REQ_FSM_DONE case is wrong, because I had overlooked that filters (VDPs) can also have a look at headers from their init callback, and because headers could reference object storage, the cleanup needs to run after the vdp init. But calling it in Req_Cleanup() is also wrong: If we moved the cleanup there, the concern that we held onto oc references for "way too long" became valid, because Req_Cleanup() is called after all of the body processing is done (edit: this aspect is actually even broader, because VMODs could allocate returned HEADER values which become invalid after their free callback...)

nigoroll avatar Feb 13 '25 09:02 nigoroll

I'm not following you here, isn't REQ_FSM_DONE reached from cnt_finish() for the normal object delivery case? In that case we already delivered the body in cnt_transmit().

That's why I said that Req_Cleanup() is called soon after leaving the state machine in a previous comment (https://github.com/varnishcache/varnish-cache/pull/4269#discussion_r1954051048).

dridi avatar Feb 13 '25 10:02 dridi

FWIW I'm fine assuming longer objcore retention for very specific use cases:

  • return restart/synth from hit/deliver steps
  • return pass from hit step

Failing from deliver stashes the objcore for a very short time in #4271 because of the rollback happening before entering vcl_synth, and pushing to the stack is a cheap constant-time fail-safe operation with the upfront allocation. All that to justify why there isn't a "failure in hit/deliver steps" bullet point.

I don't remember ever seeing more than one restart. By far the most common case I have seen is the purge+restart combo, and it's not affected by this change. The second most common case I have seen is a request changed to perform a (usually cacheable) request to check authorizations before delivering privileged cache hits, and since req needs to be restored to perform the original request, a rollback would naturally prevent objcore retention.

I have seen cases where the first authorization request grabs a token so a rollback becomes impossible.

I have never seen a real-world case for a pass-from-hit (yes, intended).

So I'm fine with these changes:

  • objcore stash (this pull request or 4271)
  • skipping the workspace copies (this pull request or 4271)
  • skipping workspace copies for literal strings (3768)
  • skipping workspace copies for set HEADER = HEADER;

dridi avatar Feb 13 '25 11:02 dridi

@dridi thank you for your excellent summary. I am now going to continue here and would like to ask you to please apologize my confusion when I wrote "But calling it in Req_Cleanup() is also wrong" here. Yes, we do need to keep all stashed ocs until delivery is complete, because references to them might be used by filters basically anywhere (a filter could also add any task scoped data to the body).

nigoroll avatar Feb 14 '25 11:02 nigoroll

@dridi I have now taken your version of the patch in slightly modified form and added you as Co-Auther by the de-facto standard form (I should have looked this up for some other commits added recently and done the same...).

Regarding the jit vs. upfront allocation, one argument came to my mind which, I think, had not been mentioned, but which made me change my opinion: For a scenario with a substantial amount of restarts which only happen under some circumstances, users might be taken by surprise if their code "suddenly" needs more workspace. Because, to make their use case work, they need to increase the workspace size anyway, they rather notice sooner than by surprise. So, simply put, I now agree that this is the better option by the argument of "predictability before maximum efficiency".

A diff to your version can be found in 75e88cae8. I have done the following:

  • made max_restarts fit into uint16_t to make room for a magic value without spending another sizeof(void *).
  • Added CHECK_OBJ_NOTNULL accordingly
  • Modified Req_New() slightly to make Flexelint happy and avoid zeroing memory twice (while that is still done for the other struct allocated...)

nigoroll avatar Feb 14 '25 14:02 nigoroll

So I still think this is a good idea, but we need to deal with set HEADER = HEADER to actually reap benefits.

b00092.vtc shows how copying from an object is skipped when restarting:

**** v1    vsl|       1001 Debug           c resp.http.method[0]: (oc) 0x7ffb8a60709c O1_M...
**** v1    vsl|       1003 Debug           c resp.http.url[0]: (oc) 0x7ffb8a607199 /o2_...
**** v1    vsl|       1005 Debug           c req.method[0]: (?) 0x7ffb8a60709c O1_M...
**** v1    vsl|       1005 Debug           c req.url[0]: (?) 0x7ffb8a607199 /o2_...

Yes, this is no copy to HEADER, just from HEADER. I think we discussed this topic sufficiently.

But the main benefit is that we do not need to copy for all the other cases, which you questioned again. After this PR, the contract would officially become that all pointers passed around for VCL have to have at least PRIV_TASK lifetime. This was already the de-facto contract, with the exception of the case addressed here.

Regarding your code comments: I do not want to add 4 more comments to this, but what kind of a pointer do you think for return (synth(200, some_vmod.some_string())); the vmod would return?

nigoroll avatar Feb 17 '25 14:02 nigoroll

but what kind of a pointer do you think for return (synth(200, some_vmod.some_string())); the vmod would return?

To be very clear: I think the raised concern is without substance.

nigoroll avatar Feb 26 '25 13:02 nigoroll

@dridi because you requested changes on, I think, an unfounded basis, I would appreciate an update from your end.

nigoroll avatar Mar 19 '25 14:03 nigoroll

@dridi ?

nigoroll avatar Apr 19 '25 11:04 nigoroll

VERY good find by @bsdphk : vcl_miss gets called with an optional stale_oc, which we need to treat equally. Edit: this is addressed in 4f37d4cdb07f6be8cb3a20e6620c1d9d6a559ea6

Also we should consider creating a new request for each restart, as in "treat restarts more like sub-requests".

nigoroll avatar May 19 '25 13:05 nigoroll