director reference counting races
Expected Behavior
Director reference counting should avoid use-after-frees
Current Behavior
I believe there are still open races with our director reference counting
Simple director assignment
VCL: set bereq.backend = vmod.backend()
this basically translates to: VRT_Assign_Backend(&ctx->bo->director_req, vmod_function())
So what happens when the last reference goes away between the time of vmod_function() returning and VRT_Assign_Backend() referencing it?
Resolve function
VRT_DirectorResolve() does not take/release references. But even if it would, it still left a similar gap as in the previous case, the backend could go away between d->vdir->methods->resolve() looking it up and VRT_DirectorResolve() taking a reference.
(edit: VDI_Http1Pipe() and VDI_GetHdr() call VRT_DirectorResolve() via VDI_Resolve() and they do take references. So the window is larger than for the simple assignment case, but still comparably small)
Possible Solution
#2725 suggested to avoid this and similar problems by putting all to-be-deleted backends onto a cool list, which would be worked when the last request possibly referring any of the to-be-deleted backends went away. With the current implementation, it would appear to me that any function returning a backend would need to provide the reference with it.
Steps to Reproduce (for bugs)
No response
Context
Staring at https://github.com/nigoroll/libvmod-dynamic/issues/81
Varnish Cache version
No response
Operating system
No response
Source of binary packages used (if any)
No response
bugwash: we tend towards changing interfaces returning directors to always return with another reference held. phk said he wanted to work on this.
In the context of https://github.com/nigoroll/libvmod-dynamic/issues/117 I ran into this issue again. Here's a trivial way to trigger the issue:
- add this delay to varnish:
diff --git a/bin/varnishd/cache/cache_vrt.c b/bin/varnishd/cache/cache_vrt.c
index d8630d347..55c0a6de6 100644
--- a/bin/varnishd/cache/cache_vrt.c
+++ b/bin/varnishd/cache/cache_vrt.c
@@ -799,6 +799,7 @@ VRT_BACKEND_string(VCL_BACKEND d)
{
if (d == NULL)
return (NULL);
+ usleep(200 * 1000);
CHECK_OBJ_NOTNULL(d, DIRECTOR_MAGIC);
return (d->vcl_name);
}
- use this vcl
vcl 4.1;
import dynamic;
import directors;
backend proforma none;
sub vcl_init {
new r1 = dynamic.resolver();
new d = dynamic.director(
resolver = r1.use(),
ttl = 0.001s
);
new rr = directors.round_robin();
rr.add_backend(d.backend("www.uplex.de"));
}
sub vcl_recv {
return (synth(200));
}
sub vcl_synth {
set resp.body = "" + rr.backend().resolve() + " " +
d.backend("uplex.de").resolve() + """
""";
return (deliver);
}
- run a curl loop
while : ; do curl localhost:$VARNISH_PORT ; done
- Observe panic within seconds:
Panic at: Tue, 30 Jul 2024 08:11:11 GMT
Assert error in VRT_BACKEND_string(), cache/cache_vrt.c line 803:
Condition((d)->magic == 0x3336351d) not true.
version = varnish-trunk revision c79f57fc116a958631c79e1c24f57e729df17850, vrt api = 19.1
ident = Linux,6.1.0-15-amd64,x86_64,-jnone,-sdefault,-sdefault,-hcritbit,epoll
now = 3453.033732 (mono), 1722327071.728044 (real)
Backtrace:
ip=0x565411f56ef5 sp=0x7f68260ca210 <VBT_format+0x35>
ip=0x565411e8f0b3 sp=0x7f68260ca230 <pan_backtrace+0x33>
ip=0x565411e8edfa sp=0x7f68260ca250 <pan_ic+0x37a>
ip=0x565411f56125 sp=0x7f68260ca3d0 <VAS_Fail+0x55>
ip=0x565411eafce2 sp=0x7f68260ca420 <VRT_BACKEND_string+0xa2>
ip=0x7f6822055f4b sp=0x7f68260ca440 <VGC_function_vcl_synth+0xcb>
ip=0x565411ec6e7c sp=0x7f68260ca490 <vcl_call_method+0x61c>
ip=0x565411ec8c80 sp=0x7f68260ca5d0 <VCL_synth_method+0x1e0>
ip=0x565411e99019 sp=0x7f68260ca610 <cnt_synth+0x279>
ip=0x565411e98427 sp=0x7f68260ca6a0 <CNT_Request+0x727>
ip=0x565411edcd0a sp=0x7f68260ca760 <HTTP1_Session+0xa2a>
ip=0x565411edc250 sp=0x7f68260ca7e0 <http1_req+0x190>
ip=0x565411ecd237 sp=0x7f68260ca810 <Pool_Work_Thread+0x7c7>
ip=0x565411ecc933 sp=0x7f68260ca8a0 <WRK_Thread+0x333>
ip=0x565411ecc5bb sp=0x7f68260cb430 <pool_thread+0xcb>
ip=0x7f682edad044 sp=0x7f68260cb460 <pthread_condattr_setpshared+0x4d4>
ip=0x7f682ee2d61c sp=0x7f68260cb500 <__xmknodat+0x23c>
FTR, I originally saw the issue in the form of a use-after-free, the delay is just a more reliable way to trigger a proper panic.
For vmod_dynamic, this is now worked around by always keeping another reference for the duration of the task.