
director reference counting races

Open nigoroll opened this issue 2 years ago • 4 comments

Expected Behavior

Director reference counting should avoid use-after-frees

Current Behavior

I believe there are still open races in our director reference counting.

Simple director assignment

VCL: set bereq.backend = vmod.backend()

this basically translates to: VRT_Assign_Backend(&ctx->bo->director_req, vmod_function())

So what happens when the last reference goes away between the time of vmod_function() returning and VRT_Assign_Backend() referencing it?

Resolve function

VRT_DirectorResolve() does not take/release references. But even if it did, a similar gap would remain as in the previous case: the backend could go away between d->vdir->methods->resolve() looking it up and VRT_DirectorResolve() taking a reference.

(edit: VDI_Http1Pipe() and VDI_GetHdr() call VRT_DirectorResolve() via VDI_Resolve() and they do take references. So the window is larger than for the simple assignment case, but still comparably small)

Possible Solution

#2725 suggested avoiding this and similar problems by putting all to-be-deleted backends onto a cool list, which would be processed once the last request possibly referring to any of the to-be-deleted backends had gone away. With the current implementation, it appears to me that any function returning a backend would need to provide a reference along with it.

Steps to Reproduce (for bugs)

No response

Context

Staring at https://github.com/nigoroll/libvmod-dynamic/issues/81

Varnish Cache version

No response

Operating system

No response

Source of binary packages used (if any)

No response

nigoroll avatar Jul 03 '23 11:07 nigoroll

bugwash: we tend towards changing interfaces returning directors to always return with another reference held. phk said he wanted to work on this.

nigoroll avatar Jul 03 '23 13:07 nigoroll

In the context of https://github.com/nigoroll/libvmod-dynamic/issues/117 I ran into this again. Here's a trivial way to trigger the issue:

  • add this delay to varnish:
diff --git a/bin/varnishd/cache/cache_vrt.c b/bin/varnishd/cache/cache_vrt.c
index d8630d347..55c0a6de6 100644
--- a/bin/varnishd/cache/cache_vrt.c
+++ b/bin/varnishd/cache/cache_vrt.c
@@ -799,6 +799,7 @@ VRT_BACKEND_string(VCL_BACKEND d)
 {
        if (d == NULL)
                return (NULL);
+       usleep(200 * 1000);
        CHECK_OBJ_NOTNULL(d, DIRECTOR_MAGIC);
        return (d->vcl_name);
 }
  • use this vcl
vcl 4.1;

import dynamic;
import directors;

backend proforma none;

sub vcl_init {
	new r1 = dynamic.resolver();

	new d = dynamic.director(
		resolver = r1.use(),
		ttl = 0.001s
	);

	new rr = directors.round_robin();
	rr.add_backend(d.backend("www.uplex.de"));
}
sub vcl_recv {
	return (synth(200));
}
sub vcl_synth {
	set resp.body = "" + rr.backend().resolve() + " " +
		d.backend("uplex.de").resolve() + """
""";
	return (deliver);
}
  • run a curl loop
while : ; do curl localhost:$VARNISH_PORT ; done
  • Observe panic within seconds:
Panic at: Tue, 30 Jul 2024 08:11:11 GMT
Assert error in VRT_BACKEND_string(), cache/cache_vrt.c line 803:
  Condition((d)->magic == 0x3336351d) not true.
version = varnish-trunk revision c79f57fc116a958631c79e1c24f57e729df17850, vrt api = 19.1
ident = Linux,6.1.0-15-amd64,x86_64,-jnone,-sdefault,-sdefault,-hcritbit,epoll
now = 3453.033732 (mono), 1722327071.728044 (real)
Backtrace:
  ip=0x565411f56ef5 sp=0x7f68260ca210 <VBT_format+0x35>
  ip=0x565411e8f0b3 sp=0x7f68260ca230 <pan_backtrace+0x33>
  ip=0x565411e8edfa sp=0x7f68260ca250 <pan_ic+0x37a>
  ip=0x565411f56125 sp=0x7f68260ca3d0 <VAS_Fail+0x55>
  ip=0x565411eafce2 sp=0x7f68260ca420 <VRT_BACKEND_string+0xa2>
  ip=0x7f6822055f4b sp=0x7f68260ca440 <VGC_function_vcl_synth+0xcb>
  ip=0x565411ec6e7c sp=0x7f68260ca490 <vcl_call_method+0x61c>
  ip=0x565411ec8c80 sp=0x7f68260ca5d0 <VCL_synth_method+0x1e0>
  ip=0x565411e99019 sp=0x7f68260ca610 <cnt_synth+0x279>
  ip=0x565411e98427 sp=0x7f68260ca6a0 <CNT_Request+0x727>
  ip=0x565411edcd0a sp=0x7f68260ca760 <HTTP1_Session+0xa2a>
  ip=0x565411edc250 sp=0x7f68260ca7e0 <http1_req+0x190>
  ip=0x565411ecd237 sp=0x7f68260ca810 <Pool_Work_Thread+0x7c7>
  ip=0x565411ecc933 sp=0x7f68260ca8a0 <WRK_Thread+0x333>
  ip=0x565411ecc5bb sp=0x7f68260cb430 <pool_thread+0xcb>
  ip=0x7f682edad044 sp=0x7f68260cb460 <pthread_condattr_setpshared+0x4d4>
  ip=0x7f682ee2d61c sp=0x7f68260cb500 <__xmknodat+0x23c>

FTR, I originally saw the issue in the form of a use-after-free; the delay is just a more reliable way to trigger a proper panic.

nigoroll avatar Jul 30 '24 08:07 nigoroll