libvmod-dynamic
Assertion error for service with purged dynamic_domain
Observed on 5c1c68b (w/ varnish 6.0.9) and b72c723 (w/ varnish 6.0.7).
Panic at: Fri, 28 Jan 2022 04:08:36 GMT
Assert error in service_resolve(), vmod_dynamic_service.c line 150:
Condition((t->dom)->magic == 0x1bfe1345) not true.
version = varnish-6.0.9 revision 3383d3d02c6ee459f20e1f46cfcb9692726a6387, vrt api = 7.1
ident = Linux,5.4.77-flatcar,x86_64,-junix,-smalloc,-sdefault,-hcritbit,epoll
now = 18260503.813988 (mono), 1643342916.576233 (real)
Backtrace:
0x43c46e: varnishd() [0x43c46e]
0x49f272: varnishd(VAS_Fail+0x42) [0x49f272]
0x7fb8159f13da: ./vmod_cache/_vmod_dynamic.04887395d86555991e34ec9874763b01e23b00135a5367941f17a5e0076a9451(+0x83da) [0x7fb8159f13da]
0x4216a3: varnishd() [0x4216a3]
0x421d83: varnishd(VDI_GetHdr+0x33) [0x421d83]
0x42a859: varnishd() [0x42a859]
0x45a4db: varnishd() [0x45a4db]
0x45a9c0: varnishd() [0x45a9c0]
0x7fb8257d2ea5: /lib64/libpthread.so.0(+0x7ea5) [0x7fb8257d2ea5]
0x7fb8254fbb0d: /lib64/libc.so.6(clone+0x6d) [0x7fb8254fbb0d]
Full panic output from the VTC test log.
Reproduction steps
A gist has been created with a VTC that reproduces the panic. Please follow the instructions inside the gist, which include installing and configuring dnsmasq.
Other test conditions, such as changing the A names in SRV responses, have also reproduced the assertion error.
Investigation
The assertion error was not reproducible when the dynamic_domain usage timeout checks were hardcoded out, blocking timeout purges in the dynamic_search function. This demonstrates that the bug occurs when a dynamic_domain is purged while it is still referenced by a dynamic_service.
The life cycles of dynamic_domain structs and the dynamic_service structs which reference them are not tightly coupled, so under specific conditions data races are possible.
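To make the failure mode concrete, here is a self-contained sketch of the pattern with hypothetical, simplified names (not the vmod's actual code): a domain object is torn down by a usage-timeout purge while a service target still holds a raw pointer to it, so the later magic-number check on that pointer, the equivalent of the CHECK_OBJ_NOTNULL() behind the panic above, trips on freed memory.

/* Sketch of the race, hypothetical names only.
 * purge_if_expired() stands in for the usage-timeout purge done by the
 * lookup thread; resolve_target() stands in for the walk over service
 * targets in service_resolve(). */
#include <assert.h>
#include <stdlib.h>

#define DOMAIN_MAGIC 0x1bfe1345U

struct domain {
    unsigned magic;         /* DOMAIN_MAGIC while the object is alive */
    double last_used;       /* only refreshed on a successful resolve */
};

struct target {
    struct domain *dom;     /* raw pointer, nothing pins the domain */
};

/* Lookup thread: retire a domain whose usage timeout has elapsed. */
static void
purge_if_expired(struct domain *dom, double now, double usage_tmo)
{
    if (now - dom->last_used > usage_tmo) {
        dom->magic = 0;     /* FREE_OBJ()-style teardown clears the magic */
        free(dom);          /* any target->dom now dangles */
    }
}

/* Request thread: resolve a backend through the service's target. */
static void
resolve_target(const struct target *t)
{
    /* equivalent of CHECK_OBJ_NOTNULL(t->dom, DYNAMIC_DOMAIN_MAGIC);
     * reading freed memory is undefined, but in practice the cleared
     * magic makes this the assertion seen in the panic */
    assert(t->dom != NULL && t->dom->magic == DOMAIN_MAGIC);
}

int
main(void)
{
    struct domain *dom = calloc(1, sizeof *dom);
    struct target t = { .dom = dom };

    assert(dom != NULL);
    dom->magic = DOMAIN_MAGIC;
    dom->last_used = 0.0;

    purge_if_expired(dom, 100.0, 10.0);   /* the purge wins the race */
    resolve_target(&t);                   /* use-after-free: assert fires */
    return (0);
}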
One such scenario, as demonstrated in the provided VTC, involves three components:
- dynamic_domain structs exclusively referenced by a dynamic_service only have their last_used property updated when an SRV resolve is successful (code)
- the delay between SRV resolves is set to obj->ttl when the previous DNS resolve fails (code)
- when retrieving a specific dynamic_domain, any active domain which doesn't match is subject to a timeout purge (code)
With these components a panic can be instigated via a simple sequence of requests and DNS failures:
1. Send a request for a valid d.service with DNS running
2. Kill/block DNS
3. Send several more requests to the same d.service from step 1 until a DNS resolve timeout occurs
4. Enable/unblock DNS
5. Send several requests for a different d.service to trigger a timeout purge of the d.service from step 1
6. Continue sending requests to the d.service from step 1 until the panic occurs
This sequence of lookups and purges looks like this (from the provided VTC log):
**** v1 0.7 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Lookup: 1645503346.958248 0.000000 0.000000
**** v1 0.7 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Results: 1645503346.958876 0.000628 0.000628
**** v1 0.7 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Lookup: 1645503346.959250 0.000000 0.000000
**** v1 0.7 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Results: 1645503346.959764 0.000514 0.000514
**** v1 0.7 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Update: 1645503346.959862 0.000613 0.000099
**** v1 0.7 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Update: 1645503346.959927 0.001679 0.001051
**** v1 2.2 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Lookup: 1645503348.470227 0.000000 0.000000
**** v1 7.3 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Results: 1645503353.475618 5.005390 5.005390
**** v1 7.3 vsl| 0 Error - vmod-dynamic: vcl1 d1 _http._tcp.p-test-3.example.svc.cluster.local getdns 902 (All queries for the name timed out)
**** v1 15.8 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Lookup: 1645503362.042588 0.000000 0.000000
**** v1 15.8 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Results: 1645503362.043147 0.000558 0.000558
**** v1 15.8 vsl| 0 VCL_Log - vmod-dynamic: vcl1 d1 p-test-3.example.svc.cluster.local.:8893 timeout
**** v1 15.8 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Done: 1645503362.043266 0.000000 0.000000
**** v1 15.8 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Lookup: 1645503362.043516 0.000000 0.000000
**** v1 15.8 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Results: 1645503362.043797 0.000281 0.000281
**** v1 15.8 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Update: 1645503362.043895 0.000379 0.000098
**** v1 15.8 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Update: 1645503362.043962 0.001374 0.000815
**** v1 17.3 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Lookup: 1645503363.554317 0.000000 0.000000
**** v1 17.3 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Results: 1645503363.554597 0.000280 0.000280
**** v1 17.3 vsl| 0 VCL_Log - vmod-dynamic: vcl1 d1 p-test-3.example.svc.cluster.local.:8893 deleted
**** v1 17.3 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Update: 1645503363.554614 0.000297 0.000017
Note that the service _http._tcp.p-test-3.example.svc.cluster.local is still active and has not changed, but the domain it points to, p-test-3.example.svc.cluster.local, has timed out and was eventually deleted.
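The purge itself is triggered from an unrelated lookup. Below is a minimal sketch with hypothetical, simplified names (the real logic lives in the vmod's domain lookup path, and the timing values are illustrative) of how a request for p-test-2 can retire p-test-3's domain once its last_used has gone stale during the DNS outage.

/* Sketch, hypothetical names: while looking up one address, every other
 * active domain whose last_used exceeds the usage timeout is retired,
 * regardless of whether a service still points at it. */
#include <stddef.h>
#include <string.h>

struct sk_domain {
    const char *addr;
    double last_used;
    int retired;
};

static struct sk_domain *
sk_lookup(struct sk_domain *doms, size_t n, const char *addr,
    double now, double usage_tmo)
{
    struct sk_domain *hit = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if (strcmp(doms[i].addr, addr) == 0)
            hit = &doms[i];              /* the requested domain */
        else if (now - doms[i].last_used > usage_tmo)
            doms[i].retired = 1;         /* p-test-3's domain ends up here */
    }
    return (hit);
}

int
main(void)
{
    struct sk_domain doms[] = {
        { "p-test-3.example.svc.cluster.local.:8893", 0.0, 0 },
        { "p-test-2.example.svc.cluster.local.:8892", 14.0, 0 },
    };

    /* illustrative numbers: a lookup for p-test-2 at now=15 with a usage
     * timeout of 10 retires p-test-3 as a side effect */
    (void)sk_lookup(doms, 2, doms[1].addr, 15.0, 10.0);
    return (doms[0].retired ? 0 : 1);
}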
Solutions
Though untested, and not completely free of data races, the following patch should avoid the bulk of the race conditions. It is based on 5c1c68b: the first hunk clamps the delay before the next SRV resolve to half of the domain usage timeout when the previous resolve failed, and the second hunk updates last_used on every domain referenced by the service whenever the service itself is used.
diff --git a/src/vmod_dynamic_service.c b/src/vmod_dynamic_service.c
index 2140a2d..c77ca06 100644
--- a/src/vmod_dynamic_service.c
+++ b/src/vmod_dynamic_service.c
@@ -521,3 +521,6 @@ service_lookup_thread(void *priv)
res->name, ret, res->strerror(ret));
- srv->deadline = results + obj->ttl;
+ if (obj->ttl < obj->domain_usage_tmo / 2)
+ srv->deadline = results + obj->ttl;
+ else
+ srv->deadline = results + (obj->domain_usage_tmo / 2);
dbg_res_details(NULL, srv->obj, res, res_priv);
@@ -737,2 +740,5 @@ vmod_director_service(VRT_CTX, struct VPFX(dynamic_director) *obj,
struct dynamic_service *srv;
+ const struct service_prios *prios;
+ const struct service_prio *p;
+ const struct service_target *t;
CHECK_OBJ_NOTNULL(ctx, VRT_CTX_MAGIC);
@@ -750,2 +756,16 @@ vmod_director_service(VRT_CTX, struct VPFX(dynamic_director) *obj,
srv->last_used = ctx->now;
+
+ VRMB();
+ prios = srv->prios;
+
+ if (prios != NULL) {
+ VTAILQ_FOREACH(p, &prios->head, list) {
+ CHECK_OBJ_NOTNULL(p, SERVICE_PRIO_MAGIC);
+ VTAILQ_FOREACH(t, &p->targets, list) {
+ CHECK_OBJ_NOTNULL(t, SERVICE_TARGET_MAGIC);
+ CHECK_OBJ_NOTNULL(t->dom, DYNAMIC_DOMAIN_MAGIC);
+ t->dom->last_used = ctx->now;
+ }
+ }
+ }
Lck_Unlock(&obj->mtx);
The problems with this trivial solution include:
- data races are still possible, for example:
  - multiple services using a shared domain
  - multiple unique services with unique domains racing to purge timed-out domains
- the pattern of SRV resolves becomes inconsistent and diverges from the user configuration
I've sent #82, which introduces minimal reference counting logic to block dynamic_domain timeout purges while a domain is referenced by a dynamic_service, thus tightly coupling their life cycles.
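For illustration only, the core idea amounts to something like the following sketch; the field and function names are hypothetical and this is not the actual code in #82. A service target takes a reference on each domain it points at, and the usage-timeout purge skips any domain whose reference count is non-zero.

/* Reference-counting sketch, hypothetical names. */
#include <assert.h>

struct rc_domain {
    unsigned refcnt;     /* number of service targets pointing at this domain */
    double last_used;
};

static void
rc_domain_ref(struct rc_domain *dom)
{
    dom->refcnt++;
}

static void
rc_domain_unref(struct rc_domain *dom)
{
    assert(dom->refcnt > 0);
    dom->refcnt--;
}

/* The usage-timeout purge becomes a no-op while any service holds a ref. */
static int
rc_domain_expired(const struct rc_domain *dom, double now, double usage_tmo)
{
    if (dom->refcnt > 0)
        return (0);
    return (now - dom->last_used > usage_tmo);
}

int
main(void)
{
    struct rc_domain dom = { 0, 0.0 };

    rc_domain_ref(&dom);                            /* a target points at it */
    assert(!rc_domain_expired(&dom, 100.0, 10.0));  /* referenced: never purged */
    rc_domain_unref(&dom);                          /* reference dropped */
    assert(rc_domain_expired(&dom, 100.0, 10.0));   /* now it may expire */
    return (0);
}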
To whom it may concern,
I've updated the original issue description with simpler reproduction steps and a better explanation of the ways in which the issue can manifest.
I have also updated my proposed solution #82, which fixes several other behaviors associated with service backends.
Thank you, I've noticed this issue, but I need to prioritize paid work. I will get back to you.
I let this issue sit open for so long because I knew that we had a use-after-free due to how varnish-cache worked, which we could not safely avoid in this vmod alone. Now that varnish-cache, as of 7.3, has gained the proper infrastructure with reference-counted backends, I have implemented the necessary changes (with a lot of groundwork to build upon) as of https://github.com/nigoroll/libvmod-dynamic/commit/41f20790aedfc6d8f5cae4b3b8ffb82cd9b302c0. I cannot reproduce the issue with your test case (thank you!), so would you please give it a go? You need varnish 7.3 or later.
I regret to report that I was able to reproduce the issue. Hopefully it's user error, but if it is, I haven't figured out what I did wrong.
I started a debian:11 container, installed varnish 7.3.0, and compiled 41f20790aedfc6d8f5cae4b3b8ffb82cd9b302c0, then re-ran r81.vtc, and it returned the following assertion error:
**** v1 CLI RX|Panic at: Sat, 01 Jul 2023 18:31:29 GMT
**** v1 CLI RX|Assert error in service_resolve(), vmod_dynamic_service.c line 159:
**** v1 CLI RX| Condition((t->dom)->magic == 0x1bfe1345) not true.
**** v1 CLI RX|version = varnish-7.3.0 revision 84d79120b6d17b11819a663a93160743f293e63f, vrt api = 17.0
**** v1 CLI RX|ident = Linux,6.2.6-76060206-generic,x86_64,-junix,-sdefault,-sdefault,-hcritbit,epoll
**** v1 CLI RX|now = 65401.230598 (mono), 1688236289.146731 (real)
**** v1 CLI RX|Backtrace:
**** v1 CLI RX| 0x5618eee6cd9c: varnishd(+0x5ad9c) [0x5618eee6cd9c]
**** v1 CLI RX| 0x5618eeeea578: varnishd(VAS_Fail+0x18) [0x5618eeeea578]
**** v1 CLI RX| 0x7f64303b53b9: ./vmod_cache/_vmod_dynamic.99f589bd7d2f24a756baa78d3445c179c9428e09634fc67b82af176f80022348(+0x93b9) [0x7f64303b53b9]
**** v1 CLI RX| 0x5618eee4da93: varnishd(VRT_DirectorResolve+0x43) [0x5618eee4da93]
**** v1 CLI RX| 0x5618eee4db8e: varnishd(+0x3bb8e) [0x5618eee4db8e]
**** v1 CLI RX| 0x5618eee4dcf0: varnishd(VDI_GetHdr+0x60) [0x5618eee4dcf0]
**** v1 CLI RX| 0x5618eee59693: varnishd(+0x47693) [0x5618eee59693]
**** v1 CLI RX| 0x5618eee588d6: varnishd(+0x468d6) [0x5618eee588d6]
**** v1 CLI RX| 0x5618eee94322: varnishd(+0x82322) [0x5618eee94322]
**** v1 CLI RX| 0x5618eee948d4: varnishd(+0x828d4) [0x5618eee948d4]
Looking through the test log, I can see that the early purge of the domain p-test-3.example.svc.cluster.local, noted in the issue summary, still occurs.
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Lookup: 1688236272.014368 0.000000 0.000000
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Results: 1688236272.014684 0.000317 0.000317
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Lookup: 1688236272.014757 0.000000 0.000000
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Results: 1688236272.014947 0.000190 0.000190
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Update: 1688236272.015109 0.000352 0.000162
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Update: 1688236272.015133 0.000766 0.000449
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Lookup: 1688236273.525239 0.000000 0.000000
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Results: 1688236278.550587 5.025348 5.025348
**** v1 vsl| 0 Error - vmod-dynamic: vcl1 d1 _http._tcp.p-test-3.example.svc.cluster.local getdns 902 (All queries for the name timed out)
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Lookup: 1688236287.040717 0.000000 0.000000
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Results: 1688236287.040961 0.000245 0.000245
**** v1 vsl| 0 VCL_Log - vmod-dynamic: vcl1 d1 p-test-3.example.svc.cluster.local.:8893 timeout
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Done: 1688236287.041001 0.000000 0.000000
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Lookup: 1688236287.041046 0.000000 0.000000
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Results: 1688236287.041254 0.000208 0.000208
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Update: 1688236287.041366 0.000321 0.000113
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Update: 1688236287.041388 0.000672 0.000427
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Lookup: 1688236288.551718 0.000000 0.000000
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Results: 1688236288.552140 0.000422 0.000422
**** v1 vsl| 0 VCL_Log - vmod-dynamic: vcl1 d1 p-test-3.example.svc.cluster.local.:8893 deleted
**** v1 vsl| 0 Timestamp - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Update: 1688236288.552416 0.000698 0.000276
I am still reviewing the code changes in 41f2079 and varnish-cache 7.3, so that I might better understand where things went wrong.
@jake-dog thank you very much, I am going to stare at it again.
See the issue linked above; it appears to me that reference counting cannot work yet.
... but this issue is still independent: services need to take references on domains within vmod_dynamic. I'll have a look.
@jake-dog can you please try dc08a31678f5e9ce0f5f2469e6b9bcdc728e1077 and reopen this ticket if it does not fix the issue? Thank you.
It seems that dc08a31 resolves the use-after-free issue, but it introduces unexpected behaviors that I attempted to address in #82, notably:
- the dynamic_domain referenced by the service is still purged, resulting in "director ... returned no backend". At first glance this seems like the correct behavior, but it's inconsistent with the behavior of a dynamic_domain by itself. For instance, if the nameserver were to become inaccessible, a dynamic_domain can continue to be used so long as it keeps being used. A dynamic_domain linked to a dynamic_service, on the other hand, will be purged as the result of a nameserver outage even if the dynamic_service continues to be used. Further, I would argue that a nameserver being inaccessible/down isn't a good reason to stop attempting to reach already resolved backends.
- the idiosyncratic behavior of limiting the SRV resolve TTL to 0.5 * domain_usage_tmo remains (see the sketch after this list). I could be wrong, but I believe the only reason this behavior exists is as a mitigation for the use-after-free issue resolved in dc08a31.
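For reference, the clamp mentioned in the second point mirrors the first hunk of the patch earlier in this issue; as a sketch (not necessarily how the vmod implements it), it amounts to:

/* Sketch of the SRV re-resolve deadline clamp from the earlier patch hunk:
 * never schedule the next resolve further out than half the domain usage
 * timeout, so a still-referenced domain cannot hit its usage timeout
 * between resolves. */
static double
next_srv_deadline(double results, double ttl, double domain_usage_tmo)
{
    if (ttl < domain_usage_tmo / 2)
        return (results + ttl);
    return (results + domain_usage_tmo / 2);
}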
I believe those outstanding items are better suited to new issues, however. Great work, and thank you!
@jake-dog yes, it took me a detour to realize it in #102, but the lifetime issue should be fixed for real now. Sorry for noticing your comment too late.
I will look at the TTL now.
2d51da74680e849c8411bcd6d85cd53c149b3052
I haven't had much time to test it, but so far 2d51da7 looks great.
The only oddity I see is that, when running r81.vtc, it logs that it deleted p-test-3 but continues to use it:
**** v1 vsl| 0 VCL_Log - vmod-dynamic vcl1 d1 p-test-3.example.svc.cluster.local.:8893 deleted (expired)
**** v1 vsl| 1007 FetchError b backend d1.p-test-3.example.svc.cluster.local.(127.0.0.1:8893): fail errno 111 (Connection refused)
**** v1 vsl| 1006 ReqMethod c GET
**** v1 vsl| 1006 ReqURL c /
**** v1 vsl| 1006 RespStatus c 503
To be clear: this is the desired behavior for backends; the logging is just misleading.
I will be running more tests in the next few weeks, and will certainly create an issue if I discover anything more nefarious.