
Assertion error for service with purged dynamic_domain

Open jake-dog opened this issue 3 years ago

Observed on 5c1c68b (w/ varnish 6.0.9) and b72c723 (w/ varnish 6.0.7).

Panic at: Fri, 28 Jan 2022 04:08:36 GMT
Assert error in service_resolve(), vmod_dynamic_service.c line 150:
  Condition((t->dom)->magic == 0x1bfe1345) not true.
version = varnish-6.0.9 revision 3383d3d02c6ee459f20e1f46cfcb9692726a6387, vrt api = 7.1
ident = Linux,5.4.77-flatcar,x86_64,-junix,-smalloc,-sdefault,-hcritbit,epoll
now = 18260503.813988 (mono), 1643342916.576233 (real)
Backtrace:
  0x43c46e: varnishd() [0x43c46e]
  0x49f272: varnishd(VAS_Fail+0x42) [0x49f272]
  0x7fb8159f13da: ./vmod_cache/_vmod_dynamic.04887395d86555991e34ec9874763b01e23b00135a5367941f17a5e0076a9451(+0x83da) [0x7fb8159f13da]
  0x4216a3: varnishd() [0x4216a3]
  0x421d83: varnishd(VDI_GetHdr+0x33) [0x421d83]
  0x42a859: varnishd() [0x42a859]
  0x45a4db: varnishd() [0x45a4db]
  0x45a9c0: varnishd() [0x45a9c0]
  0x7fb8257d2ea5: /lib64/libpthread.so.0(+0x7ea5) [0x7fb8257d2ea5]
  0x7fb8254fbb0d: /lib64/libc.so.6(clone+0x6d) [0x7fb8254fbb0d]

Full panic output from VTC test log.
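
For context, the failing condition is the kind of check that Varnish's CHECK_OBJ_NOTNULL() macro generates for the service target's domain pointer. A minimal sketch of what that check amounts to follows; the struct layout here is a simplified stand-in, and the magic value is simply the one printed in the panic, not taken from the vmod's headers:

#include <assert.h>
#include <stddef.h>

/* Sketch only: simplified stand-ins for the vmod's types (assumptions). */
#define DYNAMIC_DOMAIN_MAGIC 0x1bfe1345U   /* value as printed in the panic */

struct dynamic_domain {
        unsigned magic;                 /* set on allocation, cleared on free */
        /* ... */
};

struct service_target {
        struct dynamic_domain *dom;     /* pointer held by the service */
        /* ... */
};

static void
check_target(const struct service_target *t)
{
        /* Roughly what CHECK_OBJ_NOTNULL(t->dom, DYNAMIC_DOMAIN_MAGIC) does:
         * it fires when the domain behind t->dom has already been torn down
         * (its magic cleared or the memory reused) while the service still
         * holds the pointer, i.e. a use-after-free. */
        assert(t->dom != NULL);
        assert(t->dom->magic == DYNAMIC_DOMAIN_MAGIC);
}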

Reproduction steps

A gist has been created with a VTC that reproduces the panic. Please follow the instructions inside the gist, which include installing and configuring dnsmasq.

Other test conditions, such as changing A names in SRV responses, also reproduce the assertion error.

Investigation

The assertion error was not reproducible when the dynamic_domain usage timeout checks were hardcoded out, blocking timeout purges in the dynamic_search function. This demonstrates that the bug occurs when a dynamic_domain is purged while still referenced by a dynamic_service.

The life cycles of dynamic_domain structs, and the dynamic_service structs which reference them, are not tightly coupled. Under specific conditions data races are possible.

One such scenario, as demonstrated in the provided VTC, involves three components:

  1. dynamic_domain structs exclusively referenced by a dynamic_service only have their last_used property updated when SRV resolve is successful (code)
  2. the delay between SRV resolves is set to obj->ttl when the previous DNS resolve fails (code)
  3. when retrieving a specific dynamic_domain, any active domain which doesn't match is subject to a timeout purge (code); a sketch of this check follows the list
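
To make the third component concrete, here is a minimal sketch of that purge decision; the struct layout, the list handling, and the expired flag are simplified assumptions rather than the actual dynamic_search() code:

#include <stddef.h>

/* Simplified stand-ins for the vmod's real types (assumptions). */
struct dynamic_domain {
        double last_used;               /* refreshed when the domain is used */
        int expired;                    /* marked for purge */
        struct dynamic_domain *next;
};

struct dynamic_director {
        double domain_usage_tmo;        /* usage timeout from the VCL config */
        struct dynamic_domain *domains; /* active domains */
};

static void
purge_stale_domains(struct dynamic_director *obj, double now)
{
        struct dynamic_domain *dom;

        /* Any active domain not used within domain_usage_tmo is expired,
         * regardless of whether a dynamic_service still points at it.
         * Combined with components 1 and 2 above, a failing SRV resolve
         * means last_used is never refreshed, so the domain ages out while
         * the service keeps its pointer. */
        for (dom = obj->domains; dom != NULL; dom = dom->next)
                if (dom->last_used + obj->domain_usage_tmo < now)
                        dom->expired = 1;
}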

With these components in place, a panic can be triggered by a simple sequence of requests and DNS failures:

  1. Send a request for a valid d.service with DNS running
  2. Kill/block DNS
  3. Send several more requests to the same d.service from step 1 until DNS resolve timeout occurs
  4. Enable/unblock DNS
  5. Send several requests for a different d.service to trigger a timeout purge of the d.service from step 1
  6. Continue sending requests to the d.service from step 1 until panic

This sequence of lookups and purges looks like this (from the provided VTC log):

**** v1    0.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Lookup: 1645503346.958248 0.000000 0.000000
**** v1    0.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Results: 1645503346.958876 0.000628 0.000628
**** v1    0.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Lookup: 1645503346.959250 0.000000 0.000000
**** v1    0.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Results: 1645503346.959764 0.000514 0.000514
**** v1    0.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Update: 1645503346.959862 0.000613 0.000099
**** v1    0.7 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Update: 1645503346.959927 0.001679 0.001051
**** v1    2.2 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Lookup: 1645503348.470227 0.000000 0.000000
**** v1    7.3 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Results: 1645503353.475618 5.005390 5.005390
**** v1    7.3 vsl|          0 Error           - vmod-dynamic: vcl1 d1 _http._tcp.p-test-3.example.svc.cluster.local getdns 902 (All queries for the name timed out)
**** v1   15.8 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Lookup: 1645503362.042588 0.000000 0.000000
**** v1   15.8 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Results: 1645503362.043147 0.000558 0.000558
**** v1   15.8 vsl|          0 VCL_Log         - vmod-dynamic: vcl1 d1 p-test-3.example.svc.cluster.local.:8893 timeout
**** v1   15.8 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Done: 1645503362.043266 0.000000 0.000000
**** v1   15.8 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Lookup: 1645503362.043516 0.000000 0.000000
**** v1   15.8 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Results: 1645503362.043797 0.000281 0.000281
**** v1   15.8 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Update: 1645503362.043895 0.000379 0.000098
**** v1   15.8 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Update: 1645503362.043962 0.001374 0.000815
**** v1   17.3 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Lookup: 1645503363.554317 0.000000 0.000000
**** v1   17.3 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Results: 1645503363.554597 0.000280 0.000280
**** v1   17.3 vsl|          0 VCL_Log         - vmod-dynamic: vcl1 d1 p-test-3.example.svc.cluster.local.:8893 deleted
**** v1   17.3 vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Update: 1645503363.554614 0.000297 0.000017

Note that the service _http._tcp.p-test-3.example.svc.cluster.local is still active and has not changed, but the domain it points to, p-test-3.example.svc.cluster.local, has timed out and was eventually deleted.

Solutions

Though not tested, and not completely free of the possibility of data races, the following diff should avoid the bulk of the race conditions. It is based on 5c1c68b:

diff --git a/src/vmod_dynamic_service.c b/src/vmod_dynamic_service.c
index 2140a2d..c77ca06 100644
--- a/src/vmod_dynamic_service.c
+++ b/src/vmod_dynamic_service.c
@@ -521,3 +521,6 @@ service_lookup_thread(void *priv)
                            res->name, ret, res->strerror(ret));
-                       srv->deadline = results + obj->ttl;
+                       if (obj->ttl < obj->domain_usage_tmo / 2)
+                               srv->deadline = results + obj->ttl;
+                       else
+                               srv->deadline = results + (obj->domain_usage_tmo / 2);
                        dbg_res_details(NULL, srv->obj, res, res_priv);
@@ -737,2 +740,5 @@ vmod_director_service(VRT_CTX, struct VPFX(dynamic_director) *obj,
        struct dynamic_service *srv;
+       const struct service_prios *prios;
+       const struct service_prio *p;
+       const struct service_target *t;
        CHECK_OBJ_NOTNULL(ctx, VRT_CTX_MAGIC);
@@ -750,2 +756,16 @@ vmod_director_service(VRT_CTX, struct VPFX(dynamic_director) *obj,
        srv->last_used = ctx->now;
+
+       VRMB();
+       prios = srv->prios;
+
+       if (prios != NULL) {
+               VTAILQ_FOREACH(p, &prios->head, list) {
+                       CHECK_OBJ_NOTNULL(p, SERVICE_PRIO_MAGIC);
+                       VTAILQ_FOREACH(t, &p->targets, list) {
+                               CHECK_OBJ_NOTNULL(t, SERVICE_TARGET_MAGIC);
+                               CHECK_OBJ_NOTNULL(t->dom, DYNAMIC_DOMAIN_MAGIC);
+                               t->dom->last_used = ctx->now;
+                       }
+               }
+       }
        Lck_Unlock(&obj->mtx);

The problems with the trivial solution include:

  • still a possibility of data races
    • multiple services using a shared domain
    • multiple unique services with unique domains causing thread races to purge timed-out domains
  • the pattern of SRV resolves is inconsistent and diverges from the user configuration

I've sent #82, which introduces minimal reference counting logic to block dynamic_domain timeout purges while a domain is referenced by a dynamic_service, thus tightly coupling their life cycles; a rough sketch of the idea follows.
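
The gist of that reference-counting approach, as a rough sketch (the field and function names here are illustrative assumptions, not the code in #82): a dynamic_service takes a reference on each dynamic_domain it targets, and the usage-timeout purge skips any domain whose count is non-zero.

#include <assert.h>

struct dynamic_domain {
        unsigned service_refs;          /* references held by services */
        double last_used;
        int expired;
};

/* Taken/dropped while a dynamic_service target points at the domain;
 * the caller is assumed to hold the director lock. */
static void
domain_service_ref(struct dynamic_domain *dom)
{
        dom->service_refs++;
}

static void
domain_service_unref(struct dynamic_domain *dom)
{
        assert(dom->service_refs > 0);
        dom->service_refs--;
}

/* The usage-timeout purge now leaves referenced domains alone. */
static void
maybe_expire(struct dynamic_domain *dom, double domain_usage_tmo, double now)
{
        if (dom->service_refs == 0 &&
            dom->last_used + domain_usage_tmo < now)
                dom->expired = 1;
}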

jake-dog avatar Feb 18 '22 04:02 jake-dog

To whom it may concern,

I've updated the original issue description with simpler reproduction steps, and a better explanation of ways in which the issue can manifest.

I have also updated my proposed solution #82 which fixes several other behaviors associated with service backends.

jake-dog avatar Feb 22 '22 06:02 jake-dog

Thank you, I've noticed this issue, but need to prioritize paid work. Will get back to you

nigoroll avatar Feb 23 '22 10:02 nigoroll

I let this issue sit open for so long because I knew that we had a use-after-free due to how varnish-cache worked, which we could not safely avoid in this vmod alone. Now that varnish-cache, as of 7.3, has gained the proper infrastructure with reference-counted backends, I have implemented the necessary changes (with a lot of groundwork to build upon) as of https://github.com/nigoroll/libvmod-dynamic/commit/41f20790aedfc6d8f5cae4b3b8ffb82cd9b302c0. I cannot reproduce the issue from your test case (thank you!), so would you please give it a go? You need varnish 7.3 or later. (edit: grammar, polish)

nigoroll avatar Jun 29 '23 14:06 nigoroll

I regret to report that I was able to reproduce the issue. Hopefully it's user error but, if it is, I haven't figured out what I did wrong.

I started a debian:11 container, installed varnish 7.3.0, and compiled 41f20790aedfc6d8f5cae4b3b8ffb82cd9b302c0, then re-ran r81.vtc, and it returned the following assertion error:

**** v1    CLI RX|Panic at: Sat, 01 Jul 2023 18:31:29 GMT
**** v1    CLI RX|Assert error in service_resolve(), vmod_dynamic_service.c line 159:
**** v1    CLI RX|  Condition((t->dom)->magic == 0x1bfe1345) not true.
**** v1    CLI RX|version = varnish-7.3.0 revision 84d79120b6d17b11819a663a93160743f293e63f, vrt api = 17.0
**** v1    CLI RX|ident = Linux,6.2.6-76060206-generic,x86_64,-junix,-sdefault,-sdefault,-hcritbit,epoll
**** v1    CLI RX|now = 65401.230598 (mono), 1688236289.146731 (real)
**** v1    CLI RX|Backtrace:
**** v1    CLI RX|  0x5618eee6cd9c: varnishd(+0x5ad9c) [0x5618eee6cd9c]
**** v1    CLI RX|  0x5618eeeea578: varnishd(VAS_Fail+0x18) [0x5618eeeea578]
**** v1    CLI RX|  0x7f64303b53b9: ./vmod_cache/_vmod_dynamic.99f589bd7d2f24a756baa78d3445c179c9428e09634fc67b82af176f80022348(+0x93b9) [0x7f64303b53b9]
**** v1    CLI RX|  0x5618eee4da93: varnishd(VRT_DirectorResolve+0x43) [0x5618eee4da93]
**** v1    CLI RX|  0x5618eee4db8e: varnishd(+0x3bb8e) [0x5618eee4db8e]
**** v1    CLI RX|  0x5618eee4dcf0: varnishd(VDI_GetHdr+0x60) [0x5618eee4dcf0]
**** v1    CLI RX|  0x5618eee59693: varnishd(+0x47693) [0x5618eee59693]
**** v1    CLI RX|  0x5618eee588d6: varnishd(+0x468d6) [0x5618eee588d6]
**** v1    CLI RX|  0x5618eee94322: varnishd(+0x82322) [0x5618eee94322]
**** v1    CLI RX|  0x5618eee948d4: varnishd(+0x828d4) [0x5618eee948d4]

Looking through the test log, I can see that the early purge of the domain p-test-3.example.svc.cluster.local, noted in the issue summary, still occurs.

**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Lookup: 1688236272.014368 0.000000 0.000000
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Results: 1688236272.014684 0.000317 0.000317
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Lookup: 1688236272.014757 0.000000 0.000000
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Results: 1688236272.014947 0.000190 0.000190
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Update: 1688236272.015109 0.000352 0.000162
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Update: 1688236272.015133 0.000766 0.000449
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Lookup: 1688236273.525239 0.000000 0.000000
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-3.example.svc.cluster.local) Results: 1688236278.550587 5.025348 5.025348
**** v1    vsl|          0 Error           - vmod-dynamic: vcl1 d1 _http._tcp.p-test-3.example.svc.cluster.local getdns 902 (All queries for the name timed out)
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Lookup: 1688236287.040717 0.000000 0.000000
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Results: 1688236287.040961 0.000245 0.000245
**** v1    vsl|          0 VCL_Log         - vmod-dynamic: vcl1 d1 p-test-3.example.svc.cluster.local.:8893 timeout
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-3.example.svc.cluster.local.:8893) Done: 1688236287.041001 0.000000 0.000000
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Lookup: 1688236287.041046 0.000000 0.000000
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Results: 1688236287.041254 0.000208 0.000208
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(p-test-2.example.svc.cluster.local.:8892) Update: 1688236287.041366 0.000321 0.000113
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Update: 1688236287.041388 0.000672 0.000427
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Lookup: 1688236288.551718 0.000000 0.000000
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Results: 1688236288.552140 0.000422 0.000422
**** v1    vsl|          0 VCL_Log         - vmod-dynamic: vcl1 d1 p-test-3.example.svc.cluster.local.:8893 deleted
**** v1    vsl|          0 Timestamp       - vmod-dynamic vcl1.d1(srv _http._tcp.p-test-2.example.svc.cluster.local) Update: 1688236288.552416 0.000698 0.000276

I am still reviewing the code changes in 41f2079 and varnish-cache 7.3 so that I might better understand where things went wrong.

jake-dog avatar Jul 01 '23 19:07 jake-dog

@jake-dog thank you very much, I am going to stare at it again.

nigoroll avatar Jul 03 '23 08:07 nigoroll

See the issue linked above; it appears to me that reference counting cannot work yet.

nigoroll avatar Jul 03 '23 11:07 nigoroll

... but still this issue is independent: services need to take references on domains within vmod_dynamic. I'll have a look.

nigoroll avatar Jul 03 '23 11:07 nigoroll

@jake-dog can you please try dc08a31678f5e9ce0f5f2469e6b9bcdc728e1077 and reopen this ticket if it does not fix the issue? Thank you.

nigoroll avatar Jul 03 '23 15:07 nigoroll

It seems that dc08a31 resolves the use-after-free issue, but introduces unexpected behaviors I attempted to address in #82, notably:

  • the dynamic_domain referenced by the service is still purged, resulting in "director . . . returned no backend". At first glance this seems like the correct behavior, but it's inconsistent with the behavior of a dynamic_domain by itself. For instance, if the nameserver were to become inaccessible, a standalone dynamic_domain can continue to be used for as long as it keeps being used. A dynamic_domain linked to a dynamic_service, on the other hand, will be purged as the result of a nameserver outage even if the dynamic_service continues to be used. Further, I would argue that a nameserver being inaccessible/down isn't a good reason to stop attempting to reach already-resolved backends.

  • the idiosyncratic behavior of limiting the SRV resolve TTL to 0.5 * domain_usage_tmo remains. I could be wrong, but I believe the only reason this behavior exists is as a mitigation for the use-after-free issue resolved in dc08a31.

I believe those outstanding issues are better suited to new issues, however. Great work, and thank you!

jake-dog avatar Jul 09 '23 21:07 jake-dog

@jake-dog yes, it took me a detour to realize it in #102, but the lifetime issue should be fixed for real now. Sorry for noticing your comment so late.

I will look at the TTL now.

nigoroll avatar Jul 14 '23 13:07 nigoroll

2d51da74680e849c8411bcd6d85cd53c149b3052

nigoroll avatar Jul 14 '23 13:07 nigoroll

Haven't had much time to test it but so far 2d51da7 looks great.

The only oddity I see is that, when running r81.vtc, it logs that it deleted p-test-3 but continues to use it:

**** v1    vsl|          0 VCL_Log         - vmod-dynamic vcl1 d1 p-test-3.example.svc.cluster.local.:8893 deleted (expired)
**** v1    vsl|       1007 FetchError      b backend d1.p-test-3.example.svc.cluster.local.(127.0.0.1:8893): fail errno 111 (Connection refused)
**** v1    vsl|       1006 ReqMethod       c GET
**** v1    vsl|       1006 ReqURL          c /
**** v1    vsl|       1006 RespStatus      c 503

To be clear: this is the desired behavior for backends; the logging is just misleading.

I will be running more tests in the next few weeks, and will certainly create an issue if I discover anything more nefarious.

jake-dog avatar Jul 21 '23 03:07 jake-dog