Panic on vcl.load: Wrong turn at cache/cache_backend_probe.c:710: running
Expected Behavior
No panic on the following VTC.
Important reproduction notes
This is against Varnish master + libvmod-dynamic master. I can only reproduce this when intentionally "slowing down" vmod_event with a sleep.
In my case in production, I originally hit the panic when using libvmod-redis; without its import in my VCL, the panic would not reproduce. I suspected some kind of race condition that only happens when vmod_event takes a while (since that would be the only side effect of loading libvmod-redis).
So, in order to reproduce the panic with the VTC below, add the following at the start of vmod_event in libvmod-dynamic/src/vmod_dynamic.c:
#include <unistd.h>
usleep(100000);
(Or run the following from libvmod-dynamic: sed -z -i -E 's|(vmod_event\(VRT_CTX, struct vmod_priv \*priv, enum vcl_event_e e\)\n\{\n)|\1\#include \<unistd.h\>\nusleep\(100000\)\;\n|g' src/vmod_dynamic.c)
This simulates a slowdown in vmod_event by sleeping for 100 ms.
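For reference, this is roughly what the sed command produces (only the first lines of the function are shown; the line with the return type above the signature is unchanged and omitted, and the rest of the body continues as before):

vmod_event(VRT_CTX, struct vmod_priv *priv, enum vcl_event_e e)
{
#include <unistd.h>
	usleep(100000);	/* 100 ms sleep to widen the race window */
	/* ... original vmod_event body continues unchanged ... */
}

The sed command puts the #include at block scope, right after the opening brace; that is unusual, but fine for a throwaway reproduction patch.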
Current Behavior
varnishtest "panic"
varnish v1 -vcl {
vcl 4.1;
backend local none;
} -start
varnish v1 -vcl {
vcl 4.1;
import std;
import dynamic;
import directors;
backend local none;
probe p1 {}
probe p2 {}
sub vcl_init {
new cluster_default = directors.fallback();
new r1 = dynamic.resolver();
new i1 = dynamic.director(port = {"80"}, ttl = 10s, probe = p1, resolver = r1.use(), ttl_from = dns);
cluster_default.add_backend(i1.backend(host={"example.com"}, port={"8000"}) );
new r2 = dynamic.resolver();
new i2 = dynamic.director(port = {"80"}, ttl = 10s, probe = p2, resolver = r2.use(), ttl_from = dns);
cluster_default.add_backend(i2.backend(host={"google.com"}, port={"8001"}) );
}
}
The panic is quite random, so it might take several tries:
while true; do varnishtest ./panic.vtc; done
The panic is:
Wrong turn at cache/cache_backend_probe.c:712:
scheduled
Backtrace:
ip=0x63557d52e435 sp=0x7fffaf88dde0 <VBT_format+0x35>
ip=0x63557d459773 sp=0x7fffaf88de00 <pan_backtrace+0x33>
ip=0x63557d459473 sp=0x7fffaf88de20 <pan_ic+0x3e3>
ip=0x63557d52d5f5 sp=0x7fffaf88dfa0 <VAS_Fail+0x55>
ip=0x63557d4180b1 sp=0x7fffaf88dff0 <VBP_Control+0x231>
ip=0x63557d41514f sp=0x7fffaf88e020 <vbe_dir_event+0x15f>
ip=0x63557d42ab8d sp=0x7fffaf88e050 <VDI_Event+0xbd>
ip=0x63557d474b1c sp=0x7fffaf88e070 <vcl_BackendEvent+0x16c>
ip=0x63557d473f6c sp=0x7fffaf88e0a0 <vcl_set_state+0x58c>
ip=0x63557d476cdf sp=0x7fffaf88e0f0 <vcl_load+0x3bf>
ip=0x63557d474f4d sp=0x7fffaf88e140 <vcl_cli_load+0xcd>
ip=0x63557d5327d2 sp=0x7fffaf88e170 <cls_dispatch+0x392>
ip=0x63557d5322bd sp=0x7fffaf88e1b0 <cls_exec+0x2fd>
ip=0x63557d53101f sp=0x7fffaf88e200 <cls_feed+0x58f>
ip=0x63557d5309ff sp=0x7fffaf88e270 <VCLS_Poll+0x3bf>
ip=0x63557d423b02 sp=0x7fffaf8902d0 <CLI_Run+0xf2>
ip=0x63557d4534f5 sp=0x7fffaf890300 <child_main+0x485>
ip=0x63557d4c81da sp=0x7fffaf890370 <mgt_launch_child+0x63a>
ip=0x63557d4c924e sp=0x7fffaf8903f0 <mch_cli_server_start+0x3e>
ip=0x63557d5327d2 sp=0x7fffaf890420 <cls_dispatch+0x392>
ip=0x63557d5322bd sp=0x7fffaf890460 <cls_exec+0x2fd>
ip=0x63557d53101f sp=0x7fffaf8904b0 <cls_feed+0x58f>
ip=0x63557d5309ff sp=0x7fffaf890520 <VCLS_Poll+0x3bf>
ip=0x63557d4ca335 sp=0x7fffaf892580 <mgt_cli_callback2+0x25>
ip=0x63557d534bbe sp=0x7fffaf8925a0 <VEV_Once+0x53e>
ip=0x63557d53465d sp=0x7fffaf8925f0 <VEV_Loop+0xed>
ip=0x63557d4d02f3 sp=0x7fffaf892620 <main+0x1ed3>
ip=0x7b288cde1e08 sp=0x7fffaf892870
ip=0x7b288cde1ecc sp=0x7fffaf892910 <__libc_start_main+0x8c>
ip=0x63557d40df35 sp=0x7fffaf892970 <_start+0x25>
But sometimes the VCL state is running instead of scheduled.
Possible Solution
No response
Steps to Reproduce (for bugs)
No response
Context
This was particularly hard to track down to a MWE VTC :stuck_out_tongue:
Varnish Cache version
varnishd master; libvmod-dynamic master + usleep patch from ticket
Operating system
No response
Source of binary packages used (if any)
No response
From my understanding, this could be related to https://github.com/varnishcache/varnish-cache/pull/4115?
cc @nigoroll
This is an excellent issue report, thank you! You even made an effort to patch vmod_dynamic with a cut&paste sed command...
(edit) The actual reproducer is a delay added for VCL_EVENT_WARM.
So what I think is happening here:
vmod_dynamic also creates domain directors (basically resolver threads) during vcl_init{}; this is the feature you are using, and it is vital to layering. These directors create backends whenever a name resolution completes.
If the VCL temperature is still VCL_TEMP_INIT when these backends get created, all is well: they get added to the director list and are sent a warm event once the VCL becomes warm, by this code:
https://github.com/varnishcache/varnish-cache/blob/e142af75decb675a554434a48e65e55bd0267f0c/bin/varnishd/cache/cache_vcl.c#L633-L637
But by adding the delay, we prolong vcl_send_event(), and the chances get (very) high that backends are created when the VCL is already warm, but before vcl_BackendEvent(). Then they receive two warm events and boom.
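To make the failure mode concrete, here is a minimal, self-contained sketch (invented names, not Varnish's actual probe code) of the kind of state check that produces the "Wrong turn" message in the backtrace: a probe that is already scheduled or running has no valid transition left when it is asked to start a second time, which is exactly what the second warm event does.

#include <assert.h>

enum probe_state { PROBE_COLD, PROBE_SCHEDULED, PROBE_RUNNING };

struct probe {
	enum probe_state state;
};

/* Hypothetical stand-in for the control path reached via
 * vcl_BackendEvent() -> VDI_Event() -> VBP_Control() in the backtrace;
 * enable is non-zero for a warm event. */
static void
probe_control(struct probe *p, int enable)
{
	if (enable) {
		switch (p->state) {
		case PROBE_COLD:
			p->state = PROBE_SCHEDULED;
			break;
		case PROBE_SCHEDULED:
		case PROBE_RUNNING:
		default:
			/* Second warm event for the same probe: no valid
			 * transition, so the only option is to fault. */
			assert(!"duplicate warm event for an active probe");
			break;
		}
	} else {
		p->state = PROBE_COLD;
	}
}

int
main(void)
{
	struct probe p = { PROBE_COLD };

	probe_control(&p, 1);	/* warm event at creation time: fine */
	probe_control(&p, 1);	/* warm event from vcl_BackendEvent(): aborts */
	return (0);
}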
This touches on #4142, but from the other end: here we would need a director list from "before the warm event", and only those directors would need to get the event sent...
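As a rough, self-contained illustration of that idea (all names here are invented for the sketch; this is not the approach as implemented in #4205 or #4259): snapshot which directors exist before the warm event goes out and deliver the backend warm event only to those, since directors created while the event is in flight are already born warm.

#include <stdio.h>

#define MAX_DIRS 16

/* Toy model of a VCL with its director list; purely illustrative. */
struct toy_vcl {
	int	warm;			/* current VCL temperature */
	int	n_dirs;			/* number of known directors */
	int	warm_events[MAX_DIRS];	/* warm events seen per director */
};

static void
create_director(struct toy_vcl *v)
{
	int i = v->n_dirs++;

	/* A director created while the VCL is already warm starts warm. */
	if (v->warm)
		v->warm_events[i]++;
}

static void
go_warm(struct toy_vcl *v)
{
	int i, snapshot;

	snapshot = v->n_dirs;	/* directors existing before the event */
	v->warm = 1;

	/* Simulate the slow vmod_event(): a resolver thread creates a
	 * new director while the warm event is still being delivered. */
	create_director(v);

	/* Deliver the backend warm event only to the snapshot; looping
	 * up to v->n_dirs would warm the new director a second time,
	 * i.e. the "Wrong turn" panic from the sketch above. */
	for (i = 0; i < snapshot; i++)
		v->warm_events[i]++;
}

int
main(void)
{
	struct toy_vcl v = { 0, 0, { 0 } };
	int i;

	create_director(&v);	/* created in vcl_init{}, VCL still cold */
	go_warm(&v);

	for (i = 0; i < v.n_dirs; i++)
		printf("director %d saw %d warm event(s)\n",
		    i, v.warm_events[i]);
	return (0);
}

With the snapshot, every director ends up seeing exactly one warm event, regardless of when it was created.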
@delthas does #4205 work for you?
Thanks for the patch! #4205 fixes the issue on my end; the test case passes. :+1:
Hi @delthas I spun another patch, which I hope now really covers all the cases. But how can I be sure? Can you please give #4259 a try?