vgz: Segmentation fault in deflateReset()
https://varnish-cache.org/lists/pipermail/varnish-misc/2022-November/027172.html
varnishd[15482]: Child (15492) Panic at: Tue, 22 Nov 2022 16:10:26 GMT
Wrong turn at cache/cache_main.c:284:
Signal 11 (Segmentation fault) received at 0xffff887fe000 si_code 2
version = varnish-6.0.11 revision a3bc025c2df28e4a76e10c2c41217c9864e9963b, vrt api = 7.1
ident = Linux,4.14.290-217.505.amzn2.aarch64,aarch64,-junix,-smalloc,-sfile,-sdefault,-hcritbit,epoll
now = 3120063.050288 (mono), 1669133426.954528 (real)
Backtrace:
0x43f5b4: /usr/sbin/varnishd() [0x43f5b4]
0x4a968c: /usr/sbin/varnishd(VAS_Fail+0x54) [0x4a968c]
0x43a350: /usr/sbin/varnishd() [0x43a350]
0xffff90fd7668: linux-vdso.so.1(__kernel_rt_sigreturn+0) [0xffff90fd7668]
0xffff90c8f540: /lib64/libc.so.6(memset+0x100) [0xffff90c8f540]
0x4b88f4: /usr/sbin/varnishd(deflateReset+0x48) [0x4b88f4]
0x42e924: /usr/sbin/varnishd(VGZ_NewGzip+0x88) [0x42e924]
0x42ebc0: /usr/sbin/varnishd() [0x42ebc0]
0x42df28: /usr/sbin/varnishd(VFP_Open+0x98) [0x42df28]
0x42b950: /usr/sbin/varnishd() [0x42b950]
thread = (cache-worker)
thr.req = (nil) {
},
thr.busyobj = 0xffff3d040020 {
end = 0xffff3d050000,
retries = 0,
sp = 0xffff3c241a20 {
fd = 28, vxid = 32945,
t_open = 1669133426.953970,
t_idle = 1669133426.953970,
ws = 0xffff3c241a60 {
id = \"ses\",
{s, f, r, e} = {0xffff3c241aa0, +96, (nil), +344},
},
transport = HTTP/1 {
state = HTTP1::Proc
}
client = 172.31.47.149 50812 :80,
},
worker = 0xffff832326c8 {
ws = 0xffff83232770 {
id = \"wrk\",
{s, f, r, e} = {0xffff83231e00, +0, (nil), +2040},
},
VCL::method = BACKEND_RESPONSE,
VCL::return = deliver,
VCL::methods = {BACKEND_FETCH, BACKEND_RESPONSE},
},
vfc = 0xffff3d041f30 {
failed = 0,
req = 0xffff3d040640,
resp = 0xffff3d040ab8,
wrk = 0xffff832326c8,
oc = 0xffff3b250640,
filters = {
gzip = 0xffff3d04a740 {
priv1 = (nil),
priv2 = 0,
closed = 0
},
V1F_STRAIGHT = 0xffff3d04a660 {
priv1 = 0xffff3d042600,
priv2 = 674132,
closed = 0
},
},
obj_flags = 0x0,
},
ws = 0xffff3d040058 {
id = \"bo\",
{s, f, r, e} = {0xffff3d041f78, +34832, (nil), +57472},
},
ws_bo = 0xffff3d0425e8,
http[bereq] = 0xffff3d040640 {
ws = 0xffff3d040058 {
[Already dumped, see above]
},
hdrs {
\"GET\",
\"/xxxxx/xxxxx/xxx/xxx.xxx?v=xxxxxxx\",
\"HTTP/1.1\",
\"X-Forwarded-Proto: https\",
\"X-Forwarded-Port: 443\",
\"Host: xxxxxx.xxxxxxxxxx.xxxxxxxxxxxxxxxxx.com\",
\"X-Amzn-Trace-Id: xxxx=x-xxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxx\",
\"User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/[107.0.0.0](http://107.0.0.0/) Safari/537.36\",
\"X-Amz-Cf-Id: xxxxxxx_xxxxxxxxxxxxx_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx==\",
\"Via: x.x xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxxxxxxxxx.xxx (Cloudfront)\",
\"Cookie: xxxxx=xxxxxxxxxxxxxxxxxxxxxxxxxx; __xxxx=xxxxxxxx.xxxxxxxxx.xxxxxxxxxx.xxxxxxxxxx.xxxxxxxxxx.xx; __xxxx=xxxxxxxx.xx.xx.xxxxxxxxxx; __xxxx=xxxxxxxx; __xxxx=x; __xxxx_x=x; __xxxx=xxxxxxxx.xxxxxxxxxx.x.x.xxxxxx=(xxxxxx)|xxxxxx=(xxxxxx)|xxxxxx=(xxxx); _xxxx=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx; _xx=xxx.x.xxxxxxxxx.xxxxxxxxxx; _xx_xxxxxxxxxx=xxx.x.xxxxxxxxxx.xx.x.xxxxxxxxxx.xx.x.x; _xxx_xx=x.x.xxxxxxxxx.xxxxxxxxxx; _xxxxxxx_xxxx=%xx; _xxxx_xxxxxxxx=; _xxxxxxx_x=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx; _x=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx; xx_xxxxxxx=%xx%xxxxxxxxx_xx%xx%xx%xxxxxx%xx%xx%xxxxxxxxx_xxxx%xx%xx%xxxxxxxxx%xx%xx%xxxxxxxxx_xxxx%xx%xx%xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx%xx%xx\",
\"Accept-Language: en-US,en;q=0.9,fr;q=0.8\",
\"Accept: text/css,*/*;q=0.1\",
\"Referer: xxxxx://xxxxxx.xxxxxxxxxx.xxxxxxxxxxxxxxxxx.xxx/xxxxxxxxxxxxxxxx/xxxxx/xxxxxxxx/xxxx/xx%xx",
\"Accept-Encoding: gzip, deflate, br\",
\"pragma: no-cache\",
\"cache-control: no-cache\",
\"sec-gpc: 1\",
\"sec-fetch-site: same-origin\",
\"sec-fetch-mode: no-cors\",
\"sec-fetch-dest: style\",
\"X-Forwarded-For: xx.xxx.xxx.x, xx.xxx.xx.xxx, xxx.xx.xx.xxx\",
\"X-Varnish: 32947\",
},
},
http[beresp] = 0xffff3d040ab8 {
ws = 0xffff3d040058 {
[Already dumped, see above]
},
hdrs {
\"HTTP/1.1\",
\"200\",
\"OK\",
\"Server: nginx/1.20.0\",
\"Date: Tue, 22 Nov 2022 16:10:26 GMT\",
\"Content-Type: text/css\",
\"Content-Length: 674132\",
\"Last-Modified: Tue, 22 Nov 2022 15:28:13 GMT\",
\"Connection: keep-alive\",
\"Expires: Wed, 22 Nov 2023 16:10:26 GMT\",
\"Cache-Control: max-age=31536000, public, no-transform\",
\"Accept-Ranges: bytes\",
\"X-Url: /xxxxx/xxxxx/xxx/xxx.xxx?x=xxxxxxxx\",
\"X-Host: xxxxxx.xxxxxxxxxx.xxxxxxxxxxxxxxxxx.xxx/)\",
},
},
objcore[fetch] = 0xffff3b250640 {
refcnt = 2,
flags = {busy, hfm, private},
exp_flags = {},
boc = 0xffff3b260160 {
refcnt = 2,
state = req_done,
vary = (nil),
stevedore_priv = (nil),
},
exp = {1669133426.954405, -1.000000, 300.000000, 0.000000},
objhead = 0xffff907d01d0,
stevedore = (nil),
},
http_conn = 0xffff3d042600 {
fd = 30 (@0xffff3be704e4),
doclose = NULL,
ws = 0xffff3d040058 {
[Already dumped, see above]
},
{rxbuf_b, rxbuf_e} = {0xffff3d042660, 0xffff3d0427a8},
{pipeline_b, pipeline_e} = {0xffff3d0427a8, 0xffff3d04a660},
content_length = 674132,
body_status = length,
first_byte_timeout = 180.000000,
between_bytes_timeout = 180.000000,
},
flags = {do_gzip, do_stream, do_pass, uncacheable},
director_req = 0xffff8ffc0190 {
vcl_name = default,
health = healthy,
admin_health = probe, changed = 1669131098.893551,
type = backend {
display_name = boot.default,
ipv4 = 127.0.0.1,
port = 8080,
hosthdr = 127.0.0.1,
n_conn = 1,
},
},
director_resp = director_req,
vcl = {
name = \"boot\",
busy = 5,
discard = 0,
state = auto,
temp = warm,
conf = {
syntax = \"40\",
srcname = {
\"/etc/varnish/default.vcl\",
\"Builtin\",
},
},
},
},
vmods = {
std = {Varnish 6.0.11 a3bc025c2df28e4a76e10c2c41217c9864e9963b, 0.0},
},
@Dridi let me know if you need any further information etc
@Dridi thanks for pushing this over to Github.
We appreciate that bugs like this are fixed as people have time, but we were wondering: how often are releases made for 6.x?
We've got a project to launch but this segmentation fault is preventing us from going live.
If you have any idea of the timeframe (e.g. days, weeks, months), that would allow us to plan and potentially consider workarounds (such as trying a different Varnish Cache version).
Thank you.
How frequently is this happening?
If it is a recurring panic, are you able to reproduce it?
Or could you give the 7.2.1 release a try? Confirming whether this happens with the latest would be really helpful.
It's very sporadic unfortunately, so we haven't been able to reproduce it in any meaningful way.
We're deploying our application on AWS and have to browse around a lot of pages and make lots of requests before we're able to trigger it.
Our issue is that we're testing this on a staging environment, so without pushing to production and testing under real load we're really unsure how frequently this is going to happen.
We'll check if the AWS AMIs we are using have access to 7.x, but I'm guessing they might not, as otherwise we would likely have used the latest version available.
Do you have an idea of what the issue might be from the panic above?
No idea yet, I only had a brief look at the code.
@lrotherfield-function is away until Monday, he'll try to test upgrading to 7.2.x next week!
@MikeParkin @Dridi I will look to get 7.2.1 installed. I installed 6.0.11 as it is the most recent LTS and is currently still supported.
Hey @Dridi
I installed Varnish 7.2.1 and unfortunately I still got what I believe is the same issue:
Child (15401) Panic at: Wed, 30 Nov 2022 07:21:01 GMT
Wrong turn at cache/cache_main.c:323:
Signal 11 (Segmentation fault) received at 0xffff767fe000 si_code 2
version = varnish-7.2.1 revision 399fa7ed46d68f4c4f42a8fd2748339750d99a8b, vrt api = 16.0
ident = Linux,4.14.290-217.505.amzn2.aarch64,aarch64,-junix,-smalloc,-sdefault,-hcritbit,epoll
now = 135068.821812 (mono), 1669792861.746616 (real)
Backtrace:
0x449230: /usr/sbin/varnishd() [0x449230]
0x4c41dc: /usr/sbin/varnishd(VAS_Fail+0x58) [0x4c41dc]
0x443d48: /usr/sbin/varnishd() [0x443d48]
0xffff8d2d1668: linux-vdso.so.1(__kernel_rt_sigreturn+0) [0xffff8d2d1668]
0xffff8cf59540: /lib64/libc.so.6(memset+0x100) [0xffff8cf59540]
0x4d7188: /usr/sbin/varnishd(deflateReset+0x48) [0x4d7188]
0x4378f0: /usr/sbin/varnishd(VGZ_NewGzip+0x88) [0x4378f0]
0x437ba4: /usr/sbin/varnishd() [0x437ba4]
0x436e84: /usr/sbin/varnishd(VFP_Open+0xbc) [0x436e84]
0x433ec4: /usr/sbin/varnishd() [0x433ec4]
^ Let me know if you need more of the output than this; I can provide it all if required.
Below is my varnish.service file if that is helpful in the debugging:
[Unit]
Description=Varnish Cache, a high-performance HTTP accelerator
After=network-online.target nss-lookup.target
[Service]
Type=forking
KillMode=mixed
# Maximum number of open files (for ulimit -n)
LimitNOFILE=131072
# Shared memory (VSM) segments are tentatively locked in memory. The
# default value for vsl_space (or shorthand varnishd -l option) is 80MB.
# There are other types of segments that would benefit from allowing
# more memory to be locked.
LimitMEMLOCK=100M
# Enable this to avoid "fork failed" on reload.
TasksMax=infinity
# Maximum size of the corefile.
LimitCORE=infinity
# A PID file makes the main process selection deterministic.
RuntimeDirectory=%N
PIDFile=%t/%N/varnishd.pid
# Maximum size of the corefile.
ExecStart=/usr/sbin/varnishd \
-a :80 \
-a localhost:8443,PROXY \
-T 127.0.0.1:6082 \
-P %t/%N/varnishd.pid \
-p feature=+http2 \
-p thread_pool_stack=512k \
-f /etc/varnish/default.vcl \
-s malloc,2G
ExecReload=/usr/sbin/varnishreload
[Install]
WantedBy=multi-user.target
A bit of additional information that is maybe not relevant but just in case: When I upgraded to 7.2.1 from 6.0.11 I started getting the following panic until I removed the startup option that was creating a filesystem storage backend:
-s static=file,/var/lib/varnish/varnish_storage.bin,1G
Child (1674) Panic at: Mon, 28 Nov 2022 17:26:56 GMT
Assert error in sml_trimstore(), storage/storage_simple.c line 542:
Condition((oc->boc->stevedore_priv) == 0) not true.
version = varnish-7.2.1 revision 399fa7ed46d68f4c4f42a8fd2748339750d99a8b, vrt api = 16.0
ident = Linux,4.14.290-217.505.amzn2.aarch64,aarch64,-junix,-smalloc,-sfile,-sdefault,-hcritbit,epoll
now = 291.299220 (mono), 1669656416.205785 (real)
Backtrace:
0x449230: /usr/sbin/varnishd() [0x449230]
0x4c41dc: /usr/sbin/varnishd(VAS_Fail+0x58) [0x4c41dc]
0x4a34f0: /usr/sbin/varnishd() [0x4a34f0]
0x445f8c: /usr/sbin/varnishd(ObjExtend+0x108) [0x445f8c]
0x433b10: /usr/sbin/varnishd() [0x433b10]
0x4a392c: /usr/sbin/varnishd() [0x4a392c]
0x445c04: /usr/sbin/varnishd(ObjIterate+0x64) [0x445c04]
0x433870: /usr/sbin/varnishd() [0x433870]
0x435b60: /usr/sbin/varnishd() [0x435b60]
0x471034: /usr/sbin/varnishd() [0x471034]
This was not happening in 6.0.11
Thanks a lot, this is really helpful. The other panic seems to be the same as #3816, which I happen to have labeled for bugwash just yesterday. It's already been a couple of bugwash sessions since #3816 got updated without us having a look at the new data.
Thanks @Dridi. We’re blocked now on our project launch, so if there is a way to fix it or if you have any ideas for a workaround that would be massively appreciated.
Hi @Dridi
I changed my instance type on AWS and I get a slightly different backtrace although deflate is still in it:
Child (1643) Panic at: Wed, 30 Nov 2022 21:37:44 GMT
Wrong turn at cache/cache_main.c:323:
Signal 11 (Segmentation fault) received at 0xffff8e3fe000 si_code 2
version = varnish-7.2.1 revision 399fa7ed46d68f4c4f42a8fd2748339750d99a8b, vrt api = 16.0
ident = Linux,4.14.290-217.505.amzn2.aarch64,aarch64,-junix,-smalloc,-sdefault,-hcritbit,epoll
now = 13437.401452 (mono), 1669844264.362791 (real)
Backtrace:
0x449230: /usr/sbin/varnishd() [0x449230]
0x4c41dc: /usr/sbin/varnishd(VAS_Fail+0x58) [0x4c41dc]
0x443d48: /usr/sbin/varnishd() [0x443d48]
0xffffa4e5d668: linux-vdso.so.1(__kernel_rt_sigreturn+0) [0xffffa4e5d668]
0xffffa4aecd88: /lib64/libc.so.6(+0x85d88) [0xffffa4aecd88]
0x4d5d2c: /usr/sbin/varnishd() [0x4d5d2c]
0x4d643c: /usr/sbin/varnishd() [0x4d643c]
0x4d692c: /usr/sbin/varnishd() [0x4d692c]
0x4d7450: /usr/sbin/varnishd(deflate+0x260) [0x4d7450]
0x438608: /usr/sbin/varnishd(VGZ_Gzip+0xa0) [0x438608]
This was done on an m6g.large EC2 instance, whereas the previous traces are from a t4g.xlarge. I don't know if that trace changes anything or helps narrow anything down. If you have any suggestions, that would be great.
I am going to try a c7g instance as they have graviton 3 processors rather than graviton 2, I have no idea if that would help but I am willing to try most things right now :).
Would you rebuild the latest 6.0 release from source to hopefully get more meaningful backtraces?
@Dridi yes that is fine, is there anything I should do other than follow https://varnish-cache.org/docs/6.6/installation/install_source.html to make sure the backtraces are more verbose?
These instructions should be fine to grab build dependencies, except that:
- you shouldn't work from a git clone
- download https://varnish-cache.org/downloads/varnish-6.0.11.tgz instead
- you should skip the autogen part and go straight to configure (a sketch of the full sequence follows below)
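Putting that together, a minimal sketch of the build could look like this (the default install prefix is assumed; adjust paths to your environment):
curl -O https://varnish-cache.org/downloads/varnish-6.0.11.tgz
tar xzf varnish-6.0.11.tgz
cd varnish-6.0.11
./configure        # release tarballs ship a configure script, so autogen.sh is not needed
make
make check         # optional: run the test suite
sudo make install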
Hey @Dridi, I ran make check and pretty much everything failed after the aXXXXX.vtc tests. Here is the failing part of the first failing test, B00000:
*** v1 0.3 CLI RX 200
**** v1 0.3 CLI RX|VCL compiled.
**** v1 0.3 CLI TX|vcl.use vcl1
*** v1 0.3 CLI RX 200
** v1 0.3 Start
**** v1 0.3 CLI TX|start
*** v1 0.3 debug|Debug: Child (18867) Started
*** v1 0.3 debug|Error: Child (18867) not responding to CLI, killed it.
*** v1 0.3 debug|Error: Child (18867) Pushing vcls failed:
*** v1 0.3 debug|CLI communication error (hdr)
*** v1 0.3 debug|Debug: Stopping Child
*** v1 0.3 debug|Error: Child (18867) died signal=6
*** v1 0.3 debug|Error: Child (18867) Panic at: Tue, 06 Dec 2022 19:59:02 GMT
*** v1 0.3 debug|Wrong turn at cache/cache_main.c:284:
*** v1 0.3 debug|Signal 6 (Aborted) received at 0x37e000049b3 si_code -6
*** v1 0.3 debug|version = varnish-6.0.11 revision a3bc025c2df28e4a76e10c2c41217c9864e9963b, vrt api = 7.1
*** v1 0.3 debug|ident = Linux,4.14.290-217.505.amzn2.aarch64,aarch64,-junix,-sdefault,-sdefault,-hcritbit,epoll
*** v1 0.3 debug|now = 110735.563847 (mono), 1670356742.553192 (real)
*** v1 0.3 debug|Backtrace:
*** v1 0.3 debug| 0x440070: varnishd() [0x440070]
*** v1 0.3 debug| 0x4ab014: varnishd(VAS_Fail+0x54) [0x4ab014]
*** v1 0.3 debug| 0x43aba4: varnishd() [0x43aba4]
*** v1 0.3 debug| 0xffff9a104668: linux-vdso.so.1(__kernel_rt_sigreturn+0) [0xffff9a104668]
*** v1 0.3 debug| 0xffff99d70530: /lib64/libc.so.6(gsignal+0xb0) [0xffff99d70530]
*** v1 0.3 debug| 0xffff99d71d74: /lib64/libc.so.6(abort+0x154) [0xffff99d71d74]
*** v1 0.3 debug| 0xffff9a093414: /lib64/libjemalloc.so.1(+0x11414) [0xffff9a093414]
*** v1 0.3 debug| 0xffff9a093524: /lib64/libjemalloc.so.1(+0x11524) [0xffff9a093524]
*** v1 0.3 debug| 0xffff9a093064: /lib64/libjemalloc.so.1(+0x11064) [0xffff9a093064]
*** v1 0.3 debug| 0xffff9a08db34: /lib64/libjemalloc.so.1(+0xbb34) [0xffff9a08db34]
*** v1 0.3 debug|errno = 22 (Invalid argument)
*** v1 0.3 debug|thread = (pool_herder)
*** v1 0.3 debug|thr.req = (nil) {
*** v1 0.3 debug|},
*** v1 0.3 debug|thr.busyobj = (nil) {
*** v1 0.3 debug|},
*** v1 0.3 debug|vmods = {
*** v1 0.3 debug|},
*** v1 0.3 debug|
*** v1 0.3 debug|
*** v1 0.3 debug|Info: Child (18867) said Child starts
*** v1 0.3 debug|Info: Child (18867) said <jemalloc>: Error in munmap(): Invalid argument
*** v1 0.3 debug|Debug: Child cleanup complete
I'm not exactly sure what caused the panic, but it looks like there is a jemalloc issue. Is that something that could be causing our other issues?
Here is the test summary:
============================================================================
Testsuite summary for Varnish 6.0.11
============================================================================
# TOTAL: 789
# PASS: 63
# SKIP: 26
# XFAIL: 0
# FAIL: 700
# XPASS: 0
# ERROR: 0
Can you please rebuild from source and at configure time use the following flags?
./configure --without-jemalloc --with-unwind
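Once that build is installed, a quick sanity check (a sketch, assuming the binary ends up under /usr/local/sbin) is to confirm jemalloc is no longer linked in:
ldd /usr/local/sbin/varnishd | grep jemalloc    # no output means the build is jemalloc-free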
@Dridi I built Varnish 6 from source on both servers yesterday and have not had a panic as of yet. I will leave the servers running for another couple of days with requests being made to make sure, but it is currently looking good.
It might be time to disable jemalloc support by default. The one version we know works really well with Varnish is 3.6, but on more recent operating systems shipping newer versions of jemalloc, it seems to show up in multiple seemingly unrelated tickets.
yum list jemalloc
Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
223 packages excluded due to repository priority protections
Installed Packages
jemalloc.aarch64 3.6.0-1.el7 @epel
:D we have 3.6 installed unfortunately
Let me rephrase then: 3.6 on x86_64 is what we know for sure works well.
@Dridi if this does turn out to be fine without Jemalloc I assume I will have to build from source and can't use the repositories?
I don't do much package building so I just want to make sure I understand the implications of this. Our servers are built by ElasticBeanstalk so I am going to have to write a script to build varnish if that is the case.
Let's wait a little longer to see whether you run into the panic again, meanwhile I submitted #3881 for review and intend to apply it to 6.0 as well if that reaches consensus.
Thank you for your patience and your help.
@Dridi looking at the logs from the past 4 days I cannot see any panics. Is there anything we can do to get #3881 accepted? I am trying to get an idea of when I might be able to get my current project live :)
For now I am rebuilding the stack using x86_64 processor instances. Let me know if there are any other tests I can run to help get arm processors supported
Could you please try another build with jemalloc 3.6?
./configure --with-jemalloc --with-unwind --enable-debugging-symbols
And then run varnishd with this jemalloc configuration in your environment:
export MALLOC_CONF="abort:true,redzone:true,junk:true"
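If varnishd is started by the varnish.service unit posted earlier rather than from an interactive shell, the variable needs to reach the daemon's environment; one hypothetical way to do that is a systemd drop-in (the drop-in file name here is an arbitrary choice):
sudo mkdir -p /etc/systemd/system/varnish.service.d
printf '[Service]\nEnvironment="MALLOC_CONF=abort:true,redzone:true,junk:true"\n' | sudo tee /etc/systemd/system/varnish.service.d/jemalloc.conf
sudo systemctl daemon-reload
sudo systemctl restart varnish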