503 from (probably) failed allocation after upgrade from 6.x to 7.7
Since updating from 6.x to 7.7.0 (Debian testing) we are getting a spew of 503s that starts days after the cache has filled up.
/usr/sbin/varnishd -j unix,user=vcache -F -a 0.0.0.0:80 -T localhost:6082 -f /etc/varnish/default.vcl -l 81920k -p thread_pool_min=80 -p thread_pool_max=4000 -p thread_pools=2 -p default_ttl=1209600 -p nuke_limit=1000 -p max_restarts=4 -S /etc/varnish/secret -s default=malloc,8569M -s mm=malloc,17139M
The logs look like this:
* << Request >> 583226464
- Begin req 583226327 rxreq
- Timestamp Start: 1761649198.288374 0.000000 0.000000
- Timestamp Req: 1761649198.288374 0.000000 0.000000
- VCL_use boot
- ReqStart 10.0.0.1 64710 a0
- ReqMethod GET
- ReqURL /video/abcd.mp4
- ReqProtocol HTTP/1.1
- ReqHeader host: example.com
- ReqHeader sec-ch-ua-platform: ...
- ReqHeader accept-encoding: ...
- ReqHeader user-agent: ...
- ReqHeader accept: */*
- ReqHeader range: bytes=44325-15139796
- ReqHeader if-range: "063fccff745ae4b743f8053ba65c0f7d-2"
- ReqHeader priority: i
- ReqHeader x-forwarded-for: 1.2.3.4
- ReqUnset x-forwarded-for: 1.2.3.4
- ReqHeader X-Forwarded-For: 1.2.3.4, 10.0.1.1
- ReqHeader Via: 1.1 cache3 (Varnish/7.7)
- VCL_call RECV
- ReqUnset accept-encoding: identity;q=1, *;q=0
- ReqURL /video/abcd.mp4
- VCL_return hash
- VCL_call HASH
- VCL_return lookup
- VCL_call MISS
- VCL_return fetch
- Link bereq 583226465 fetch
- Timestamp Fetch: 1761649198.310064 0.021689 0.021689
- RespProtocol HTTP/1.1
- RespStatus 503
- RespReason Service Unavailable
- RespHeader Date: Tue, 28 Oct 2025 10:59:58 GMT
- RespHeader Server: Varnish
- RespHeader X-Varnish: 583226464
- VCL_call SYNTH
- RespHeader Content-Type: text/html; charset=utf-8
- RespHeader Retry-After: 5
- VCL_return deliver
- Timestamp Process: 1761649198.310122 0.021748 0.000058
- RespHeader Content-Length: 283
- Storage malloc Transient
- Filters
- RespHeader Connection: keep-alive
- Timestamp Resp: 1761649198.310186 0.021811 0.000063
- ReqAcct 711 0 711 213 283 496
- End
When requests started failing I noticed both a rise in allocation failures in the mm pool (which handles the big objects) and a drop in LRU nukes.
So far this has only happened for big objects (I have not run it long enough to see it on the other pool), and only after a few days of running once the pool had filled up.
Steps to Reproduce (for bugs)
No idea; we run it as a front cache for S3, serving files ranging from small CSS to video.
Varnish Cache version
7.7.0 (Debian testing)
Operating system
No response
This has been discussed during bugwash. Some questions/notes:
There are two cases in which c_fail of a malloc stevedore is increased:
1. the cache was full and evicting objects up to nuke_limit did not make enough room for the allocation;
2. the underlying malloc() call itself failed to return memory.
We cannot directly differentiate the two cases by the counter alone. The second case, however, would be root-caused in jemalloc not returning memory, and in many other places where we call malloc(), the call is followed by an assertion. So if we were looking at a regression in jemalloc or its integration, you would very likely see panics, because other allocation requests would also be likely to fail.
Assuming that you do not see panics but only the 503s, the next probable cause is that the cache was full and not enough room could be made before nuke_limit was reached.
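That failure mode can be sketched in a few lines (a simplified model, not Varnish code; the object counts and sizes are made up):

```python
from collections import OrderedDict

NUKE_LIMIT = 1000  # mirrors the -p nuke_limit=1000 from the command line above

def try_allocate(lru, capacity, used, need):
    """Simplified model of a malloc stevedore allocation: evict objects
    from the LRU end, at most NUKE_LIMIT of them, until `need` bytes fit.
    Returns (success, new_used, objects_nuked)."""
    nuked = 0
    while used + need > capacity:
        if not lru or nuked == NUKE_LIMIT:
            return False, used, nuked  # allocation fails -> c_fail++, 503
        _, size = lru.popitem(last=False)  # drop least-recently-used object
        used -= size
        nuked += 1
    return True, used + need, nuked

# A pool full of 1 kB objects cannot admit a 1.5 MB object:
# that would need 1500 evictions, 500 more than the nuke limit allows.
full = OrderedDict((i, 1024) for i in range(2048))
ok, _, nuked = try_allocate(full, capacity=2048 * 1024,
                            used=2048 * 1024, need=1500 * 1024)
```

Here `ok` comes out False with `nuked` at 1000: the allocation gives up exactly at the limit, while a 500 kB object in the same state would still succeed.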
It would be very helpful to get a backend varnishlog (-b or -g request option) of the failing backend request to clarify that this was indeed the case.
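One possible invocation for capturing that evidence (a sketch; the query and output path are just examples, and a storage failure would typically surface as a FetchError record in the backend transaction):

```sh
# Record only backend transactions, grouped per request, that ended in
# a 503 or carry a FetchError record, into a file for later analysis.
varnishlog -b -g request -q 'BerespStatus == 503 or FetchError' -w /var/tmp/fail.vsl

# Read the captured transactions back later:
varnishlog -r /var/tmp/fail.vsl
```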
This behavior has not changed from 6.x to 7.7, at least not significantly and at least not that we are aware of. Hence we suspect that the reason you are seeing these errors is not a change in the code base but rather a change in the usage pattern.
That said, there are some things you could consider checking/changing/improving:
At the marked time on your graphs, the lru_nuked rate drops to zero. This probably means that there were no objects left to remove from cache, which in turn means that all objects were "held onto" by ongoing requests. This could simply mean that your cache is too small for your workload and/or objects too big, or it could mean that you are not "seeing" client disconnects early enough. In this context, consider lowering send_timeout and/or idle_send_timeout and configuring the tcp_keepalive_* parameters.
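For example (values are illustrative only, not recommendations; all of these parameters can be changed at runtime):

```sh
varnishadm param.set send_timeout 120        # default 600 s: total delivery budget
varnishadm param.set idle_send_timeout 30    # default 60 s: stalled-client timeout
varnishadm param.set tcp_keepalive_time 60   # start probing idle connections sooner
varnishadm param.set tcp_keepalive_probes 3
varnishadm param.set tcp_keepalive_intvl 10
```

Lowering these lets Varnish notice dead clients sooner, releasing the object references that would otherwise pin objects against LRU nuking.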
You are running in an "LRU only" configuration, with long TTLs and virtually no ttl-based expiry. The built-in storages have a fundamental design problem around nuke_limit, which is addressed by alternative storage engines, one of which is SLASH/. Its documentation has a section on the LRU fairness problem.
As a special case, note that the malloc storage engine supports "larger than RAM" objects only for uncacheable objects and, for most real use cases, only if transit_buffer is configured. So one specific cause of the issue you are seeing could be that you are trying to serve objects which are larger than (a significant portion of) your storage. As alternative options, file storage allows you to serve objects larger than RAM by using kernel-provided paging to disk/SSD, and fellow from the aforementioned SLASH/ project achieves the same goal with a second LRU layer and also offers persistent storage.
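Sketches of the two stock options mentioned above (the path, sizes, and buffer value are hypothetical):

```sh
# file storage: object bodies live in a memory-mapped file and the
# kernel pages them to disk/SSD, so objects may exceed RAM
varnishd ... -s disk=file,/data/varnish/storage.bin,200G

# transit_buffer: cap the memory held per in-flight uncacheable fetch,
# so a "larger than RAM" pass-through does not pin the whole body
varnishd ... -p transit_buffer=1M
```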
I'll try to look at the logs and try some of the proposed options after I get back from vacation, thanks for your time :)
Objects are basically movie trailers, so nothing remotely close to the size of the memory, and we divided the RAM into two pools to avoid the "one big object has to evict tens of thousands of small objects" problem that otherwise happened.
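The effect that split avoids is easy to quantify; with made-up but plausible sizes (the real averages are not given in this thread):

```python
# Hypothetical sizes -- not taken from the actual workload.
trailer = 50 * 2**20        # one ~50 MiB movie trailer
small_asset = 50 * 2**10    # one ~50 KiB CSS/image object

# Admitting one trailer into a full shared pool must evict roughly:
evictions = -(-trailer // small_asset)  # ceiling division
```

`evictions` comes out at 1024, so a single big object already exceeds the nuke_limit=1000 budget from the command line above if the pool is full of small assets.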
Assuming that you do not see panics but only the 503s, the next probable cause is that the cache was full and not enough room could be made before nuke_limit was reached.
I did not see them at the moment of writing, but since then there has been another event where it did panic.
There were ~250 sessions running concurrently at that time, but given the hit rate I'd expect most of them were on the same small set of recent objects.
Regarding the original issue of this ticket: You could have run into the issue if concurrent requests delivered objects with a total size exceeding the memory.
The panic in the log is:
Nov 2 21:53:36 cache3 varnishd[419114]: Error: Child (868913) Panic at: Sun, 02 Nov 2025 20:53:36 GMT
Nov 2 21:53:36 cache3 varnishd[419114]: Assert error in VBF_Fetch(), cache/cache_fetch.c line 1245:
Nov 2 21:53:36 cache3 varnishd[419114]: Condition(((oc->flags & OC_F_FAILED)) != 0) not true.
... which is unrelated and very likely the issue fixed with f03919325dc75d3cae8c80d66209d4f174469381
@nigoroll Hello, I have some questions about the cache eviction (LRU) mechanism in Varnish 7.5.
Varnish Configuration Details (Varnish 7.5)
1.1 /usr/lib/systemd/system/varnish.service
[Unit]
Description=Varnish Cache, a high-performance HTTP accelerator
After=network-online.target nss-lookup.target
[Service]
Type=forking
KillMode=mixed
LimitNOFILE=131072
# ... (other OS limits)
ExecStart=/usr/sbin/varnishd \
-a :60081 \
-T :60082 \
-S /etc/varnish/secret \
-a localhost:8443,PROXY \
-p thread_pools=4 \
-p nuke_limit=5000 \
-p shortlived=0.100 \
-p thread_pool_min=100 \
-p thread_pool_max=5000 \
-p thread_pool_timeout=120 \
-p tcp_keepalive_probes=3 \
-p http_range_support=on \
-f /etc/varnish/business.vcl \
-P /run/varnish/varnishd.pid \
-p feature=+http2 \
-p vsl_space=1G \
-s mem=malloc,20G \
-s disk=file,/data1/data/services/varnish/storage.bin,50G
ExecReload=/usr/sbin/varnishreload
[Install]
WantedBy=multi-user.target
1.2 varnishadm param.show:
accept_filter -
acceptor_sleep_decay 0.9 (default)
acceptor_sleep_incr 0.000 [seconds] (default)
acceptor_sleep_max 0.050 [seconds] (default)
auto_restart on [bool] (default)
backend_idle_timeout 60.000 [seconds] (default)
backend_local_error_holddown 10.000 [seconds] (default)
backend_remote_error_holddown 0.250 [seconds] (default)
ban_cutoff 0 [bans] (default)
ban_dups on [bool] (default)
ban_lurker_age 60.000 [seconds] (default)
ban_lurker_batch 1000 (default)
ban_lurker_holdoff 0.010 [seconds] (default)
ban_lurker_sleep 0.010 [seconds] (default)
between_bytes_timeout 60.000 [seconds] (default)
cc_command exec gcc -std=gnu11 -std=gnu11 -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -DZ_PREFIX -fno-var-tracking-assignments %w -pthread -fpic -shared -Wl,-x -o %o %s (default)
cc_warnings -Wall -Werror -Wno-error=unused-result (default)
cli_limit 64k [bytes] (default)
cli_timeout 60.000 [seconds] (default)
clock_skew 10 [seconds] (default)
clock_step 1.000 [seconds] (default)
connect_timeout 3.500 [seconds] (default)
critbit_cooloff 180.000 [seconds] (default)
debug none (default)
default_grace 10.000 [seconds] (default)
default_keep 0.000 [seconds] (default)
default_ttl 120.000 [seconds] (default)
experimental none (default)
feature none,+http2,+validate_headers,+vcl_req_reset
fetch_chunksize 16k [bytes] (default)
fetch_maxchunksize 0.25G [bytes] (default)
first_byte_timeout 60.000 [seconds] (default)
gzip_buffer 32k [bytes] (default)
gzip_level 6 (default)
gzip_memlevel 8 (default)
h2_header_table_size 4k [bytes] (default)
h2_initial_window_size 65535b [bytes] (default)
h2_max_concurrent_streams 100 [streams] (default)
h2_max_frame_size 16k [bytes] (default)
h2_max_header_list_size 2147483647b [bytes] (default)
h2_rapid_reset 1.000 [seconds] (default)
h2_rapid_reset_limit 100 (default)
h2_rapid_reset_period 60.000 [seconds] (default)
h2_rx_window_increment 1M [bytes] (default)
h2_rx_window_low_water 10M [bytes] (default)
h2_rxbuf_storage Transient (default)
h2_window_timeout 5.000 [seconds] (default)
http1_iovs 64 [struct iovec (=16 bytes)] (default)
http_gzip_support on [bool] (default)
http_max_hdr 64 [header lines] (default)
http_range_support on [bool] (default)
http_req_hdr_len 8k [bytes] (default)
http_req_size 32k [bytes] (default)
http_resp_hdr_len 8k [bytes] (default)
http_resp_size 32k [bytes] (default)
idle_send_timeout 60.000 [seconds] (default)
listen_depth 1024 [connections] (default)
lru_interval 2.000 [seconds] (default)
max_esi_depth 5 [levels] (default)
max_restarts 4 [restarts] (default)
max_retries 4 [retries] (default)
max_vcl 100 (default)
max_vcl_handling 1 (default)
nuke_limit 5000 [allocations]
pcre2_depth_limit 20 (default)
pcre2_jit_compilation on [bool] (default)
pcre2_match_limit 10000 (default)
ping_interval 3 [seconds] (default)
pipe_sess_max 0 [connections] (default)
pipe_task_deadline 0.000 [seconds] (default)
pipe_timeout 60.000 [seconds] (default)
pool_req 10,100,10 (default)
pool_sess 10,100,10 (default)
pool_vbo 10,100,10 (default)
prefer_ipv6 off [bool] (default)
rush_exponent 3 [requests per request] (default)
send_timeout 600.000 [seconds] (default)
shortlived 0.100 [seconds]
sigsegv_handler on [bool] (default)
startup_timeout 0.000 [seconds] (default)
syslog_cli_traffic on [bool] (default)
tcp_fastopen off [bool] (default)
tcp_keepalive_intvl 75.000 [seconds] (default)
tcp_keepalive_probes 3 [probes]
tcp_keepalive_time 60.000 [seconds] (default)
thread_pool_add_delay 0.000 [seconds] (default)
thread_pool_destroy_delay 1.000 [seconds] (default)
thread_pool_fail_delay 0.200 [seconds] (default)
thread_pool_max 5000 [threads] (default)
thread_pool_min 100 [threads] (default)
thread_pool_reserve 0 [threads] (default)
thread_pool_stack 80k [bytes] (default)
thread_pool_timeout 120.000 [seconds]
thread_pool_watchdog 60.000 [seconds] (default)
thread_pools 4 [pools]
thread_queue_limit 20 [requests] (default)
thread_stats_rate 10 [requests] (default)
timeout_idle 5.000 [seconds] (default)
timeout_linger 0.050 [seconds] (default)
transit_buffer 0b [bytes] (default)
vary_notice 10 [variants] (default)
vcc_feature none,+err_unref,+unsafe_path (default)
vcl_cooldown 600.000 [seconds] (default)
vcl_path /etc/varnish:/usr/share/varnish/vcl (default)
vmod_path /usr/lib64/varnish/vmods (default)
vsl_buffer 16k [bytes] (default)
vsl_mask all,-Debug,-ObjProtocol,-ObjStatus,-ObjReason,-ObjHeader,-ExpKill,-WorkThread,-Hash,-VfpAcct,-H2RxHdr,-H2RxBody,-H2TxHdr,-H2TxBody,-VdpAcct (default)
vsl_reclen 255b [bytes] (default)
vsl_space 1G [bytes]
vsm_free_cooldown 60.000 [seconds] (default)
workspace_backend 96k [bytes] (default)
workspace_client 96k [bytes] (default)
workspace_session 0.75k [bytes] (default)
workspace_thread 2k [bytes] (default)
1.3 caching.vcl
// Receive-phase cache decision (vcl_recv)
sub caching_recv_decision {
// ... HTML/JSON pass
// ... non-GET/HEAD pass
// ... /api/ pass
// Static assets allow hash (cache lookup)
if (req.url ~ "\.(css|js|jpg|jpeg|gif|png|ico|webp|svg|woff|woff2|ttf|eot|mov|mp4|webm|mp3|ogg|pdf|zip)(\?.*)?$") {
unset req.http.Cookie;
return (hash);
}
// Optimization 1: Dynamic content bypass (pass)
return (pass);
}
// Backend-response cache decision (vcl_backend_response)
sub caching_backend_decision {
// Static assets long TTL logic
if (bereq.url ~ "\.(css|js|jpg|jpeg|gif|png|ico|webp|svg|woff|woff2|ttf|eot|mov|mp4|webm|mp3|ogg|pdf|zip)(\?.*)?$") {
set beresp.ttl = 30d;
// Optimization 2: 60MB disk split
if (std.integer(beresp.http.Content-Length, 0) > 62914560) {
set beresp.http.X-Cache-Storage = "disk";
set beresp.storage = storage.disk; // Write to 50G disk
} else {
set beresp.http.X-Cache-Storage = "memory";
set beresp.storage = storage.mem; // Write to 20G memory
}
}
// Optimization 1 effect: Non-static content security check
else {
set beresp.uncacheable = true;
return (deliver);
}
// ... (other error/size checks)
}
// Hash logic
sub caching_hash_logic {
// Optimization 3: strip query parameters when caching static files
if (req.url ~ "\.(css|js|jpg|jpeg|gif|png|ico|webp|svg|woff|woff2|ttf|eot|mov|mp4|webm|mp3|ogg|pdf|zip)(\?.*)?$") {
hash_data(regsub(req.url, "\?.*$", ""));
} else {
// Dynamic content uses full URL
hash_data(req.url);
}
hash_data(req.http.host);
}
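The intent of caching_hash_logic can be sketched in Python (a model of the VCL above, not of Varnish internals; the extension list is abbreviated and SHA-256 stands in for the real hashing):

```python
import hashlib
import re

STATIC = re.compile(r"\.(css|js|jpg|png|mp4|webm)(\?.*)?$")  # abbreviated list

def cache_key(url, host):
    """Model of caching_hash_logic: static assets hash without their
    query string, dynamic URLs hash verbatim; host is always mixed in."""
    if STATIC.search(url):
        url = re.sub(r"\?.*$", "", url)  # mirrors regsub(req.url, "\?.*$", "")
    return hashlib.sha256(f"{url}|{host}".encode()).hexdigest()

# Two cache-busting variants of one asset collapse to a single object:
same = cache_key("/a.css?v=1", "example.com") == cache_key("/a.css?v=2", "example.com")
# Dynamic URLs keep their query string, so they stay distinct:
diff = cache_key("/api/x?p=1", "example.com") != cache_key("/api/x?p=2", "example.com")
```

Both `same` and `diff` hold, which is the point of the optimization: versioned query strings on static files no longer create duplicate cache entries.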
"My Varnish node frequently experiences its memory filling up completely. After the memory becomes full, subsequent requests begin to fail, as evidenced by the statistic c_fail > 0 (allocation failures). This then triggers the LRU cleanup mechanism, but it only evicts a small portion of objects at a time, which fails to solve the root problem, and the memory fills up again shortly after. My question is: are there any parameters in Varnish 7.5 that can trigger the LRU cleanup mechanism when memory utilization exceeds a specific threshold, for example 80%? Alternatively, do you have any other optimization strategies that could fundamentally resolve this issue of the memory constantly filling up?"
Re @waynebzhang Your report sounds similar to what has been discussed in this ticket and I have nothing to add to my previous note.
I will edit your post because it contains a lot of detail