confusion due to invisible VSL store overflow
I am in the process of analyzing this weird symptom, opening an issue in case this rings a bell anywhere or anyone has helpful input:
It looks like VSL queries for SessClose ~ RX_TIME were broken specifically. I do not think this is the actual cause, but it looks like it:
- collect a varnish log to ensure we are looking at complete data
$ varnishlog -g raw -w complete_log
- look for sessions with timeouts
(proxy protocol information redacted)
$ varnishlog -r complete_log -q 'SessClose ~ RX_TIME'| head -20
* << Session >> 55264167
- Begin sess 0 PROXY
- SessOpen 0.0.0.0 0 uds_proxy 0.0.0.0 0 1580823086.762403 983
- Proxy 2 A.B.C.D 1234 W.X.Y.Z 4567
- SessClose RX_TIMEOUT 10.000
- End
* << Session >> 58439167
- Begin sess 0 PROXY
- SessOpen 0.0.0.0 0 uds_proxy 0.0.0.0 0 1580823086.770828 302
- Proxy 2 A.B.C.D 1234 W.X.Y.Z 4567
- Link req 58439168 rxreq
- SessClose RX_TIMEOUT 10.005
- End
- try to get details about those: FAIL, no output
$ varnishlog -r complete_log -g session -q 'SessClose ~ RX_TIME'
$
- session grouping without a query works fine
$ varnishlog -r complete_log -g session|head -3
* << Session >> 49799445
- Begin sess 0 PROXY
- SessOpen 0.0.0.0 0 uds_proxy 0.0.0.0 0 1580823086.776863 119
- querying another
SessClosereason works
$ varnishlog -r complete_log -g session -q 'SessClose ~ RESP_CLO'|head -3
* << Session >> 49799445
- Begin sess 0 PROXY
- SessOpen 0.0.0.0 0 uds_proxy 0.0.0.0 0 1580823086.776863 119
- the linked requests from sessions (e.g. 58439167) exist:
$ varnishlog -r complete_log -q 'vxid == 58439168' | head -3
* << Request >> 58439168
- Begin req 58439167 rxreq
- Timestamp Start: 1580823086.774641 0.000000 0.000000
right after openening the ticket I had an idea leading further:
$ varnishlog -r complete_log -g session -q 'vxid == 58439167'
* << Session >> 58439167
- Begin sess 0 PROXY
- SessOpen 0.0.0.0 0 uds_proxy 0.0.0.0 0 1580823086.770828 302
- Proxy 2 A.B.C.D 1234 W.X.Y.Z 4567
- Link req 58439168 rxreq
- VSL store overflow
- End synth
so the relevant bit is the store overflow, what is happening is that the session grouping never gets to see the SessClose for the timeout case. adding -L 10000 helps getting some output.
Should we maybe refactor the way we report the store overflow to ensure that the error is visible?
bugwash:
- Add callback to VSL API at every segment switch if transactions had been dropped
- add
varnishlogparameter to report drops every x seconds (0 = off, default 5)
this is major release material due to the required api chance
while at it, we might also want to improve error reporting in general, see #3357 we could have pointers to varnish-faq(7) in confusing error messages