varnish-cache confusion due to invisible VSL store overflow

I am in the process of analyzing this weird symptom, opening an issue in case this rings a bell anywhere or anyone has helpful input:

It looks like VSL queries for SessClose ~ RX_TIME were broken specifically. I do not think this is the actual cause, but it looks like it:

collect a varnish log to ensure we are looking at complete data

$ varnishlog -g raw -w complete_log

look for sessions with timeouts

(proxy protocol information redacted)

$ varnishlog -r complete_log -q 'SessClose ~  RX_TIME'| head -20
*   << Session  >> 55264167  
-   Begin          sess 0 PROXY
-   SessOpen       0.0.0.0 0 uds_proxy 0.0.0.0 0 1580823086.762403 983
-   Proxy          2 A.B.C.D 1234 W.X.Y.Z 4567
-   SessClose      RX_TIMEOUT 10.000
-   End            

*   << Session  >> 58439167  
-   Begin          sess 0 PROXY
-   SessOpen       0.0.0.0 0 uds_proxy 0.0.0.0 0 1580823086.770828 302
-   Proxy          2 A.B.C.D 1234 W.X.Y.Z 4567
-   Link           req 58439168 rxreq
-   SessClose      RX_TIMEOUT 10.005
-   End

try to get details about those: FAIL, no output

$ varnishlog -r complete_log -g session -q 'SessClose ~  RX_TIME'   
$

session grouping without a query works fine

$ varnishlog -r complete_log -g session|head -3
*   << Session  >> 49799445  
-   Begin          sess 0 PROXY
-   SessOpen       0.0.0.0 0 uds_proxy 0.0.0.0 0 1580823086.776863 119

querying another SessClose reason works

$ varnishlog -r complete_log -g session -q 'SessClose ~ RESP_CLO'|head -3
*   << Session  >> 49799445  
-   Begin          sess 0 PROXY
-   SessOpen       0.0.0.0 0 uds_proxy 0.0.0.0 0 1580823086.776863 119

the linked requests from sessions (e.g. 58439167) exist:

$ varnishlog -r complete_log -q 'vxid == 58439168' | head -3
*   << Request  >> 58439168  
-   Begin          req 58439167 rxreq
-   Timestamp      Start: 1580823086.774641 0.000000 0.000000

Feb 04 '20 13:02 nigoroll

right after openening the ticket I had an idea leading further:

$ varnishlog -r complete_log -g session -q 'vxid == 58439167'
*   << Session  >> 58439167  
-   Begin          sess 0 PROXY
-   SessOpen       0.0.0.0 0 uds_proxy 0.0.0.0 0 1580823086.770828 302
-   Proxy          2 A.B.C.D 1234 W.X.Y.Z 4567
-   Link           req 58439168 rxreq
-   VSL            store overflow
-   End            synth

so the relevant bit is the store overflow, what is happening is that the session grouping never gets to see the SessClose for the timeout case. adding -L 10000 helps getting some output.

Should we maybe refactor the way we report the store overflow to ensure that the error is visible?

Feb 04 '20 14:02 nigoroll

bugwash:

Add callback to VSL API at every segment switch if transactions had been dropped
add varnishlog parameter to report drops every x seconds (0 = off, default 5)

Feb 10 '20 15:02 nigoroll

this is major release material due to the required api chance

Feb 26 '20 14:02 nigoroll

while at it, we might also want to improve error reporting in general, see #3357 we could have pointers to varnish-faq(7) in confusing error messages

Jun 29 '20 13:06 nigoroll