varnish-cache icon indicating copy to clipboard operation
varnish-cache copied to clipboard

cache_vcl: Do the temperature dance for backend creation races

Open nigoroll opened this issue 1 year ago • 4 comments

Dynamic backends may get created while we transition a VCL to the WARM temperature: Because we first set the new temperature and then post the event, backends might get created when the temperature is already WARM, but before the VCL_EVENT_WARM gets posted, which leads to double warming. This can be demonstrated with an appropriate delay during vcl_send_event().

The solution to this problem is simple: Only post WARM events to backends from before the start of the transition.

In code, it looks a little more involved, but it is not complicated:

Under the vcl mutex, we take all directors onto a separate "cold" list and change the temperature, and send the warm event only to the "cold" directors. When all are warm, we concatenate.

To undo, we basically do the inverse: Save the warm directors, put the cold ones back, send a cold event to all warm ones and concatenate again.

Fixes #4199

nigoroll avatar Sep 30 '24 18:09 nigoroll

Sorry, if anyone was real quick they might have caught a little glitch. I force-pushed a correction.

nigoroll avatar Sep 30 '24 19:09 nigoroll

bugwash: leave more elaborate explanation on the race here

nigoroll avatar Oct 07 '24 13:10 nigoroll

@bsdphk as promised, the more elaborate explanation of the race:

https://github.com/varnishcache/varnish-cache/blob/305be79c8bc2bab4abb30a9c167c2d5bb35144ce/bin/varnishd/cache/cache_vcl.c#L632-L634

when, after this point, VRT_AddDirector() gets called from another thread, the new director is added to the director list and the first VCL_EVENT_WARM is posted here:

https://github.com/varnishcache/varnish-cache/blob/305be79c8bc2bab4abb30a9c167c2d5bb35144ce/bin/varnishd/cache/cache_vrt_vcl.c#L242-L245

Then, the second event gets posted from here:

https://github.com/varnishcache/varnish-cache/blob/305be79c8bc2bab4abb30a9c167c2d5bb35144ce/bin/varnishd/cache/cache_vcl.c#L637

Regarding your question on IRC:

(15:50:38) phk: also, new backends are added to the tail of the list, so we could just break the loop first time we see an already warm backend, no ?

Yes, I think this would be possible, if we added the current temperature to directors. Yet it would conflict with another goal, namely to not hold the vcl_mtx while posting the actual event in vcl_BackendEvent(). The reason for this is #4140, addressed by #4142: We have a deadlock-by-design issue when deleting dynamic backends.

So, in short: Yes, we could avoid this problem dancing less but deadlocking more.

nigoroll avatar Oct 07 '24 15:10 nigoroll

Set to draft because #4142 should be done first.

nigoroll avatar Oct 10 '24 09:10 nigoroll

Now that #4142 is done, I hope to get back to this eventually

nigoroll avatar Jan 22 '25 18:01 nigoroll

With #4142 in place, I think there is a better way, so I will close this PR and add a replacement.

nigoroll avatar Feb 04 '25 16:02 nigoroll