directors locking: CLI ops vs. VRT_DelDirector / VRT_AddDirector
This tickets comes from a very true comment by @bsdphk https://github.com/varnishcache/varnish-cache/issues/3094#issuecomment-548274037
Wasn't there something about CLI wandering the backends getting in the way of traffic when creation of dynamic backends got stuck behind CLI's holding the mutex ?
VCL_IterDirector() (rightly) holds the vcl_mtx while it is generating backend information
https://github.com/varnishcache/varnish-cache/blob/8e7b0f01d7c2bb0ae57c016f8bfb01d1fbd00dcb/bin/varnishd/cache/cache_vcl.c#L319
So CLI commands block all director operations.
For all practical purposes, I do not think this is too much of an issue because CLI writes happen into a VSB. So as long as the director list and healthy callbacks do not conduct any time consuming or potentially blocking operations, the potential impact on traffic should be rather limited.
So the first question is: Do we want to do anything about it? Please provide feedback.
If yes, my preferred option would be to solve this issue through director references (CLI gaining references for the duration of the operation), see #2725 . Other options may include to always put directors on a cool list when deleting, so the CLI operations can walk a copy of the director list safely outside the lock.
I want to keep this issue separate from #3094 - because that ticket concerns another lock which we do not actually need and which does not solve this problem either.
The cool list idea would probably generalize VRT_delete_backend() to directors
https://github.com/varnishcache/varnish-cache/blob/8e7b0f01d7c2bb0ae57c016f8bfb01d1fbd00dcb/bin/varnishd/cache/cache_backend.c#L646
I am really not happy about the cool-list, primarily because it is optimistic.
I tried adding director refcounts cheaply some weeks ago, but it became messy so I gave up.
We may have to simply bite that bullet.
I still think that the previous refusal of my director refcount suggestion was based on a wrong assumption, and I still stink that the suggestion I made is cheap. I shall go back to that PR and revive it.
Is this ticket serving any purpose any more ?
Is everybody happy with my suggestion https://github.com/varnishcache/varnish-cache/issues/3119#issuecomment-551034495 ?
I find it hard to work with reference counting since that would force VMODs to opt in and not mess everything up.
I currently see no future-proof way to handle this, and speaking strictly about the CLI holding the VCL lock, this can become really ugly in setups with tons of dynamic backends. Maybe we should consider moving away from a tail queue to something that better handles the many-backends case, but that too is its own challenge because of our glob-based backend iteration.
Maybe we should have a "separate" lock to add backends, and then a CLI "before" action to commit them in the "main" backend list to let VCL make progress when the CLI is busy iterating over it.
@Dridi I believe my previous PR addressed all your concerns, except for the fact, that it used one additional lock on the same level as the vcl_mtx. I think that additional lock can in fact be avoided and the operations put into VCL_Ref() / VCL_Rel().
Feel free to have a look at the old PR #2725 but please keep in mind the above.