Bug 3565: Resuming postponed accept kills Squid

Open · yadij opened this issue · 10 comments

Coding Guideline: "never use raw-pointers to AsyncJob"

AcceptLimiter was violating this with stored pointers to TcpAcceptor instances. Fix this by storing Call objects instead.

The AsyncCall Dialer already handles Job removal and pointless Calls. Drop the removeDead() and kick() loop as no longer necessary. The memory optimization they were providing is not possible with the current AsyncCall API, but may be restored when the TODO about adding an AsyncCall::canDial() member is acted on.
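For illustration, a toy sketch of the "store Calls, not job pointers" idea, using std::function plus std::weak_ptr in place of Squid's real AsyncCall/CbcPointer machinery (all names below are hypothetical, not Squid code):

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <memory>

// A stand-in for an async job (e.g., a TcpAcceptor-like object).
struct Job {
    int accepted = 0;
    void acceptNext() { ++accepted; }
};

// A limiter queue that stores callbacks instead of raw job pointers.
// A callback bound to a weak_ptr becomes a harmless no-op when its job
// is gone -- the same protection an AsyncCall dialer gives a destroyed
// AsyncJob, making explicit removeDead() bookkeeping unnecessary.
class Limiter {
public:
    void defer(std::function<bool()> call) { deferred_.push_back(std::move(call)); }

    // Dial the oldest deferred callback that is still dialable.
    void kick() {
        while (!deferred_.empty()) {
            auto call = std::move(deferred_.front());
            deferred_.pop_front();
            if (call())   // returns false when the job is already gone
                return;   // one free slot, one notification
        }
    }

private:
    std::deque<std::function<bool()>> deferred_;
};

// Helper: bind a job weakly so a dead job yields a "pointless call".
inline std::function<bool()> weakCall(const std::shared_ptr<Job> &job) {
    std::weak_ptr<Job> weak = job;
    return [weak]() {
        if (auto j = weak.lock()) { j->acceptNext(); return true; }
        return false; // job destroyed; skip silently, as a dialer would
    };
}
```

The point of the sketch is only the ownership shape: the queue never holds anything that can dangle, so dead jobs are skipped rather than dereferenced.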

yadij avatar Jul 07 '21 11:07 yadij

Since 69bb939, AcceptLimiter stores CbcPointers, not raw pointers. AFAICT, it validates those pointers before dereferencing them (except in removeDead() called from the job object itself). Could you please detail the bug you are trying to fix with these changes?

The issue I am trying to fix is that the AcceptLimiter::kick() call to TcpAcceptor::acceptNext() does not have AsyncCall protections for proper exception handling (principally Must() inside the TcpAcceptor logic).
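To make the missing protection concrete, here is a minimal sketch (hypothetical names; Squid's real Must() and AsyncCall classes are richer): a plain direct call lets a Must()-style exception escape into the caller, while an async-call wrapper catches it and turns it into a handled failure.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Simplified stand-in for Squid's Must(): throws on a failed precondition.
struct MustFailure : std::runtime_error {
    using std::runtime_error::runtime_error;
};

inline void Must(bool condition, const std::string &what) {
    if (!condition)
        throw MustFailure(what);
}

// Returns true if the dialed code ran to completion, false if the wrapper
// swallowed an exception. A real AsyncCall would log the error and let the
// job clean itself up instead of propagating into the unprotected caller.
inline bool dialProtected(const std::function<void()> &call) {
    try {
        call();
        return true;
    } catch (const std::exception &) {
        return false;
    }
}
```

Calling acceptNext() directly from kick() corresponds to skipping dialProtected(): the exception unwinds through AcceptLimiter::kick() and whatever called it.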

yadij avatar Jul 07 '21 20:07 yadij

The issue I am trying to fix is that the AcceptLimiter::kick() call to TcpAcceptor::acceptNext() does not have AsyncCall protections for proper exception handling

Understood. Thank you for disclosing this. The part of this PR that changes AcceptLimiter registrations from job pointers to async callbacks looks (conceptually) correct to me. Keep the (adjusted) removeDead() method though -- its "unregister me" purpose is unrelated to the bug you are fixing. TcpAcceptor should store its registered callback so that it can call removeDead() with it (instead of this). Drop the prose about raw pointers -- they are not the issue here.

The above is all straightforward. The biggest problem we will have to solve here is the okToAccept() loop in kick(). Once we go async, that condition in that loop stops working -- it will never change from true to false. There are several easy "solutions" here, including a simple replacement of the loop with an if statement (as was done in this PR), but they do not work (as you know by now).

The best solution I can suggest replaces the explicit while loop with an implicit one, done via a special TcpAcceptor callback data member. When that member is destroyed, it unconditionally calls AcceptLimiter::kick(). Some of those calls will be in vain -- either the AcceptLimiter queue will be empty or okToAccept() will be false and, hence, kick() will notify nobody. That is OK IMO: Squid is working at the FD limit when all this happens, so things already went very wrong, and we should trade an extra method call for code simplicity and correctness. Do not be tempted to "optimize" this by scheduling multiple acceptor notifications at once.
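A rough sketch of that implicit loop, under the stated design (all class names here are hypothetical stand-ins, not Squid code): a RAII member whose destructor unconditionally kicks the limiter at the end of each accept step, so draining the queue needs no explicit while loop.

```cpp
#include <cassert>
#include <deque>
#include <functional>

// Minimal limiter stand-in: a queue of deferred accept steps plus a
// counter showing that some kicks are in vain, which is the accepted cost.
class Limiter {
public:
    std::deque<std::function<void()>> deferred;
    int kicks = 0;

    void kick() {
        ++kicks;                  // may notify nobody; that is OK
        if (deferred.empty())
            return;
        auto next = std::move(deferred.front());
        deferred.pop_front();
        next();
    }
};

// The "special callback data member": constructed for the duration of one
// accept step; its destructor unconditionally kicks the limiter, chaining
// to the next deferred acceptor (if any) without an explicit loop.
class KickOnDestruct {
public:
    explicit KickOnDestruct(Limiter &l) : limiter_(l) {}
    ~KickOnDestruct() { limiter_.kick(); }
private:
    Limiter &limiter_;
};
```

Each step's cleanup triggers the next step, so two deferred acceptors are served by one initial kick plus two destructor kicks, the last of which finds an empty queue and does nothing.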

I can sketch the callback code with that special member if you would like, but it will take me some time to do that -- too busy fixing other bugs for the upcoming v5 release. I should not even be writing all this now, but I was worried about the many good-looking but not-working solutions that exist in this space. HTH.

rousskov avatar Jul 07 '21 22:07 rousskov

One other implementation would be to have each TcpAcceptor maintain its own deferred queue instead of a global AcceptLimiter instance. removeDead() would be replaced by normal RAII erasure when the TcpAcceptor is destroyed. Then kick() would perform the okToAccept() check after each accept(2) success.

You see any flaws with that design?

yadij avatar Jul 08 '21 15:07 yadij

One other implementation would be to have each TcpAcceptor maintain its own deferred queue instead of a global AcceptLimiter instance. You see any flaws with that design?

I see two flaws:

  1. A single TcpAcceptor cannot have a queue. That job is either deferred (not listening/accepting) or accepting.

  2. We need to know which deferred TcpAcceptor to awake when an FD slot becomes available. For that, we need a queue of deferred TcpAcceptors awaiting slots. That is exactly what AcceptLimiter is...

rousskov avatar Jul 08 '21 16:07 rousskov

Alex: The best solution I can suggest replaces the explicit while loop with an implicit one, done via a special TcpAcceptor callback data member.

I found a much simpler implementation of the same implicit loop idea: AcceptLimiter::kick() must always schedule either zero or two async calls. If there are no free FD slots or no deferred jobs, the method schedules nothing, as usual. If there are free FD slots, the method tries to schedule a job call. If it is successful at that, the method also schedules a call to itself. Problem solved.
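The zero-or-two rule can be sketched with a toy event loop (hypothetical names; Squid's real AsyncCallQueue and scheduling API differ): kick() either schedules nothing, or schedules one job call plus a call back to itself, so the queue drains without any explicit while loop.

```cpp
#include <cassert>
#include <deque>
#include <functional>

// Toy event loop standing in for the async call queue.
struct Loop {
    std::deque<std::function<void()>> events;
    void run() {
        while (!events.empty()) {
            auto e = std::move(events.front());
            events.pop_front();
            e();
        }
    }
};

class Limiter {
public:
    Limiter(Loop &loop, int freeSlots) : loop_(loop), freeSlots_(freeSlots) {}

    void defer(std::function<void()> jobCall) { deferred_.push_back(std::move(jobCall)); }

    // Schedules either zero or two async calls: one deferred job call
    // plus a call back to kick() itself, which re-evaluates the
    // okToAccept()-style condition (here: freeSlots_) on the next pass.
    void kick() {
        if (freeSlots_ <= 0 || deferred_.empty())
            return;                                           // zero calls
        --freeSlots_;
        loop_.events.push_back(std::move(deferred_.front())); // the job call
        deferred_.pop_front();
        loop_.events.push_back([this]{ kick(); });            // ...and itself
    }

private:
    Loop &loop_;
    int freeSlots_;
    std::deque<std::function<void()>> deferred_;
};
```

Because the condition is re-checked on every self-scheduled pass rather than inside one synchronous loop, it correctly stops the moment slots run out.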

BTW, if you prefer, you can keep TcpAcceptors registered as jobs. Registering AsyncCalls is probably a better long-term approach, but you can solve the problem you are trying to solve without changing the current registration approach (by creating an async call when the AcceptLimiter is ready to kick the registered and still valid TcpAcceptor job).

rousskov avatar Jul 08 '21 16:07 rousskov

Genius. So simple it's brilliant. Thank you! I will implement that latest suggestion and push in a day or so for a new review.

yadij avatar Jul 10 '21 03:07 yadij

@rousskov, the kick issue was resolved in 6800f53, so I am not sure what Semaphore is having issues with now. I would like to keep the defer() API change, but can revert that part for now if that fixes the Semaphore issues.

yadij avatar Jul 11 '21 23:07 yadij

I think I found the issue and a good fix. Will push as soon as a full set of build tests here finish.

yadij avatar Jul 12 '21 09:07 yadij

Status update: There are two paths forward for this bug fix.

  1. The reviewer accepts this change as "Good Enough" to go in without extending its scope to a complete redesign of TcpAcceptor.
  2. Wait for PR #870 to be completed.

yadij avatar Feb 29 '24 09:02 yadij

Status update: There are two paths forward for this bug fix.

  1. reviewer accepts this change as "Good Enough" to go in

AFAICT, nothing has changed to affect my earlier position: The current PR metadata and code are (still) not good enough.

Please do not misinterpret my response as an agreement (or disagreement) with the assertion that there are only two paths forward for addressing bug 3565 (or solving any other problem that this PR is trying to solve).

rousskov avatar Feb 29 '24 14:02 rousskov