valkey icon indicating copy to clipboard operation
valkey copied to clipboard

CLIENT UNBLOCK should't be able to unblock client that doesn't has timeout callback

Open enjoy-binbin opened this issue 7 months ago • 1 comments

When a client is blocked by something like CLIENT PAUSE, we should not allow CLIENT UNBLOCK timeout to unblock it, since some blocking types does not has the timeout callback, it will trigger a panic in the core, people should use CLIENT UNPAUSE to unblock it.

Also using CLIENT UNBLOCK error is not right, it will return a UNBLOCKED error to the command, people don't expect a SET command to get an error.

So in this commit, in these cases, we will return 0 to CLIENT UNBLOCK to indicate the unblock is fail. The reason is that we assume that if a command doesn't expect to be timedout, it also doesn't expect to be unblocked by CLIENT UNBLOCK.

The old behavior of the following command will trigger panic in timeout and get UNBLOCKED error in error. Under the new behavior, client unblock will get the result of 0.

client 1> client pause 100000 write
client 2> set x x

client 1> client unblock 2 timeout
or
client 1> client unblock 2 error

Potentially breaking change, previously allowed CLIENT UNBLOCK error. Fixes #2111.

enjoy-binbin avatar May 22 '25 04:05 enjoy-binbin

Codecov Report

Attention: Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.

Project coverage is 71.21%. Comparing base (36b51c9) to head (4db45f8). Report is 38 commits behind head on unstable.

Files with missing lines Patch % Lines
src/blocked.c 85.71% 1 Missing :warning:
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2117      +/-   ##
============================================
+ Coverage     71.19%   71.21%   +0.02%     
============================================
  Files           122      122              
  Lines         66046    66053       +7     
============================================
+ Hits          47020    47039      +19     
+ Misses        19026    19014      -12     
Files with missing lines Coverage Δ
src/networking.c 87.29% <100.00%> (+0.10%) :arrow_up:
src/server.h 100.00% <ø> (ø)
src/blocked.c 91.31% <85.71%> (-0.12%) :arrow_down:

... and 13 files with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov[bot] avatar May 22 '25 04:05 codecov[bot]

Let's backport it to all versions, ok?

Or only backport the client unblock timeout case where it could crash?

zuiderkwast avatar May 26 '25 11:05 zuiderkwast

Let's backport it to all versions, ok?

Or only backport the client unblock timeout case where it could crash?

I think cherry picking the exact code would be easier and not a problem. I am not sure why the error is ever used (as I am also not sure when is this API ever used in realtime scenarios and not testing)

ranshid avatar May 26 '25 12:05 ranshid

I wonder if the change is more severe than it appears at first glance. One could implement the "monitor a changing set of keys" idea from the documentation of CLIENT UNBLOCK as follows:

Loop forever:

  1. BLPOP <current set of observed keys> 0
  2. If not timed out: work on result

And in parallel:

Trigger when set of keys changes:

  1. Update current set of observed keys
  2. CLIENT UNBLOCK <client id>

Key property today: Client unblocking works even during CLIENT PAUSE, ensuring the key set gets updated on next iteration. (Note that there is no need to even look at the reply of CLIENT UNBLOCK)

The proposed change in this PR creates ambiguity in the CLIENT UNBLOCK response:

"0" reply could now mean either:

  1. Client wasn't blocked
  2. Client couldn't be unblocked due to pause (new meaning)

This will break the code above if the trigger occurs when BLPOP is blocking due to CLIENT PAUSE: The BLPOP will continue to wait using the outdated set of keys even if the clients will be unpaused later.

I still think that unblocking the client is wrong (as said by @zuiderkwast CLIENT PAUSE and blocking commands are two different concepts). But maybe CLIENT UNBLOCK should return an error if the command can't be unblocked due to a CLIENT PAUSE? (or should CLIENT UNBLOCK block?)

gmbnomis avatar May 26 '25 15:05 gmbnomis

@gmbnomis Good point. It seems that we should unblock the client if it's blocked and paused at the same time. I.e. it should be possible to unblock a BLPOP even during client pause write. Makes sense?

zuiderkwast avatar May 26 '25 16:05 zuiderkwast

It seems that we should unblock the client if it's blocked and paused at the same time. I.e. it should be possible to unblock a BLPOP even during client pause write.

That's an interesting perspective. Basically, because pausing clients and blocking commands are different concepts from a user's perspective, we could argue that the user does not need to be concerned with the underlying reason why a blocking command is blocked.

So yes, it makes sense that CLIENT UNBLOCK will just unblock it. And, as this neither reads or writes, this does no harm even if done during a read-write pause. (but I suppose it will be a pain to implement this behavior)

OTOH, my scenario from above turns out to be not fully correct: Today, the server panics when a CLIENT UNBLOCK TIMEOUT is issued for a client that is paused (even if it is a blocking command). Only a CLIENT UNBLOCK ERROR works. So, I wonder if anybody is actually using this pattern...

gmbnomis avatar May 26 '25 19:05 gmbnomis

Today, the server panics when a CLIENT UNBLOCK TIMEOUT is issued for a client that is paused (even if it is a blocking command).

Yes, I can confirm this. This is an even worse bug IMO. It means the server can crash when you try to CLIENT UNBLOCK a client that is blocking on BLPOP. The user may not know that CLIENT PAUSE is in action, for example during a failover or something like that.

The fix prevents the crash, but unfortunately it prevents unblocking BLPOP while clients are paused, but this depends on which was first: block or pause.

  • If a command was blocked before paused, it is possible to unblock it using CLIENT UNBLOCK.
  • If clients were paused before a blocking command was called and blocked, it's not possible to unblock it using CLIENT UNBLOCK.

zuiderkwast avatar May 27 '25 13:05 zuiderkwast

The fix prevents the crash, but unfortunately it prevents unblocking BLPOP while clients are paused, but this depends on which was first: block or pause.

* If a command was blocked before paused, it is possible to unblock it using CLIENT UNBLOCK.

* If clients were paused before a blocking command was called and blocked, it's not possible to unblock it using CLIENT UNBLOCK.

@zuiderkwast Yes, this is true. I think that none of the alternatives to this behavior is optimal:

  1. Unblock clients regardless of pause/block state
    • Pros: Consistent unblocking behavior.
    • Cons: When clients are paused early in command processing, we lack the necessary context (command type, expected reply) to generate a proper timeout response, especially for module commands. This would likely require a new module API to communicate this information.
  2. Return an error when paused
    • Pros: Simple to implement.
    • Cons: Unexpected for clients, as they may not anticipate this error.
  3. Postpone CLIENT UNBLOCK of a paused command until the pause expires/is lifted
    • Pros: From the client’s perspective, this is similar to option 1 but with some delay—in certain cases (like WAIT), this may be the most logical behavior.
    • Cons: This introduces a (new?) edge case where one command is deferred by another. We’d need to ensure that commands are unblocked and executed in the correct order, but since blocked commands are already managed in a linked list, this might be feasible.

Overall, option 3 makes most sense to me.

gmbnomis avatar Jun 02 '25 18:06 gmbnomis

All options have pros and cons. :thinking: They also require more work, and these cases are very unusual. Maybe there is no use case for these combinations?

IMO we can merge this PR. It is simple and it prevents the crash, which is the most important problem to fix. OK?

zuiderkwast avatar Jun 02 '25 19:06 zuiderkwast

@gmbnomis Thank you for your summary comment @valkey-io/core-team Please also take a look at the comments above and vote.

enjoy-binbin avatar Jun 03 '25 02:06 enjoy-binbin

The fix prevents the crash, but unfortunately it prevents unblocking BLPOP while clients are paused, but this depends on which was first: block or pause.

* If a command was blocked before paused, it is possible to unblock it using CLIENT UNBLOCK.

* If clients were paused before a blocking command was called and blocked, it's not possible to unblock it using CLIENT UNBLOCK.

@zuiderkwast Yes, this is true. I think that none of the alternatives to this behavior is optimal:

  1. Unblock clients regardless of pause/block state

    • Pros: Consistent unblocking behavior.
    • Cons: When clients are paused early in command processing, we lack the necessary context (command type, expected reply) to generate a proper timeout response, especially for module commands. This would likely require a new module API to communicate this information.
  2. Return an error when paused

    • Pros: Simple to implement.
    • Cons: Unexpected for clients, as they may not anticipate this error.
  3. Postpone CLIENT UNBLOCK of a paused command until the pause expires/is lifted

    • Pros: From the client’s perspective, this is similar to option 1 but with some delay—in certain cases (like WAIT), this may be the most logical behavior.
    • Cons: This introduces a (new?) edge case where one command is deferred by another. We’d need to ensure that commands are unblocked and executed in the correct order, but since blocked commands are already managed in a linked list, this might be feasible.

Overall, option 3 makes most sense to me.

I think that (as I think you also state) the ambiguity was always there. when we pause clients some commands with timeout are not fulfilling the timeout guarantee since it is just too hard to do so. For example we cannot timeout commands which are just sitting in the incoming socket stream as we did not yet parse them. I think this PR is fine in solving the immediate problem. I would also think that extending CLIENT PAUSE with an extra optional argument (client-id) would be a better choice to extend the API to be more clear and allow unblocking a specific paused client

ranshid avatar Jun 03 '25 09:06 ranshid

Let's merge it so it can be included in

  • #2173

zuiderkwast avatar Jun 07 '25 02:06 zuiderkwast