[Bug] A new BLPOP calls fails to return any results for a key after cancelling a previous python task waiting on a BLPOP for that key
Describe the bug
This is a continuation of bug https://github.com/valkey-io/valkey-glide/issues/4591
Instead of going in to a busy spin, the new code fails to return any results for a BLPOP on a specific key, if a previous BLPOP for that key was waiting in a python task that was cancelled.
Expected Behavior
The python wrapper for BLPOP returns the value that the rust library has received
Current Behavior
BLPOP hangs (timeout=0) or times out if timeout > 0 - but it never returns the value
Reproduction Steps
Unit test to reproduce.
async def test_glide_runtime_thread_stale_writer():
queue_name = "test_glide_runtime_thread_stale_writer"
# Setup writer connection
client_config: GlideClusterClientConfiguration = GlideClusterClientConfiguration(
addresses=[NodeAddress(host="localhost", port=6379)],
use_tls=False,
)
write_connection = await GlideClusterClient.create(client_config)
# function to get 1 blpop response or timeout after 10secs
async def get_blpop_value() -> list[bytes] | None:
blocking_connection = await GlideClusterClient.create(client_config)
try:
return await blocking_connection.blpop([queue_name], timeout=float(10))
except Exception as e:
logger.error(f"Error in get_blpop_value: {e}")
await blocking_connection.close()
return None
# Confirm valid blocking behavior
blocking_task1 = asyncio.create_task(get_blpop_value())
await asyncio.sleep(3) # allow context switch
await write_connection.lpush(queue_name, ["######Test1######"])
result1 = await blocking_task1
logger.debug(f"Result1: {result1}")
assert result1 is not None
assert result1[1] == b"######Test1######"
# Trigger Failure mode - cancel task that is waiting on blpop
# comment this 5 line block out to see a passing test
blocking_task2 = asyncio.create_task(get_blpop_value())
await asyncio.sleep(3) # allow context switch
blocking_task2.cancel()
while not blocking_task2.done():
await asyncio.sleep(0.1)
# Test failure mode
blocking_task3 = asyncio.create_task(get_blpop_value())
await asyncio.sleep(3) # allow context switch
await write_connection.lpush(queue_name, ["@@@@@@Test2@@@@@@"])
result3 = await blocking_task3
logger.debug(f"Result3: {result3}")
assert result3 is not None
assert result3[1] == b"@@@@@@Test2@@@@@@"
Possible Solution
I've added a bunch of debug to the rust share lib today, I suspect the socket writer is not cleaned up correctly after the BLPOP is cancelled. The newly received key/value is being sent to the old writer and not to the newly opened fd.... but I'm learning rust as I go, so needs new eyes.
I can confirm that the value I am expecting the final BLPOP to return is correctly read by the rust library (logged in glide-core/src/socket_listeners.rs) - just never makes it to the python client over the unix socket.
Additional Information/Context
No response
Client version used
Local debug build from todays near latest commit c45c5a06e12a0dbbd3b2d281fbb0c84e3aca9e40
Engine type and version
valkey/valkey:8.1.1-alpine
OS
PC using WSL and AWS ECS container based on python:3.10-slim
Language
Python
Language Version
3.10.18
Cluster information
docker-compose:
services:
valkey-cluster:
image: valkey/valkey:8.1.1-alpine
container_name: valkey-cluster
command: >
valkey-server
--port 6379
--cluster-enabled yes
--cluster-config-file nodes.conf
--cluster-node-timeout 5000
--cluster-announce-port 6379
--cluster-announce-bus-port 16379
--appendonly no
ports:
- "6379:6379"
networks:
- valknet
cluster-init:
image: valkey/valkey:8.1.1-alpine
depends_on:
- valkey-cluster
entrypoint: >
/bin/sh -c "
sleep 3 &&
echo 'yes' | valkey-cli --cluster create valkey-cluster:6379 --cluster-yes
"
networks: [valknet]
networks:
valknet:
driver: bridge
Logs
valkey-cli MONITOR logs for the above unit test. The @@@@@@Test2@@@@@@ string never arrives in the python client but can be seen to arrive in the rust shared lib.
1756898040.209504 [0 172.18.0.1:35574] "HELLO" "3"
1756898040.209934 [0 172.18.0.1:35584] "HELLO" "3"
1756898040.210772 [0 172.18.0.1:35584] "CLIENT" "SETINFO" "LIB-NAME" "GlidePy"
1756898040.210793 [0 172.18.0.1:35584] "CLIENT" "SETINFO" "LIB-VER" "unknown"
1756898040.210893 [0 172.18.0.1:35574] "CLIENT" "SETINFO" "LIB-NAME" "GlidePy"
1756898040.210897 [0 172.18.0.1:35574] "CLIENT" "SETINFO" "LIB-VER" "unknown"
1756898040.211169 [0 172.18.0.1:35574] "PING"
1756898040.211517 [0 172.18.0.1:35584] "CLIENT" "SETNAME" "glide_management_connection"
1756898040.213884 [0 172.18.0.1:35584] "CLUSTER" "SLOTS"
1756898040.216931 [0 172.18.0.1:35584] "PING"
1756898040.217009 [0 172.18.0.1:35574] "PING"
1756898040.221202 [0 172.18.0.1:35592] "HELLO" "3"
1756898040.221302 [0 172.18.0.1:35600] "HELLO" "3"
1756898040.221659 [0 172.18.0.1:35592] "CLIENT" "SETINFO" "LIB-NAME" "GlidePy"
1756898040.221665 [0 172.18.0.1:35592] "CLIENT" "SETINFO" "LIB-VER" "unknown"
1756898040.221884 [0 172.18.0.1:35600] "CLIENT" "SETINFO" "LIB-NAME" "GlidePy"
1756898040.221898 [0 172.18.0.1:35600] "CLIENT" "SETINFO" "LIB-VER" "unknown"
1756898040.222357 [0 172.18.0.1:35592] "PING"
1756898040.222738 [0 172.18.0.1:35600] "CLIENT" "SETNAME" "glide_management_connection"
1756898040.223288 [0 172.18.0.1:35600] "CLUSTER" "SLOTS"
1756898040.223963 [0 172.18.0.1:35600] "PING"
1756898040.223970 [0 172.18.0.1:35592] "PING"
1756898040.226686 [0 172.18.0.1:35592] "BLPOP" "test_glide_runtime_thread_stale_writer" "10.0"
1756898043.223466 [0 172.18.0.1:35574] "LPUSH" "test_glide_runtime_thread_stale_writer" "######Test1######"
1756898043.230209 [0 172.18.0.1:58962] "HELLO" "3"
1756898043.230421 [0 172.18.0.1:58960] "HELLO" "3"
1756898043.231002 [0 172.18.0.1:58962] "CLIENT" "SETINFO" "LIB-NAME" "GlidePy"
1756898043.231014 [0 172.18.0.1:58962] "CLIENT" "SETINFO" "LIB-VER" "unknown"
1756898043.231074 [0 172.18.0.1:58960] "CLIENT" "SETINFO" "LIB-NAME" "GlidePy"
1756898043.231077 [0 172.18.0.1:58960] "CLIENT" "SETINFO" "LIB-VER" "unknown"
1756898043.231713 [0 172.18.0.1:58960] "PING"
1756898043.232157 [0 172.18.0.1:58962] "CLIENT" "SETNAME" "glide_management_connection"
1756898043.232726 [0 172.18.0.1:58962] "CLUSTER" "SLOTS"
1756898043.233358 [0 172.18.0.1:58962] "PING"
1756898043.233365 [0 172.18.0.1:58960] "PING"
1756898043.235381 [0 172.18.0.1:58960] "BLPOP" "test_glide_runtime_thread_stale_writer" "10.0"
1756898046.333481 [0 172.18.0.1:58972] "HELLO" "3"
1756898046.333808 [0 172.18.0.1:58982] "HELLO" "3"
1756898046.334165 [0 172.18.0.1:58972] "CLIENT" "SETINFO" "LIB-NAME" "GlidePy"
1756898046.334171 [0 172.18.0.1:58972] "CLIENT" "SETINFO" "LIB-VER" "unknown"
1756898046.334234 [0 172.18.0.1:58982] "CLIENT" "SETINFO" "LIB-NAME" "GlidePy"
1756898046.334238 [0 172.18.0.1:58982] "CLIENT" "SETINFO" "LIB-VER" "unknown"
1756898046.334763 [0 172.18.0.1:58972] "PING"
1756898046.335181 [0 172.18.0.1:58982] "CLIENT" "SETNAME" "glide_management_connection"
1756898046.335852 [0 172.18.0.1:58982] "CLUSTER" "SLOTS"
1756898046.336485 [0 172.18.0.1:58982] "PING"
1756898046.336541 [0 172.18.0.1:58972] "PING"
1756898046.337961 [0 172.18.0.1:58972] "BLPOP" "test_glide_runtime_thread_stale_writer" "10.0"
1756898049.333610 [0 172.18.0.1:35574] "LPUSH" "test_glide_runtime_thread_stale_writer" "@@@@@@Test2@@@@@@"
Other information
No response
Thank you @a1higgins-oss for creating this follow up issue! Will look into it and update soon.
Hi James, any chance you've managed to reproduce? We're hoping for a fix before the 17th as this blocks a major move to Elasticache-valkey for us.
Sorry @a1higgins-oss , with 2.1 release so close, development capacity is tight and it's unlikely we could squeeze in a fix for this one.
I'll see what I can do but don't keep your hope high.
Update: still WIP on finding the root cause. As it didn't make it as part of release 2.1, treating it as 2.2 candidate, unless a quick & simple fix is found, which would qualify it to be in one of the patch releases in the near future.
@jamesx-improving Can you confirm you've managed to reproduce the issue or if you need any more info?
@jamesx-improving Can you confirm if you've managed to reproduce the issue or if you need any more info?