
[Bug] A new BLPOP call fails to return any results for a key after cancelling a previous Python task waiting on a BLPOP for that key

Open · a1higgins-oss opened this issue 3 months ago · 6 comments

Describe the bug

This is a continuation of bug https://github.com/valkey-io/valkey-glide/issues/4591

Instead of going into a busy spin, the new code fails to return any results for a BLPOP on a specific key if a previous BLPOP for that key was waiting in a Python task that was cancelled.

Expected Behavior

The Python wrapper for BLPOP returns the value that the Rust library has received.

Current Behavior

BLPOP hangs (with timeout=0) or times out (with timeout > 0), but it never returns the value.

Reproduction Steps

Unit test to reproduce:

import asyncio
import logging

from glide import (
    GlideClusterClient,
    GlideClusterClientConfiguration,
    NodeAddress,
)

logger = logging.getLogger(__name__)


async def test_glide_runtime_thread_stale_writer():
    queue_name = "test_glide_runtime_thread_stale_writer"

    # Set up the writer connection
    client_config: GlideClusterClientConfiguration = GlideClusterClientConfiguration(
        addresses=[NodeAddress(host="localhost", port=6379)],
        use_tls=False,
    )
    write_connection = await GlideClusterClient.create(client_config)

    # Helper: return one BLPOP response, or None after a 10-second timeout
    async def get_blpop_value() -> list[bytes] | None:
        blocking_connection = await GlideClusterClient.create(client_config)
        try:
            return await blocking_connection.blpop([queue_name], timeout=10.0)
        except Exception as e:
            logger.error(f"Error in get_blpop_value: {e}")
            await blocking_connection.close()
            return None

    # Confirm valid blocking behavior
    blocking_task1 = asyncio.create_task(get_blpop_value())
    await asyncio.sleep(3)  # allow context switch
    await write_connection.lpush(queue_name, ["######Test1######"])
    result1 = await blocking_task1
    logger.debug(f"Result1: {result1}")

    assert result1 is not None
    assert result1[1] == b"######Test1######"

    # Trigger the failure mode: cancel a task that is waiting on BLPOP
    # (comment out this five-line block to see a passing test)
    blocking_task2 = asyncio.create_task(get_blpop_value())
    await asyncio.sleep(3)  # allow context switch
    blocking_task2.cancel()
    while not blocking_task2.done():
        await asyncio.sleep(0.1)

    # Test failure mode
    blocking_task3 = asyncio.create_task(get_blpop_value())
    await asyncio.sleep(3)  # allow context switch
    await write_connection.lpush(queue_name, ["@@@@@@Test2@@@@@@"])
    result3 = await blocking_task3
    logger.debug(f"Result3: {result3}")

    assert result3 is not None
    assert result3[1] == b"@@@@@@Test2@@@@@@"
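
One caveat about the helper above: asyncio.CancelledError derives from BaseException in Python 3.8+, so the except Exception handler never runs when the task is cancelled, and that task's blocking_connection is left open. A hypothetical variant of the nested helper that also closes the connection on cancellation is sketched below; whether the explicit close changes the outcome is exactly what is under investigation here.

    async def get_blpop_value() -> list[bytes] | None:
        blocking_connection = await GlideClusterClient.create(client_config)
        try:
            return await blocking_connection.blpop([queue_name], timeout=10.0)
        except asyncio.CancelledError:
            # CancelledError is a BaseException, so the handler below
            # never sees it; close explicitly, then re-raise so the
            # task still ends up cancelled.
            await blocking_connection.close()
            raise
        except Exception as e:
            logger.error(f"Error in get_blpop_value: {e}")
            await blocking_connection.close()
            return None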

Possible Solution

I've added a bunch of debug logging to the Rust shared lib today; I suspect the socket writer is not cleaned up correctly after the BLPOP is cancelled. The newly received key/value is being sent to the old writer and not to the newly opened fd... but I'm learning Rust as I go, so this needs new eyes.

I can confirm that the value I expect the final BLPOP to return is correctly read by the Rust library (logged in glide-core/src/socket_listeners.rs); it just never makes it to the Python client over the Unix socket.
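
To make the hypothesis concrete, here is a deliberately simplified Python model of the suspected failure mode. All names are illustrative and do not correspond to glide-core's actual types; it only shows how a stale per-client writer entry, left behind by the cancelled task, could swallow the reply meant for the new connection:

from typing import Callable, Dict

class WriterRegistry:
    """Routes replies to a per-client writer callback (toy model)."""

    def __init__(self) -> None:
        self._writers: Dict[str, Callable[[bytes], None]] = {}

    def register(self, client_id: str, write: Callable[[bytes], None]) -> None:
        # Suspected bug, modeled: a stale entry left behind by the
        # cancelled task shadows the new connection's writer.
        self._writers.setdefault(client_id, write)

    def dispatch(self, client_id: str, payload: bytes) -> None:
        # The reply read from the server goes to the stale callback
        # (a dead fd in the real library), so the new Python-side
        # reader never receives it.
        self._writers[client_id](payload)

# Demo: the writer from the cancelled task still receives the value.
registry = WriterRegistry()
registry.register("client-A", lambda b: print("stale writer got:", b))
registry.register("client-A", lambda b: print("new writer got:", b))
registry.dispatch("client-A", b"@@@@@@Test2@@@@@@")  # "stale writer got: ..."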

Additional Information/Context

No response

Client version used

Local debug build from today's near-latest commit c45c5a06e12a0dbbd3b2d281fbb0c84e3aca9e40

Engine type and version

valkey/valkey:8.1.1-alpine

OS

PC using WSL, and an AWS ECS container based on python:3.10-slim

Language

Python

Language Version

3.10.18

Cluster information

docker-compose:

services:
  valkey-cluster:
    image: valkey/valkey:8.1.1-alpine
    container_name: valkey-cluster
    command: >
      valkey-server
      --port 6379
      --cluster-enabled yes
      --cluster-config-file nodes.conf
      --cluster-node-timeout 5000
      --cluster-announce-port 6379
      --cluster-announce-bus-port 16379
      --appendonly no
    ports:
      - "6379:6379"
    networks:
      - valknet

  cluster-init:
    image: valkey/valkey:8.1.1-alpine
    depends_on:
      - valkey-cluster
    entrypoint: >
      /bin/sh -c "
      sleep 3 &&
      echo 'yes' | valkey-cli --cluster create valkey-cluster:6379 --cluster-yes
      "
    networks: [valknet]

networks:
  valknet:
    driver: bridge
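
Once the stack is started with docker compose up -d, a quick sanity check that the single-node cluster answers, using the same client as in the repro (check_cluster is a hypothetical helper; ping is expected to return b'PONG' on success), might look like:

import asyncio

from glide import GlideClusterClient, GlideClusterClientConfiguration, NodeAddress

async def check_cluster() -> None:
    config = GlideClusterClientConfiguration(
        addresses=[NodeAddress(host="localhost", port=6379)],
        use_tls=False,
    )
    client = await GlideClusterClient.create(config)
    print(await client.ping())  # expect b'PONG' once cluster-init has finished
    await client.close()

asyncio.run(check_cluster())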

Logs

valkey-cli MONITOR logs for the unit test above. The @@@@@@Test2@@@@@@ string never arrives at the Python client, but it can be seen arriving in the Rust shared lib.

1756898040.209504 [0 172.18.0.1:35574] "HELLO" "3"
1756898040.209934 [0 172.18.0.1:35584] "HELLO" "3"
1756898040.210772 [0 172.18.0.1:35584] "CLIENT" "SETINFO" "LIB-NAME" "GlidePy"
1756898040.210793 [0 172.18.0.1:35584] "CLIENT" "SETINFO" "LIB-VER" "unknown"
1756898040.210893 [0 172.18.0.1:35574] "CLIENT" "SETINFO" "LIB-NAME" "GlidePy"
1756898040.210897 [0 172.18.0.1:35574] "CLIENT" "SETINFO" "LIB-VER" "unknown"
1756898040.211169 [0 172.18.0.1:35574] "PING"
1756898040.211517 [0 172.18.0.1:35584] "CLIENT" "SETNAME" "glide_management_connection"
1756898040.213884 [0 172.18.0.1:35584] "CLUSTER" "SLOTS"
1756898040.216931 [0 172.18.0.1:35584] "PING"
1756898040.217009 [0 172.18.0.1:35574] "PING"
1756898040.221202 [0 172.18.0.1:35592] "HELLO" "3"
1756898040.221302 [0 172.18.0.1:35600] "HELLO" "3"
1756898040.221659 [0 172.18.0.1:35592] "CLIENT" "SETINFO" "LIB-NAME" "GlidePy"
1756898040.221665 [0 172.18.0.1:35592] "CLIENT" "SETINFO" "LIB-VER" "unknown"
1756898040.221884 [0 172.18.0.1:35600] "CLIENT" "SETINFO" "LIB-NAME" "GlidePy"
1756898040.221898 [0 172.18.0.1:35600] "CLIENT" "SETINFO" "LIB-VER" "unknown"
1756898040.222357 [0 172.18.0.1:35592] "PING"
1756898040.222738 [0 172.18.0.1:35600] "CLIENT" "SETNAME" "glide_management_connection"
1756898040.223288 [0 172.18.0.1:35600] "CLUSTER" "SLOTS"
1756898040.223963 [0 172.18.0.1:35600] "PING"
1756898040.223970 [0 172.18.0.1:35592] "PING"
1756898040.226686 [0 172.18.0.1:35592] "BLPOP" "test_glide_runtime_thread_stale_writer" "10.0"
1756898043.223466 [0 172.18.0.1:35574] "LPUSH" "test_glide_runtime_thread_stale_writer" "######Test1######"
1756898043.230209 [0 172.18.0.1:58962] "HELLO" "3"
1756898043.230421 [0 172.18.0.1:58960] "HELLO" "3"
1756898043.231002 [0 172.18.0.1:58962] "CLIENT" "SETINFO" "LIB-NAME" "GlidePy"
1756898043.231014 [0 172.18.0.1:58962] "CLIENT" "SETINFO" "LIB-VER" "unknown"
1756898043.231074 [0 172.18.0.1:58960] "CLIENT" "SETINFO" "LIB-NAME" "GlidePy"
1756898043.231077 [0 172.18.0.1:58960] "CLIENT" "SETINFO" "LIB-VER" "unknown"
1756898043.231713 [0 172.18.0.1:58960] "PING"
1756898043.232157 [0 172.18.0.1:58962] "CLIENT" "SETNAME" "glide_management_connection"
1756898043.232726 [0 172.18.0.1:58962] "CLUSTER" "SLOTS"
1756898043.233358 [0 172.18.0.1:58962] "PING"
1756898043.233365 [0 172.18.0.1:58960] "PING"
1756898043.235381 [0 172.18.0.1:58960] "BLPOP" "test_glide_runtime_thread_stale_writer" "10.0"
1756898046.333481 [0 172.18.0.1:58972] "HELLO" "3"
1756898046.333808 [0 172.18.0.1:58982] "HELLO" "3"
1756898046.334165 [0 172.18.0.1:58972] "CLIENT" "SETINFO" "LIB-NAME" "GlidePy"
1756898046.334171 [0 172.18.0.1:58972] "CLIENT" "SETINFO" "LIB-VER" "unknown"
1756898046.334234 [0 172.18.0.1:58982] "CLIENT" "SETINFO" "LIB-NAME" "GlidePy"
1756898046.334238 [0 172.18.0.1:58982] "CLIENT" "SETINFO" "LIB-VER" "unknown"
1756898046.334763 [0 172.18.0.1:58972] "PING"
1756898046.335181 [0 172.18.0.1:58982] "CLIENT" "SETNAME" "glide_management_connection"
1756898046.335852 [0 172.18.0.1:58982] "CLUSTER" "SLOTS"
1756898046.336485 [0 172.18.0.1:58982] "PING"
1756898046.336541 [0 172.18.0.1:58972] "PING"
1756898046.337961 [0 172.18.0.1:58972] "BLPOP" "test_glide_runtime_thread_stale_writer" "10.0"
1756898049.333610 [0 172.18.0.1:35574] "LPUSH" "test_glide_runtime_thread_stale_writer" "@@@@@@Test2@@@@@@"

Other information

No response

a1higgins-oss · Sep 03 '25

Thank you @a1higgins-oss for creating this follow-up issue! Will look into it and update soon.

jamesx-improving · Sep 03 '25

Hi James, any chance you've managed to reproduce? We're hoping for a fix before the 17th, as this blocks a major move to ElastiCache for Valkey for us.

a1higgins-oss · Sep 09 '25

Sorry @a1higgins-oss, with the 2.1 release so close, development capacity is tight and it's unlikely we can squeeze in a fix for this one.

I'll see what I can do, but don't keep your hopes high.

jamesx-improving · Sep 13 '25

Update: still working on finding the root cause. Since it didn't make it into release 2.1, I'm treating it as a 2.2 candidate, unless a quick and simple fix is found, which would qualify it for one of the patch releases in the near future.

jamesx-improving · Sep 23 '25

@jamesx-improving Can you confirm whether you've managed to reproduce the issue, or whether you need any more info?

a1higgins-oss · Sep 28 '25

@jamesx-improving Can you confirm if you've managed to reproduce the issue or if you need any more info?

a1higgins-oss · Nov 13 '25