
test_change_pageserver is unstable due to async reload signal handling

Open alexanderlaw opened this issue 8 months ago • 1 comments

Multiple failures of test_change_pageserver, e.g.:

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10993/13587168793/index.html#/testresult/940f78b1b81ded4a test_change_pageserver[release-pg16] / X64 / __sanitizers: 'disabled'

https://neon-github-public-dev.s3.amazonaws.com/reports/main/13316989014/index.html#/testresult/e03d8f42343a4def test_change_pageserver[release-pg16] / ARM64 / __sanitizers: 'disabled'

https://neon-github-public-dev.s3.amazonaws.com/reports/main/13720657425/index.html#/testresult/328dcbe2aa327d95 test_change_pageserver[release-pg14] / ARM64 / __sanitizers: 'disabled'

https://neon-github-public-dev.s3.amazonaws.com/reports/main/13824996212/index.html#/testresult/5611bc191d85599 test_change_pageserver[release-pg17] / ARM64 / __sanitizers: 'enabled'

with the following diagnostics:



test_runner/regress/test_change_pageserver.py:91: in test_change_pageserver
    connstring = fetchone()
test_runner/regress/test_change_pageserver.py:56: in fetchone
    assert all(result == results[0] for result in results)
E   assert False
E    +  where False = all(<generator object test_change_pageserver.<locals>.fetchone.<locals>.<genexpr> at 0xffd43561dc40>)

with the corresponding test fragment:

    def fetchone():
        results = [cur.fetchone() for cur in curs]
        assert all(result == results[0] for result in results)
        return results[0]
...
    endpoint.reconfigure(pageserver_id=alt_pageserver_id)

    # Verify that the neon.pageserver_connstring GUC is set to the correct thing
    execute("SELECT setting FROM pg_settings WHERE name='neon.pageserver_connstring'")
    connstring = fetchone()

These failures indicate that the test might fail because "reconfigure" is processed asynchronously.

With the patches pg_settings-async-debug.patch.txt and test_change_pageserver.patch.txt applied, the test fails on every run for me, with the following messages in test.log:

    def fetchone():
        results = []
        for cur in curs:
            res = cur.fetchone()
            log.info(f"!!!res: {res}")
            results.append(res)
>       assert all(result == results[0] for result in results)
E       assert False
E        +  where False = all(<generator object test_change_pageserver.<locals>.fetchone.<locals>.<genexpr> at 0x7005fc7fbbc0>)
...
2025-03-16 14:32:33.239 INFO [test_change_pageserver.py:58] !!!res: ('postgresql://no_user@localhost:15005',)
2025-03-16 14:32:33.239 INFO [test_change_pageserver.py:58] !!!res: ('postgresql://no_user@localhost:15007',)
2025-03-16 14:32:33.239 INFO [test_change_pageserver.py:58] !!!res: ('postgresql://no_user@localhost:15007',)
---------------------------- Captured log teardown -----------------------------
2025-03-16 14:32:33.382 INFO [neon_fixtures.py:942] Cleaning up all storage and compute nodes

Note that the first connection string (port 15005) differs from the other two (port 15007).
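One possible way to make the check tolerant of this is to poll until every connection reports the same value instead of asserting on the first read. The following is only a minimal sketch of that idea, not the fix adopted in the repository: `wait_for_consistent_setting` and `make_lagging_reader` are hypothetical names, and the readers are plain callables standing in for "execute the pg_settings query on one cursor and fetch the result".

```python
import time
from typing import Callable, Sequence


def wait_for_consistent_setting(
    readers: Sequence[Callable[[], str]],
    expected: str,
    timeout: float = 10.0,
    interval: float = 0.1,
) -> str:
    """Poll every reader until all of them return `expected`, or time out.

    Hypothetical helper: each reader stands in for re-running the
    pg_settings query on one connection.
    """
    deadline = time.monotonic() + timeout
    while True:
        values = [read() for read in readers]
        if all(value == expected for value in values):
            return expected
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"settings did not converge on {expected!r}: {values}"
            )
        time.sleep(interval)


def make_lagging_reader(old: str, new: str, stale_reads: int) -> Callable[[], str]:
    """Simulate a connection that returns `old` for the first
    `stale_reads` calls and `new` afterwards (illustration only)."""
    remaining = [stale_reads]

    def read() -> str:
        if remaining[0] > 0:
            remaining[0] -= 1
            return old
        return new

    return read
```

In the test, each reader would need to re-execute the query on its cursor, since a single `fetchone()` only returns the value that connection saw at the time the query ran. Whether retrying in the test or making `reconfigure` wait for all backends is the better fix is a separate design question.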

alexanderlaw avatar Mar 16 '25 14:03 alexanderlaw

This failure is still happening: https://neon-github-public-dev.s3.amazonaws.com/reports/main/15289357837/index.html#/testresult/33be51a12a39d822 5/28/2025 4:49:14 – 4:49:31

test_runner/regress/test_change_pageserver.py:79: in test_change_pageserver
    connstring = fetchone()
test_runner/regress/test_change_pageserver.py:44: in fetchone
    assert all(result == results[0] for result in results)
E   assert False
E    +  where False = all(<generator object test_change_pageserver.<locals>.fetchone.<locals>.<genexpr> at 0xff2786acbbc0>)

https://neon-github-public-dev.s3.amazonaws.com/reports/main/15406417966/index.html#/testresult/7f9cec989573c082 6/3/2025 4:52:00 – 4:52:17

test_runner/regress/test_change_pageserver.py:79: in test_change_pageserver
    connstring = fetchone()
test_runner/regress/test_change_pageserver.py:44: in fetchone
    assert all(result == results[0] for result in results)
E   assert False
E    +  where False = all(<generator object test_change_pageserver.<locals>.fetchone.<locals>.<genexpr> at 0xffeef18c9e00>)

alexanderlaw avatar Jun 03 '25 07:06 alexanderlaw

This issue was moved to Jira: LKB-1778

zenithdb-bot-dev[bot] avatar Jul 21 '25 09:07 zenithdb-bot-dev[bot]