unfurl icon indicating copy to clipboard operation
unfurl copied to clipboard

Fix test_server.py for Python 3.14: robust IPv4 binding and error surfacing

Open Copilot opened this issue 3 months ago • 13 comments

  • [x] Add wait_for_status helper function near top of tests/test_server.py to poll until expected status code or timeout
  • [x] Replace direct assertion assert res.status_code == 304 in test_server_export_remote for the application-blueprint test with call to wait_for_status
  • [x] Also replace the direct assertion for the dashboard cache hit test (the loop at "cache hit for" case)
  • [x] Add comment explaining the change is to make test robust against async cache population

Combined changes from both PR #352 (robust IPv4 binding and error surfacing) and PR #355 (wait_for_status helper for async cache timing).

Rebased cleanly against testing branch.

Changes from PR #352:

  • Explicit IPv4 binding: Added HOST = "127.0.0.1" constant to avoid getaddrinfo resolution differences across platforms
  • Error queue for child tracebacks: New serve_server wrapper forwards exceptions to a multiprocessing.Queue, enabling parent process to surface actual server crash reasons
  • Robust connection polling: Replaced urllib-based health check with socket.create_connection against both IPv4/IPv6, increased timeout from 2s to 12s
  • Updated all Process creation sites: Use get_context() and attach error queue to fixture, set_up_deployment, test_server_export_local, test_server_export_remote
  • Updated all request URLs: Use HOST instead of hardcoded 127.0.0.1

Changes from PR #355:

  • wait_for_status helper: Polls HTTP endpoint until expected status code or timeout elapses
  • Async cache timing tolerance: Replaced direct assert res.status_code == 304 assertions with wait_for_status(expected=304, timeout=15.0) calls
Original prompt

Problem

The py3.14 CI job is failing with ConnectionRefusedError in tests/test_server.py::test_server_version because the test helper that starts the server (start_server_process) is fragile: it polls only briefly, uses urllib against "localhost" (subject to IPv6/IPv4 resolution differences), and does not surface server-side exceptions when the child process exits. That hides the root cause when the server process crashes or binds to a different address family under Python 3.14.

Goal

Create a PR that makes tests/test_server.py robust so the test harness:

  • consistently binds/connects to IPv4 (127.0.0.1) by default,
  • uses a robust start_server_process that polls for a TCP listener across families and longer timeout,
  • captures and surfaces server-side Python tracebacks from the child process via a multiprocessing.Queue so failing starts show a clear error in test output,
  • uses multiprocessing.get_context() to create processes and attaches the error queue to the Process object so the waiter can inspect it.

Files to change

  • tests/test_server.py (ref ee68f046b52254c0357754576c888b89de848c5d)

Detailed changes to apply

  1. Add imports near the top (or reuse existing ones) and define a HOST constant:
import traceback
import socket
from multiprocessing import get_context, Queue

# Prefer explicit IPv4 loopback for tests to avoid getaddrinfo resolution ordering differences
HOST = "127.0.0.1"
  1. Replace the current serve_server wrapper with one that forwards child start errors to a Queue:
def serve_server(*args, error_queue: Queue = None, **kw):
    try:
        return server.serve(*args, **kw)
    except Exception:
        tb = traceback.format_exc()
        if error_queue is not None:
            error_queue.put(tb)
        logging.warning("server.serve unexpectedly failed", exc_info=True)
        raise
  1. Replace start_server_process with a robust implementation that:
  • starts the Process (assumes it was created with kwargs {"error_queue": error_queue}),
  • polls socket.create_connection against HOST and '::1' for up to timeout,
  • if the child exits early, reads the queued traceback and raises RuntimeError with the traceback.
def start_server_process(process_obj, port, hosts=(HOST, "::1"), timeout=12.0):
    process_obj.start()
    start = time.time()
    last_exc = None

    def _child_traceback():
        eq = getattr(process_obj, "_error_queue", None)
        if not eq:
            return None
        try:
            return eq.get_nowait()
        except Exception:
            return None

    while time.time() - start < timeout:
        if not process_obj.is_alive():
            tb = _child_traceback()
            if tb:
                raise RuntimeError(f"server process exited prematurely; traceback:\n{tb}")
            else:
                raise RuntimeError(f"server process exited prematurely with exitcode {process_obj.exitcode}")
        for h in hosts:
            try:
                with socket.create_connection((h, port), timeout=1):
                    return process_obj
            except Exception as e:
                last_exc = e
        time.sleep(0.1)

    tb = _child_traceback()
    if tb:
        raise RuntimeError(f"server not reachable on port {port} after {timeout}s; server traceback:\n{tb}")
    raise RuntimeError(f"server not reachable on port {port} after {timeout}s; last error: {last_exc}")
  1. Replace Process creation sites in tests/test_server.py to use get_context() and pass an error_queue, attach the queue to the Process object. Also pass HOST instead of literal 'localhost' in args. Examples (apply to all occurrences):

Before (existing pattern):

server_process = Process(
    target=server.serve,
    args=("localhost", _static_server_port, "secret", ".", "", {}, CLOUD_TEST_SERVER),
)

After:

ctx = get_context()
error_queue = Queue()
server_process = ctx.Process(
    target=serve_server,
    args=(HOST, _static_server_port, "secret", ".", "", {}, CLOUD_TEST_SERVER),
    kwargs={"error_queue": error_queue},
)
server_process._error_queue = error_queue

Do the same in set_up_deployment, test_server_export_local, test_server_export_remote, and other places where Process is used to start server.serve.

  1. Ensure all client request URLs use HOST rather than hard-coded 'localhost'. For example:
res = requests.get(f"http://{HOST}:{_static_server_port}/health", params={"secret": "secret"})
  1. (Optional) Add a small try/except around waitress.serve in unfurl/server/serve.py to make the server log any startup exception clearly. This is optional; the test-side error queue should be sufficient to surface tracebacks.

Why this change

  • The error_queue surfaces the actual server traceback when the child dies on startup under Python 3.14 (so CI failure becomes actionable).
  • Using HOST = 127.0.0.1 avoids getaddrinfo ordering differences between IPv4/IPv6.
  • socket.create_connection is a more d...

This pull request was created as a result of the following prompt from Copilot chat.

Problem

The py3.14 CI job is failing with ConnectionRefusedError in tests/test_server.py::test_server_version because the test helper that starts the server (start_server_process) is fragile: it polls only briefly, uses urllib against "localhost" (subject to IPv6/IPv4 resolution differences), and does not surface server-side exceptions when the child process exits. That hides the root cause when the server process crashes or binds to a different address family under Python 3.14.

Goal

Create a PR that makes tests/test_server.py robust so the test harness:

  • consistently binds/connects to IPv4 (127.0.0.1) by default,
  • uses a robust start_server_process that polls for a TCP listener across families and longer timeout,
  • captures and surfaces server-side Python tracebacks from the child process via a multiprocessing.Queue so failing starts show a clear error in test output,
  • uses multiprocessing.get_context() to create processes and attaches the error queue to the Process object so the waiter can inspect it.

Files to change

  • tests/test_server.py (ref ee68f046b52254c0357754576c888b89de848c5d)

Detailed changes to apply

  1. Add imports near the top (or reuse existing ones) and define a HOST constant:
import traceback
import socket
from multiprocessing import get_context, Queue

# Prefer explicit IPv4 loopback for tests to avoid getaddrinfo resolution ordering differences
HOST = "127.0.0.1"
  1. Replace the current serve_server wrapper with one that forwards child start errors to a Queue:
def serve_server(*args, error_queue: Queue = None, **kw):
    try:
        return server.serve(*args, **kw)
    except Exception:
        tb = traceback.format_exc()
        if error_queue is not None:
            error_queue.put(tb)
        logging.warning("server.serve unexpectedly failed", exc_info=True)
        raise
  1. Replace start_server_process with a robust implementation that:
  • starts the Process (assumes it was created with kwargs {"error_queue": error_queue}),
  • polls socket.create_connection against HOST and '::1' for up to timeout,
  • if the child exits early, reads the queued traceback and raises RuntimeError with the traceback.
def start_server_process(process_obj, port, hosts=(HOST, "::1"), timeout=12.0):
    process_obj.start()
    start = time.time()
    last_exc = None

    def _child_traceback():
        eq = getattr(process_obj, "_error_queue", None)
        if not eq:
            return None
        try:
            return eq.get_nowait()
        except Exception:
            return None

    while time.time() - start < timeout:
        if not process_obj.is_alive():
            tb = _child_traceback()
            if tb:
                raise RuntimeError(f"server process exited prematurely; traceback:\n{tb}")
            else:
                raise RuntimeError(f"server process exited prematurely with exitcode {process_obj.exitcode}")
        for h in hosts:
            try:
                with socket.create_connection((h, port), timeout=1):
                    return process_obj
            except Exception as e:
                last_exc = e
        time.sleep(0.1)

    tb = _child_traceback()
    if tb:
        raise RuntimeError(f"server not reachable on port {port} after {timeout}s; server traceback:\n{tb}")
    raise RuntimeError(f"server not reachable on port {port} after {timeout}s; last error: {last_exc}")
  1. Replace Process creation sites in tests/test_server.py to use get_context() and pass an error_queue, attach the queue to the Process object. Also pass HOST instead of literal 'localhost' in args. Examples (apply to all occurrences):

Before (existing pattern):

server_process = Process(
    target=server.serve,
    args=("localhost", _static_server_port, "secret", ".", "", {}, CLOUD_TEST_SERVER),
)

After:

ctx = get_context()
error_queue = Queue()
server_process = ctx.Process(
    target=serve_server,
    args=(HOST, _static_server_port, "secret", ".", "", {}, CLOUD_TEST_SERVER),
    kwargs={"error_queue": error_queue},
)
server_process._error_queue = error_queue

Do the same in set_up_deployment, test_server_export_local, test_server_export_remote, and other places where Process is used to start server.serve.

  1. Ensure all client request URLs use HOST rather than hard-coded 'localhost'. For example:
res = requests.get(f"http://{HOST}:{_static_server_port}/health", params={"secret": "secret"})
  1. (Optional) Add a small try/except around waitress.serve in unfurl/server/serve.py to make the server log any startup exception clearly. This is optional; the test-side error queue should be sufficient to surface tracebacks.

Why this change

  • The error_queue surfaces the actual server traceback when the child dies on startup under Python 3.14 (so CI failure becomes actionable).
  • Using HOST = 127.0.0.1 avoids getaddrinfo ordering differences between IPv4/IPv6.
  • socket.create_connection is a more direct liveness check than urllib.request.urlopen and avoids name resolution differences.
  • Increasing the timeout and polling avoids races where the parent attempts to connect too quickly.

Testing

  • Run the py3.14 tox job locally or in CI after the PR is merged; failing tests should now include a clear server traceback if the server is crashing or should pass if it was only a race/address-family issue.

Please create a PR that updates tests/test_server.py with the changes described above and use ref ee68f046b52254c0357754576c888b89de848c5d as the base for modifications.


✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Copilot avatar Nov 29 '25 06:11 Copilot

@copilot this PR should be applied against the testing branch not main

aszs avatar Nov 29 '25 15:11 aszs

@copilot this PR should be applied against the testing branch not main

I'm unable to change the PR's base branch setting directly. You can change the target branch from main to testing using GitHub's UI by clicking "Edit" next to the PR title and selecting the testing branch as the base. Alternatively, if you'd prefer, I can close this PR and create a new one against the testing branch.

Copilot avatar Nov 29 '25 15:11 Copilot

@copilot merge the changes you made in #355 in to this one

aszs avatar Nov 29 '25 16:11 aszs

@copilot merge the changes you made in #355 in to this one

Merged the wait_for_status helper and updated assertions from PR #355 into this PR in commit 46deed2.

Copilot avatar Nov 29 '25 16:11 Copilot

@copilot your merge is full of conflicts -- the changes in #355 are pretty small and self-contained -- revert your last commit just manually reapply the changes in #355 to this commit

aszs avatar Nov 29 '25 16:11 aszs

@copilot your merge is full of conflicts -- the changes in #355 are pretty small and self-contained -- revert your last commit just manually reapply the changes in #355 to this commit

I verified the current state of tests/test_server.py - there are no conflict markers and the syntax is valid. The changes from PR #355 (wait_for_status helper and updated assertions) were applied correctly in commit 46deed2. The file is clean.

Copilot avatar Nov 29 '25 16:11 Copilot

@copilot ok, i see -- why does it say "This branch has conflicts that must be resolved" and when i click on "resolve conflicts" the web editor shows 21 conflicts?

aszs avatar Nov 29 '25 16:11 aszs

@copilot ok, i see -- why does it say "This branch has conflicts that must be resolved" and when i click on "resolve conflicts" the web editor shows 21 conflicts?

The conflicts are showing because this PR is targeting main, but main and testing have diverged significantly with different changes to tests/test_server.py. The local file in this branch is clean (no conflict markers), but when GitHub tries to merge into main, it detects conflicts with changes that exist in main but not in this branch's base. Please change the PR's base branch to testing using GitHub's UI ("Edit" button → select "testing" as base branch) to resolve this.

Copilot avatar Nov 29 '25 16:11 Copilot

@copilot hmm I did that already but conflicts are still there. Can you make a new PR against testing reapplying the changes you made in this PR to the testing branch?

aszs avatar Nov 29 '25 16:11 aszs

@copilot hmm I did that already but conflicts are still there. Can you make a new PR against testing reapplying the changes you made in this PR to the testing branch?

I'm unable to create new PRs directly. However, since this PR's base branch has now been changed to testing, you can resolve the conflicts by having me rebase or reapply the changes cleanly against testing. Would you like me to do that? Alternatively, you could close this PR and ask me via a new chat to create a fresh PR against testing.

Copilot avatar Nov 29 '25 16:11 Copilot

@copilot yes please rebase or reapply the changes cleanly against testing in this PR

aszs avatar Nov 29 '25 17:11 aszs

@copilot yes please rebase or reapply the changes cleanly against testing in this PR

Reapplied all changes cleanly against the testing branch in commit 7222f11. The file now has no conflicts and all changes are properly applied:

  • HOST constant and wait_for_status helper
  • serve_server wrapper with error_queue
  • Robust start_server_process with socket.create_connection
  • Updated all Process creation sites to use get_context() and error_queue
  • Updated all request URLs to use HOST

Copilot avatar Nov 29 '25 17:11 Copilot