
Windows Job container on kubernetes hit asyncio.exceptions.CancelledError as soon as container starts

Open KnightArthurRen opened this issue 1 month ago • 1 comment

Bug summary

Our workflow forces us to execute flow code on Windows. I've managed to construct a Windows image roughly in the shape of:

FROM mcr.microsoft.com/windows/server:ltsc2025 AS base

RUN mkdir C:\\prefect-uv
WORKDIR C:\\prefect-uv

# ... install uv etc etc 

# copy uv files
COPY prefect3/pyproject.toml prefect3/uv.lock ./

# spin up uv environment
RUN uv sync --frozen

# copy python source code
COPY src ./src

WORKDIR C:\\prefect-uv\\src

# update path such that prefect command is reachable without need to prefix with
ENTRYPOINT ["powershell.exe", "-Command", "uv", "run"]

As soon as the container starts, I encounter this exception:

C:\prefect-uv\.venv\Lib\site-packages\tzlocal\utils.py:39: UserWarning: Timezone offset does not match system offset: -28800 != 0. Please, check your config files.
  warnings.warn(msg)
C:\prefect-uv\.venv\Lib\site-packages\tzlocal\utils.py:39: UserWarning: Timezone offset does not match system offset: -28800 != 0. Please, check your config files.
  warnings.warn(msg)
C:\prefect-uv\.venv\Lib\site-packages\pydantic\_internal\_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'default' attribute with value 'UTC' was provided to the `Field()` function, which has no effect in the context it was used. 'default' is field-specific metadata, and can only be attached to a model field using `Annotated` metadata or by assignment. This may have happened because an `Annotated` type alias using the `type` statement was used, or if the `Field()` function was attached to a single member of a union type.
  warnings.warn(
Traceback (most recent call last):
  File "C:\prefect-uv\.venv\Lib\site-packages\websockets\asyncio\client.py", line 541, in __await_impl__
    self.connection = await self.create_connection()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\prefect-uv\.venv\Lib\site-packages\websockets\asyncio\client.py", line 467, in create_connection
    _, connection = await loop.create_connection(factory, **kwargs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ContainerAdministrator\AppData\Roaming\uv\python\cpython-3.12.0-windows-x86_64-none\Lib\asyncio\base_events.py", line 1057, in create_connection
    infos = await self._ensure_resolved(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ContainerAdministrator\AppData\Roaming\uv\python\cpython-3.12.0-windows-x86_64-none\Lib\asyncio\base_events.py", line 1433, in _ensure_resolved
    return await loop.getaddrinfo(host, port, family=family, type=type,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ContainerAdministrator\AppData\Roaming\uv\python\cpython-3.12.0-windows-x86_64-none\Lib\asyncio\base_events.py", line 878, in getaddrinfo
    return await self.run_in_executor(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

and the container crashes. I have tried:

  • The same Docker image, run with a Docker-type worker spun up on a Windows machine instead of Kubernetes -> ✅
    • This tells me the Docker image is fine
  • The same K8s job template (minus the Windows OS node selector) and the same Dockerfile pattern, just built for Linux -> ✅
    • This tells me the template is fine and the Prefect variables are populated properly

So I'm a bit stuck, as the issue seems to lie in the combination of Windows architecture and the Prefect-generated job template. Have you seen this type of issue before, or tried running Prefect workflows on Windows nodes in Kubernetes?

Thank you!

Version info

Version:              3.4.20
API version:          0.8.4
Python version:       3.12.0
Git commit:           5d7d5eb6
Built:                Thu, Sep 25, 2025 09:04 PM
OS/Arch:              win32/AMD64
Profile:              local
Server type:          server
Pydantic version:     2.12.4
Server:
  Database:           sqlite
  SQLite version:     3.43.1
Integrations:
  prefect-aws:        0.5.16
  prefect-docker:     0.6.6
  prefect-kubernetes: 0.7.0
  prefect-shell:      0.3.1
  prefect-slack:      0.3.1

Additional context

No response

KnightArthurRen · Nov 10 '25 22:11

Thank you for reporting this issue! I've investigated the codebase and have some insights that may help diagnose and resolve this problem.

Analysis

The error you're encountering (asyncio.exceptions.CancelledError during getaddrinfo) suggests that DNS resolution is timing out or being cancelled while Prefect tries to establish the websocket connection it uses for event streaming.

Key Findings

  1. Websocket Usage in K8s Jobs: When flow runs execute in Kubernetes jobs, Prefect initializes an EventsWorker that establishes a websocket connection to stream events back to the Prefect API. This happens early in the job startup process.

  2. URL Construction: I verified that the websocket URL construction uses proper string operations (not os.path.join), so there's no Windows path bug affecting the URLs:

    • http_to_ws(url) converts http:// to ws:// and https:// to wss://
    • Paths are appended using string concatenation: + "/events/in"
  3. No Configurable Timeout: Currently, there's no setting to increase the websocket connection timeout, which may be too short for Windows K8s environments with slower DNS resolution. A minimal probe that reproduces the failing handshake outside of Prefect is sketched below.

Most Likely Root Causes

  1. DNS Resolution Timeout: Windows K8s nodes may have slower DNS resolution for in-cluster services, causing the websocket connection to timeout and cancel during getaddrinfo.

  2. Proxy Misconfiguration: If HTTP_PROXY/HTTPS_PROXY are set on Windows nodes without proper NO_PROXY configuration, internal cluster traffic may be incorrectly routed through a proxy that cannot resolve in-cluster service names.

  3. Windows-Specific Asyncio Behavior: Windows may handle async DNS resolution differently under Kubernetes, leading to early cancellations. The sketch after this list separates raw DNS latency from the asyncio executor path that the traceback shows being cancelled.

Debugging Steps

Please try the following to help diagnose the issue:

1. Enable Debug Logging

Add these environment variables to your K8s job template:

env:
  - name: PREFECT_LOGGING_LEVEL
    value: "DEBUG"
  - name: PREFECT_DEBUG_MODE
    value: "1"

This will show the exact websocket URL being attempted and timing information.

2. Check Network Configuration

Add a debug container or init container to test connectivity:

# Check DNS resolution
Resolve-DnsName <your-api-host>

# Test TCP connectivity
Test-NetConnection <your-api-host> -Port <your-api-port>

# Test REST API access
Invoke-WebRequest http://<your-api-host>:<your-api-port>/api/health

3. Verify Environment Variables

Print these in your container to check for proxy issues:

$env:PREFECT_API_URL
$env:HTTP_PROXY
$env:HTTPS_PROXY
$env:NO_PROXY

4. Try Using FQDN or ClusterIP

If your PREFECT_API_URL uses a short service name, try using the full FQDN:

http://<service>.<namespace>.svc.cluster.local:<port>/api

Or use the ClusterIP directly to bypass DNS entirely (temporarily for testing).

5. Check NO_PROXY Configuration

Ensure NO_PROXY includes all internal cluster domains:

NO_PROXY=127.0.0.1,localhost,.svc,.svc.cluster.local,<your-api-host>

Potential Solutions

Based on your findings from the debugging steps above, here are potential solutions:

  1. If DNS is slow: We may need to add a configurable websocket connection timeout setting to Prefect (an unsupported interim workaround to experiment with is sketched after this list)
  2. If proxy is the issue: Configure NO_PROXY properly or remove proxy settings for in-cluster traffic
  3. If using short names fails: Use full FQDN or ClusterIP in PREFECT_API_URL

Next Steps

Please share the results of the debugging steps above, particularly:

  • The exact websocket URL from debug logs
  • DNS resolution speed from inside the pod
  • Whether REST API calls work (and how quickly)
  • Your proxy environment variables

This information will help us determine whether this requires a code change in Prefect (e.g., configurable timeouts, better error messages) or is a configuration issue specific to your Windows K8s environment.

I'm happy to help further once we have more diagnostic information!