
Worker mode does not use updated endpoints in Kubernetes cluster

Open orlu-nd opened this issue 7 months ago • 11 comments

What happened?

Good day,

We’ve been using FrankenPHP (in non-worker mode) for quite some time on our Kubernetes cluster with great success. It’s been running smoothly and reliably.

However, when we enable worker mode, FrankenPHP does not appear to pick up updated Kubernetes endpoints, such as new pod IPs for services like PostgreSQL and/or RabbitMQ. We use PostgreSQL and RabbitMQ in an HA setup with multiple replicas. For both, the Kubernetes service endpoint always refers to a single replica. Whenever this endpoint changes within the cluster (e.g. when we kill the main PostgreSQL/RabbitMQ replica), service discovery becomes stale or broken, even though everything continues to work as expected in non-worker mode.

We’re unsure if this behavior stems from a bug, a misconfiguration, or a limitation by design, and would appreciate any guidance or help you can offer. Especially since worker mode offers significant performance improvements that we’d like to benefit from.

Reproducible With Docker Images:

dunglas/frankenphp:1.4.2-php8.4-alpine  
dunglas/frankenphp:1.5.0-php8.3-alpine  
dunglas/frankenphp:1.5.0-php8.4-alpine  
dunglas/frankenphp:1.5.0-php8.3-bookworm  
dunglas/frankenphp:1.5.0-php8.4-bookworm

Reproducible with the following Caddy configs:

Worker mode enabled:

{
	{$CADDY_GLOBAL_OPTIONS}
        frankenphp {
             worker {
                  file ./public/index.php
                  num 10
                  env APP_RUNTIME "Runtime\FrankenPhpSymfony\Runtime"
             }
        }
	order php_server before file_server
}

{$CADDY_EXTRA_CONFIG}

{$SERVER_NAME:localhost} {
	root * public/
	encode zstd br gzip
	{$CADDY_SERVER_EXTRA_DIRECTIVES}
	php_server
}

Worker mode disabled (this solves our issue mentioned above):

{
	{$CADDY_GLOBAL_OPTIONS}
	frankenphp
	order php_server before file_server
}

{$CADDY_EXTRA_CONFIG}

{$SERVER_NAME:localhost} {
	root * public/
	encode zstd br gzip
	{$CADDY_SERVER_EXTRA_DIRECTIVES}
	php_server
}

PHP configuration:

expose_php = 0
date.timezone = UTC
apc.enable_cli = 1
session.use_strict_mode = 1
zend.detect_unicode = 0
display_errors = 0

opcache.preload_user = root
opcache.preload = /app/config/preload.php
realpath_cache_size = 4096K
realpath_cache_ttl = 600
opcache.interned_strings_buffer = 16
opcache.max_accelerated_files = 20000
opcache.memory_consumption = 256
opcache.enable_file_override = 1
opcache.validate_timestamps = 0

variables_order = EGPCS

Kubernetes cluster environment:

  • PostgreSQL Operator: https://cloudnative-pg.io/
  • RabbitMQ Operator: https://www.rabbitmq.com/kubernetes/operator/using-operator/

PostgreSQL deployment:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: db-cluster
spec:
  instances: 3
  storage:
    storageClass: longhorn-static
    size: 10Gi
  walStorage:
    storageClass: longhorn-static
    size: 5Gi

RabbitMQ deployment:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
  labels:
    app.kubernetes.io/instance: rabbitmq
spec:
  replicas: 3
  persistence:
    storageClassName: longhorn-static
    storage: 20Gi
  service:
    type: LoadBalancer
  resources:
    requests:
      memory: 2Gi
    limits:
      memory: 2Gi

Application Environment Config

We use a ConfigMap to inject application-level environment variables:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-configmap
data:
  APP_ENV: dev
  APP_ENVIRONMENT: development
  LOCK_DSN: semaphore
  CORS_ALLOW_ORIGIN: '^https?://(localhost|127\.0\.0\.1)(:[0-9]+)?$'
  DATABASE_HOST: db-cluster-rw
  DATABASE_RO_HOST: db-cluster-ro
  RABBITMQ_HOST: rabbitmq
  [...]

These values are correctly visible within the pod via printenv, for example:

DATABASE_HOST=db-cluster-rw
DATABASE_RO_HOST=db-cluster-ro
RABBITMQ_HOST=rabbitmq

Network Resolution

DNS resolution seems to work fine even from within the pod:

/app # ping db-cluster-rw
PING db-cluster-rw (10.96.240.237): 56 data bytes

This refers to a standard Kubernetes service endpoint.
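
For completeness, the same check from PHP inside the container (a minimal snippet) resolves to the current service IP as well:

<?php
// Quick sanity check from inside the container: resolve the service hostname from PHP.
echo gethostbyname(getenv('DATABASE_HOST') ?: 'db-cluster-rw'); // e.g. 10.96.240.237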

Expected Behavior

In worker mode, we expect the application to properly recognize updated service endpoints (e.g., whenever we kill/delete a replica and a service/endpoint IP changes), in line with what works in non-worker mode. Currently, these endpoint changes are not picked up, and requests simply result in:

An exception occurred while executing a query: SQLSTATE[HY000]: General error: 7 no connection to the server

The issue remains unresolved until the FrankenPHP pod is killed or restarted.

Request for help

Could you confirm whether this is:

  • A bug in FrankenPHP worker mode
  • A configuration mistake on our side
  • A known limitation (and if so, is there a recommended workaround?)

We’re happy to test any patches, experimental flags, or configuration tweaks that might help isolate or resolve this issue.

Thanks in advance for your support and for the amazing work on FrankenPHP!

orlu-nd avatar May 14 '25 13:05 orlu-nd

Not an expert, but did you try to reproduce the issue with a very small application not based on any framework? Maybe the limitation is more in the framework not re-reading env variables than in FrankenPHP itself. Maybe try something like the following as your worker script:

public/index.php
<?php
// public/index.php

// Prevent worker script termination when a client connection is interrupted
ignore_user_abort(true);

$handler = static function () {
    echo getenv('DATABASE_HOST');
};

$maxRequests = (int)($_SERVER['MAX_REQUESTS'] ?? 0);
for ($nbRequests = 0; !$maxRequests || $nbRequests < $maxRequests; ++$nbRequests) {
    $keepRunning = \frankenphp_handle_request($handler);

    // Call the garbage collector to reduce the chances of it being triggered in the middle of a page generation
    gc_collect_cycles();

    if (!$keepRunning) break;
}

If the above script returns the correct env var, it's not an issue in FrankenPHP but rather in the framework you use to build the application.

PS: What should work for you is to gracefully restart the workers, as documented in https://frankenphp.dev/docs/worker/#restart-workers-manually, via the Caddy admin API:

curl -X POST http://localhost:2019/frankenphp/workers/restart

Run this after the env vars have been updated/changed.
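
If you want to trigger that restart from inside the app (for example after detecting a dead connection), a rough, untested sketch could look like this; it assumes the Caddy admin API is reachable at localhost:2019 from within the pod:

<?php
// Hypothetical helper: ask Caddy's admin API to gracefully restart the FrankenPHP workers.
// Assumes the admin endpoint listens on localhost:2019 (the default) inside the container.
function restartFrankenPhpWorkers(string $adminUrl = 'http://localhost:2019/frankenphp/workers/restart'): bool
{
    $ch = curl_init($adminUrl);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 5,
    ]);
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
    curl_close($ch);

    return $status >= 200 && $status < 300;
}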

alexander-schranz avatar May 14 '25 16:05 alexander-schranz

Thank you, @alexander-schranz, for your response.

In this case, we are using Symfony 7.2, and even with the most minimal application setup, the issue still persists.

To clarify further: the problematic environment variables are hostnames (e.g. DATABASE_HOST='db-cluster-rw'). In the index.php example you provided, the environment variable itself never changes and in all cases remains db-cluster-rw, which is expected.

The issue arises when the IP address behind the hostname changes within the Kubernetes cluster, for example, from 10.0.0.1 to 10.0.0.3. In such cases, FrankenPHP does not appear to resolve or reconnect to the updated IP, continuing to use the outdated one.

Interestingly, if we create a demo script using gethostbyname(getenv('DATABASE_HOST')), the hostname resolves correctly to the updated IP. So DNS resolution itself works as expected inside the container.

To be clear: the environment variable value remains unchanged (which is fine), but FrankenPHP does not seem to re-resolve the hostname to its current IP after the endpoint updates, which leads to stale or broken connections.

To help illustrate the issue more clearly, I’ll write a simple reproducible PHP script that demonstrates the behavior and will share it here tomorrow.

Thanks again for the help, I really appreciate it! OT: Also, many thanks for your work on Sulu, we truly love it.

orlu-nd avatar May 14 '25 17:05 orlu-nd

Interestingly, if we create a demo script using gethostbyname(getenv('DATABASE_HOST')), the hostname resolves correctly to the updated IP. So DNS resolution itself works as expected inside the container.

That sounds more like an application or framework issue if gethostbyname(getenv('DATABASE_HOST')) works as expected. Still, there could be DNS caches inside curl, PHP, or a PHP extension that might be the cause.

Does the workers/restart resolve it?

Try a really minimal, non-Symfony application, like the simple script above, without any third-party code, to make sure the issue isn't somewhere in the framework or in third-party code.

alexander-schranz avatar May 14 '25 17:05 alexander-schranz

After extensive troubleshooting and analysis, we have concluded that the issue appears to be rooted more deeply in the PHP runtime itself, or specifically in its integration within php-runtime/frankenphp-symfony, rather than being a FrankenPHP problem.

Our conclusion:

When worker mode is enabled in FrankenPHP, the initial database connection is established once by the worker and stored in the global runtime thread and memory. This connection is then shared across request handler threads to improve performance.

While this is effective under normal conditions, it introduces a critical failure case in dynamic containerized environments like Kubernetes:

  1. The database connection is a long-lived socket between the FrankenPHP worker and the database pod.
  2. If the database pod fails or is rescheduled, the socket is closed on the DB side but remains open and unacknowledged on the PHP (client) side.
  3. This leads to a half-open TCP socket: still open from the runtime's perspective but disconnected on the DB side.
  4. Because this connection resides in the shared global context, all request threads attempt to reuse it and end up waiting indefinitely for a response that will never arrive.
  5. Internally, the global thread continues to write to this socket, unaware that it has become a dead channel.
  6. As a result, the global worker thread becomes unresponsive, holding onto requests without returning control to the handler threads.
  7. New incoming requests are routed to other (possibly pre-heated) workers, which eventually suffer the same fate, leading to a cascading effect of hanging processes across the worker pool.
  8. On the client side, this manifests as timeout errors in the browser, since the HTTP connection breaks due to the lack of a response.

Verification with FRANKENPHP_LOOP_MAX

To verify our conclusion, we tested the behavior using the environment variable:

FRANKENPHP_LOOP_MAX=1

This forces each worker to restart after every single request, thereby fully resetting the PHP runtime, including its global context and database connection. With this setting in place, the issue no longer occurs, which strongly supports the theory that stale or broken socket connections persist across requests when workers are long-lived.

This also confirms that restarting the worker, as suggested earlier by @alexander-schranz, effectively mitigates the problem by clearing out the global state and reinitializing the connection.

Additional verification using a simple index.php

As per your suggestion, @alexander-schranz, we also tested the following standalone PHP script, which explicitly sets up a new PostgreSQL connection for each request and does not rely on the global PHP runtime:

<?php
// public/index.php

ignore_user_abort(true);

$handler = static function () {
    echo "<html>";

    echo "DATABASE_HOST=" . getenv('DATABASE_HOST') . "<br>";
    echo "DATABASE_USERNAME=" . getenv('DATABASE_USERNAME') . "<br>";
    echo "DATABASE_PASSWORD=" . getenv('DATABASE_PASSWORD') . "<br>";
    echo "DATABASE_DBNAME=" . getenv('DATABASE_DBNAME') . "<br>";

    echo "Hostname: " . gethostbyname(getenv('DATABASE_HOST')) . "<br><br>"; // Kubernetes service IP (stable)

    try {
        $dsn = sprintf("pgsql:host=%s;port=%s;dbname=%s", getenv('DATABASE_HOST'), '5432', getenv('DATABASE_DBNAME'));

        $pdo = new PDO($dsn, getenv('DATABASE_USERNAME'), getenv('DATABASE_PASSWORD'), [
            PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION
        ]);

        echo "Connected successfully to PostgreSQL server at " . gethostbyname(getenv('DATABASE_HOST')) . "<br>";

        $stmt = $pdo->query("SELECT inet_client_addr(), inet_server_addr()");
        $result = $stmt->fetch(PDO::FETCH_ASSOC);

        echo "Server IP: " . $result['inet_server_addr'] . "<br>"; // Actual DB pod IP (changes)

        $stmt = $pdo->query('SELECT version()');
        $version = $stmt->fetchColumn();
        echo "PostgreSQL version: $version<br>";

    } catch (PDOException $e) {
        echo "Connection failed: " . $e->getMessage() . "<br>";
    }

    echo "</html>";
};

$maxRequests = (int)($_SERVER['MAX_REQUESTS'] ?? 0);
for ($nbRequests = 0; !$maxRequests || $nbRequests < $maxRequests; ++$nbRequests) {
    $keepRunning = \frankenphp_handle_request($handler);
    gc_collect_cycles();
    if (!$keepRunning) break;
}

This script confirms the root issue: when each request creates a fresh connection, everything works correctly, even across DB pod restarts. This reinforces the conclusion that the problem stems from long-lived connections retained in the shared global runtime, which are not revalidated when the backend pod disappears or changes IP.
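
As a possible mitigation (not a fix for the underlying behavior), the worker script could keep the shared connection but revalidate it on every request and reconnect when it has gone stale. A minimal sketch, reusing the same env vars as above (note that the probe itself may still block on a half-open socket until the kernel's TCP timeouts kick in):

<?php
// Sketch: reuse one connection across requests, but revalidate it on every request
// and reconnect when it has gone stale (e.g. after the DB pod was rescheduled).

ignore_user_abort(true);

$connect = static function (): PDO {
    $dsn = sprintf("pgsql:host=%s;port=%s;dbname=%s", getenv('DATABASE_HOST'), '5432', getenv('DATABASE_DBNAME'));

    return new PDO($dsn, getenv('DATABASE_USERNAME'), getenv('DATABASE_PASSWORD'), [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION
    ]);
};

$pdo = $connect();

$handler = static function () use (&$pdo, $connect) {
    try {
        $pdo->query('SELECT 1'); // cheap liveness probe; throws if the connection is dead
    } catch (PDOException $e) {
        $pdo = $connect(); // stale connection: reconnect before handling the request
    }

    $version = $pdo->query('SELECT version()')->fetchColumn();
    echo "PostgreSQL version: $version";
};

$maxRequests = (int)($_SERVER['MAX_REQUESTS'] ?? 0);
for ($nbRequests = 0; !$maxRequests || $nbRequests < $maxRequests; ++$nbRequests) {
    $keepRunning = \frankenphp_handle_request($handler);
    gc_collect_cycles();
    if (!$keepRunning) break;
}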

Request for help

We’d greatly appreciate help to verify whether the above assumptions are accurate and whether this behavior is expected or avoidable in the current architecture. Any confirmation, suggested workarounds, or deeper insight would be very helpful.

Once again, perhaps this is ultimately more of a PHP runtime issue, or related to the php-runtime/frankenphp-symfony integration, but in practice, it manifests as a FrankenPHP issue due to the worker model and global runtime behavior.

orlu-nd avatar May 15 '25 09:05 orlu-nd

This sounds like the runtime and FrankenPHP behave exactly as desired and documented. It's your code that doesn't handle worker mode properly.

henderkes avatar May 17 '25 09:05 henderkes

Seems related to #290.

Each thread should have its own DB connection. Do you have the same issue if you re-use the connection in the simple index.php case (something like this):

<?php
// public/index.php

ignore_user_abort(true);

$dsn = sprintf("pgsql:host=%s;port=%s;dbname=%s", getenv('DATABASE_HOST'), '5432', getenv('DATABASE_DBNAME'));

$pdo = new PDO($dsn, getenv('DATABASE_USERNAME'), getenv('DATABASE_PASSWORD'), [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION
]);

$handler = static function () use ($pdo) {
    echo "<html>";

    echo "DATABASE_HOST=" . getenv('DATABASE_HOST') . "<br>";
    echo "DATABASE_USERNAME=" . getenv('DATABASE_USERNAME') . "<br>";
    echo "DATABASE_PASSWORD=" . getenv('DATABASE_PASSWORD') . "<br>";
    echo "DATABASE_DBNAME=" . getenv('DATABASE_DBNAME') . "<br>";

    echo "Hostname: " . gethostbyname(getenv('DATABASE_HOST')) . "<br><br>"; // Kubernetes service IP (stable)

    try {
        echo "Connected successfully to PostgreSQL server at " . gethostbyname(getenv('DATABASE_HOST')) . "<br>";

        $stmt = $pdo->query("SELECT inet_client_addr(), inet_server_addr()");
        $result = $stmt->fetch(PDO::FETCH_ASSOC);

        echo "Server IP: " . $result['inet_server_addr'] . "<br>"; // Actual DB pod IP (changes)

        $stmt = $pdo->query('SELECT version()');
        $version = $stmt->fetchColumn();
        echo "PostgreSQL version: $version<br>";

    } catch (PDOException $e) {
        echo "Connection failed: " . $e->getMessage() . "<br>";
    }

    echo "</html>";
};

$maxRequests = (int)($_SERVER['MAX_REQUESTS'] ?? 0);
for ($nbRequests = 0; !$maxRequests || $nbRequests < $maxRequests; ++$nbRequests) {
    $keepRunning = \frankenphp_handle_request($handler);
    gc_collect_cycles();
    if (!$keepRunning) break;
}

AlliBalliBaba avatar May 19 '25 18:05 AlliBalliBaba

Heh; this sounds like you (re)discovered how network sockets work. Nothing is communicated when a socket simply goes away; it is up to the application/server to gracefully close connections.

That being said, it sounds like your database configuration in Kubernetes is not ideal. Generally speaking, databases shouldn't change their IP addresses during operation, yet when you kill/reschedule a database pod, that is exactly what happens. The database therefore can't handle this gracefully: your packets leave your application and go into a black hole, and there is nothing to tell the application that the socket has been reset once the new pod comes back up.

Ideally, when using a stateful service, like a database, you should use proper IPAM to ensure that your database acts like it is designed to (such as having a stable IP address).

withinboredom avatar May 22 '25 20:05 withinboredom

That being said, I was assuming you were using a headless service to get the IP address of the pod that gets rescheduled. If that is the case, another solution could be to use the ClusterIP so that the IP address remains stable. This has a small overhead, but will handle the connection reset when the pod shuts down. If you are using something like Cilium, you can reroute the traffic to the nearest pod and get a direct connection anyway.

withinboredom avatar May 22 '25 21:05 withinboredom

@AlliBalliBaba Thanks for your help! You’re right! Using your example results in the same error on our end: a stale connection that isn’t restored until FrankenPHP is restarted.

Script used for testing:

<?php
// public/index.php

ignore_user_abort(true);

$dsn = sprintf("pgsql:host=%s;port=%s;dbname=%s", getenv('DATABASE_HOST'), '5432', getenv('DATABASE_NAME'));

$pdo = new PDO($dsn, getenv('DATABASE_USERNAME'), getenv('DATABASE_PASSWORD'), [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION
]);

$handler = static function () use ($pdo) {
    echo "<html>";

    echo "DATABASE_HOST=" . getenv('DATABASE_HOST') . "<br>";
    echo "DATABASE_USERNAME=" . getenv('DATABASE_USERNAME') . "<br>";
    echo "DATABASE_PASSWORD=" . getenv('DATABASE_PASSWORD') . "<br>";
    echo "DATABASE_NAME=" . getenv('DATABASE_NAME') . "<br>";

    echo "Hostname: " . gethostbyname(getenv('DATABASE_HOST')) . "<br><br>"; // Kubernetes Service / ClusterIP (stable)

    try {
        echo "Connected successfully to PostgreSQL server at " . gethostbyname(getenv('DATABASE_HOST')) . "<br>";

        $stmt = $pdo->query("SELECT inet_client_addr(), inet_server_addr()");
        $result = $stmt->fetch(PDO::FETCH_ASSOC);

        echo "Server IP: " . $result['inet_server_addr'] . "<br>"; // Actual DB pod IP (changes)

        $stmt = $pdo->query('SELECT version()');
        $version = $stmt->fetchColumn();
        echo "PostgreSQL version: $version<br>";

    } catch (PDOException $e) {
        echo "Connection failed: " . $e->getMessage() . "<br>";
    }

    echo "</html>";
};

$maxRequests = (int)($_SERVER['MAX_REQUESTS'] ?? 0);
for ($nbRequests = 0; !$maxRequests || $nbRequests < $maxRequests; ++$nbRequests) {
    $keepRunning = \frankenphp_handle_request($handler);
    gc_collect_cycles();
    if (!$keepRunning) break;
}

orlu-nd avatar May 23 '25 06:05 orlu-nd

@withinboredom Thanks for your help! We’re indeed connecting to a Kubernetes Service with a ClusterIP. In our case, we always connect to db-cluster-rw, which consistently resolves to 10.96.240.237. The ClusterIP remains unchanged even when the database pods fail and are rescheduled (which is good).

The Kubernetes Service YAML (fetched directly from the cluster):

apiVersion: v1
kind: Service
metadata:
  annotations:
    cnpg.io/operatorVersion: 1.25.0
  creationTimestamp: "2025-04-09T09:37:00Z"
  labels:
    cnpg.io/cluster: db-cluster
  name: db-cluster-rw
  namespace: claim360
  ownerReferences:
  - apiVersion: postgresql.cnpg.io/v1
    controller: true
    kind: Cluster
    name: db-cluster
    uid: a1445632-0b9c-4666-8243-2911880697a8
  resourceVersion: "28524881"
  uid: 9d3fa245-c228-4d9f-aedb-48b97017dd5e
spec:
  clusterIP: 10.96.240.237
  clusterIPs:
  - 10.96.240.237
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: postgres
    port: 5432
    protocol: TCP
    targetPort: 5432
  selector:
    cnpg.io/cluster: db-cluster
    cnpg.io/instanceRole: primary
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

orlu-nd avatar May 23 '25 06:05 orlu-nd

Hmmm. What are you using for CNI? As soon as you send a packet to that IP and nothing is listening, your CNI should be resetting the connection so your application knows the connection is closed. BUT, this also depends on your kernel settings.

Digging up some notes from running postgres way back in the day and double-checking with chatgpt to make sure they're still relevant in 2025:

  • net.ipv4.tcp_keepalive_time=15: send the first TCP keepalive after 15 seconds instead of 3 hours.
  • net.ipv4.tcp_keepalive_intvl=3: set the interval between keepalives to 3 seconds instead of nearly 3 minutes.
  • net.ipv4.tcp_keepalive_probes=3: mark the socket as dead after 3 failures.
  • net.ipv4.tcp_retries2=5: give up on sending packets after 5 tries (~5 minutes).

Set these on the client machines. Note that this will probably affect your entire cluster and increase general inter-node traffic, but only when idle. Now, when a socket dies, you won't have to wait 3+ hours for the OS to kill an idle socket, or 15 minutes on an active one.
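
If you'd rather not touch cluster-wide sysctls, libpq exposes equivalent per-connection parameters. As far as I know, PDO's pgsql driver forwards extra DSN parameters to libpq, so something along these lines should work (a sketch; verify against your PHP/libpq versions before relying on it):

<?php
// Sketch: per-connection TCP keepalive / timeout settings via libpq connection parameters.
// Assumption: PDO's pgsql driver passes unknown DSN parameters through to libpq.
$dsn = sprintf(
    "pgsql:host=%s;port=5432;dbname=%s;connect_timeout=5;keepalives=1;keepalives_idle=15;keepalives_interval=3;keepalives_count=3",
    getenv('DATABASE_HOST'),
    getenv('DATABASE_DBNAME')
);

$pdo = new PDO($dsn, getenv('DATABASE_USERNAME'), getenv('DATABASE_PASSWORD'), [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION
]);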

withinboredom avatar May 23 '25 07:05 withinboredom