Worker mode does not use updated endpoints in Kubernetes cluster
What happened?
Good day,
We’ve been using FrankenPHP (in non-worker mode) for quite some time on our Kubernetes cluster with great success. It’s been running smoothly and reliably.
However, when we enable worker mode, FrankenPHP does not appear to pick up updated Kubernetes endpoints, such as new pod IPs for services like PostgreSQL and/or RabbitMQ. We run PostgreSQL and RabbitMQ in an HA setup with multiple replicas. For both, the Kubernetes service endpoint always refers to a single replica. Whenever this endpoint changes within the cluster (e.g. we kill the primary PostgreSQL/RabbitMQ replica), service discovery becomes stale or broken, even though everything continues to work as expected in non-worker mode.
We’re unsure whether this behavior stems from a bug, a misconfiguration, or a design limitation, and would appreciate any guidance or help you can offer, especially since worker mode offers significant performance improvements that we’d like to benefit from.
Reproducible with the following Docker images:
dunglas/frankenphp:1.4.2-php8.4-alpine
dunglas/frankenphp:1.5.0-php8.3-alpine
dunglas/frankenphp:1.5.0-php8.4-alpine
dunglas/frankenphp:1.5.0-php8.3-bookworm
dunglas/frankenphp:1.5.0-php8.4-bookworm
Reproducible with the following Caddy configs:
Worker mode enabled:
{
	{$CADDY_GLOBAL_OPTIONS}
	frankenphp {
		worker {
			file ./public/index.php
			num 10
			env APP_RUNTIME "Runtime\FrankenPhpSymfony\Runtime"
		}
	}
	order php_server before file_server
}
{$CADDY_EXTRA_CONFIG}
{$SERVER_NAME:localhost} {
	root * public/
	encode zstd br gzip
	{$CADDY_SERVER_EXTRA_DIRECTIVES}
	php_server
}
Worker mode disabled (this solves our issue mentioned above):
{
	{$CADDY_GLOBAL_OPTIONS}
	frankenphp
	order php_server before file_server
}
{$CADDY_EXTRA_CONFIG}
{$SERVER_NAME:localhost} {
	root * public/
	encode zstd br gzip
	{$CADDY_SERVER_EXTRA_DIRECTIVES}
	php_server
}
PHP configuration:
expose_php = 0
date.timezone = UTC
apc.enable_cli = 1
session.use_strict_mode = 1
zend.detect_unicode = 0
display_errors = 0
opcache.preload_user = root
opcache.preload = /app/config/preload.php
realpath_cache_size = 4096K
realpath_cache_ttl = 600
opcache.interned_strings_buffer = 16
opcache.max_accelerated_files = 20000
opcache.memory_consumption = 256
opcache.enable_file_override = 1
opcache.validate_timestamps = 0
variables_order = EGPCS
Kubernetes cluster environment:
- PostgreSQL Operator: https://cloudnative-pg.io/
- RabbitMQ Operator: https://www.rabbitmq.com/kubernetes/operator/using-operator/
PostgreSQL deployment:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: db-cluster
spec:
  instances: 3
  storage:
    storageClass: longhorn-static
    size: 10Gi
  walStorage:
    storageClass: longhorn-static
    size: 5Gi
RabbitMQ deployment:
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
  labels:
    app.kubernetes.io/instance: rabbitmq
spec:
  replicas: 3
  persistence:
    storageClassName: longhorn-static
    storage: 20Gi
  service:
    type: LoadBalancer
  resources:
    requests:
      memory: 2Gi
    limits:
      memory: 2Gi
Application Environment Config
We use a ConfigMap to inject application-level environment variables:
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-configmap
data:
  APP_ENV: dev
  APP_ENVIRONMENT: development
  LOCK_DSN: semaphore
  CORS_ALLOW_ORIGIN: '^https?://(localhost|127\.0\.0\.1)(:[0-9]+)?$'
  DATABASE_HOST: db-cluster-rw
  DATABASE_RO_HOST: db-cluster-ro
  RABBITMQ_HOST: rabbitmq
  [...]
These values are correctly visible within the pod via printenv, for example:
DATABASE_HOST=db-cluster-rw
DATABASE_RO_HOST=db-cluster-ro
RABBITMQ_HOST=rabbitmq
Network Resolution
DNS resolution seems to work fine even from within the pod:
/app # ping db-cluster-rw
PING db-cluster-rw (10.96.240.237): 56 data bytes
This refers to a standard Kubernetes service endpoint.
Expected Behavior
In worker mode, we expect the application to properly recognize updated service endpoints (e.g., whenever we kill/delete a replica and a service/endpoint IP changes), in line with what works in non-worker mode. Currently, these endpoint changes are not being picked up, and simply result in:
An exception occurred while executing a query: SQLSTATE[HY000]: General error: 7 no connection to the server
The issue remains unresolved until the FrankenPHP pod is killed or restarted.
Request for help
Could you confirm whether this is:
- A bug in FrankenPHP worker mode
- A configuration mistake on our side
- A known limitation (and if so, is there a recommended workaround?)
We’re happy to test any patches, experimental flags, or configuration tweaks that might help isolate or resolve this issue.
Thanks in advance for your support and for the amazing work on FrankenPHP!
Not an expert, but did you try to reproduce the issue with a very small application not based on any framework? Maybe the limitation lies more in the framework not reading the env variables again than in FrankenPHP itself. Maybe try something like the following as your worker script:
public/index.php
<?php
// public/index.php

// Prevent worker script termination when a client connection is interrupted
ignore_user_abort(true);

$handler = static function () {
    echo getenv('DATABASE_HOST');
};

$maxRequests = (int) ($_SERVER['MAX_REQUESTS'] ?? 0);
for ($nbRequests = 0; !$maxRequests || $nbRequests < $maxRequests; ++$nbRequests) {
    $keepRunning = \frankenphp_handle_request($handler);

    // Call the garbage collector to reduce the chances of it being triggered in the middle of a page generation
    gc_collect_cycles();

    if (!$keepRunning) break;
}
If the above script returns the correct env var, the issue is not in FrankenPHP but rather in the framework you use to build the application.
PS: What should also work for you is to gracefully restart the workers, as documented at https://frankenphp.dev/docs/worker/#restart-workers-manually, via the Caddy admin API:
curl -X POST http://localhost:2019/frankenphp/workers/restart
after the env variables have been updated/changed.
Thank you, @alexander-schranz, for your response.
In this case, we are using Symfony 7.2, and even with the most minimal application setup, the issue still persists.
To clarify further: the problematic environment variables are hostnames (e.g. DATABASE_HOST='db-cluster-rw'). In the index.php example you provided, the environment variable itself never changes and will always remain db-cluster-rw, which is expected; the hostname stored in the variable is stable.
The issue arises when the IP address behind the hostname changes within the Kubernetes cluster, for example, from 10.0.0.1 to 10.0.0.3. In such cases, FrankenPHP does not appear to resolve or reconnect to the updated IP, continuing to use the outdated one.
Interestingly, if we create a demo script using gethostbyname(getenv('DATABASE_HOST')), the hostname resolves correctly to the updated IP. So DNS resolution itself works as expected inside the container.
To be clear: the environment variable value remains unchanged (which is fine), but FrankenPHP does not seem to re-resolve the hostname to its current IP after the endpoint updates, which leads to stale or broken connections.
To help illustrate the issue more clearly, I’ll write a simple reproducible PHP script that demonstrates the behavior and will share it here tomorrow.
Thanks again for the help, I really appreciate it! OT: Also, many thanks for your work on Sulu, we truly love it.
Interestingly, if we create a demo script using gethostbyname(getenv('DATABASE_HOST')), the hostname resolves correctly to the updated IP. So DNS resolution itself works as expected inside the container.
That sounds like it’s more of an application or framework issue if gethostbyname(getenv('DATABASE_HOST')) works as expected. Still, there could be DNS caches inside curl, PHP, or a PHP extension, which may be the cause.
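For example, if your application talks to anything over HTTP via curl, each curl handle keeps its own DNS cache (60 seconds by default). A minimal sketch to rule that out — the RabbitMQ management API URL, port, and option values below are purely illustrative:

<?php
// Hypothetical stand-alone check, not part of the original report:
// disable curl's per-handle DNS cache so a reused handle re-resolves the
// Kubernetes service name on every request, and fail fast on a dead IP.
$ch = curl_init('http://' . getenv('RABBITMQ_HOST') . ':15672/api/overview');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER    => true,
    CURLOPT_DNS_CACHE_TIMEOUT => 0, // 0 disables libcurl's DNS cache entirely
    CURLOPT_CONNECTTIMEOUT    => 2, // give up quickly if the target is gone
]);
var_dump(curl_exec($ch), curl_error($ch));
curl_close($ch);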
Does the workers/restart resolve it?
Try a non-Symfony application, such as the simple script above, without any third-party code, to make sure the issue is not somewhere in the framework or third-party code.
After extensive troubleshooting and analysis, we have concluded that the issue appears to be rooted more deeply in the PHP runtime itself, or specifically in its integration within php-runtime/frankenphp-symfony, rather than being a FrankenPHP problem.
Our conclusion:
When worker mode is enabled in FrankenPHP, the initial database connection is established once by the worker and stored in the global runtime thread and memory. This connection is then shared across request handler threads to improve performance.
While this is effective under normal conditions, it introduces a critical failure case in dynamic containerized environments like Kubernetes:
1. The database connection is a long-lived socket between the FrankenPHP worker and the database pod.
2. If the database pod fails or is rescheduled, the socket is closed on the DB side but remains open and unacknowledged on the PHP (client) side.
3. This leaves a dead TCP socket: still open from the runtime’s perspective but disconnected on the DB side.
4. Because this connection resides in the shared global context, all request threads attempt to reuse it and end up waiting indefinitely for a response that will never arrive.
5. Internally, the global thread continues to write to this socket, unaware that it has become a dead channel.
6. As a result, the global worker thread becomes unresponsive, holding onto requests without returning control to the handler threads.
7. New incoming requests are routed to other (possibly pre-heated) workers, which eventually suffer the same fate, leading to a cascade of hanging processes across the worker pool.
8. On the client side, this manifests as timeout errors in the browser, since the HTTP connection breaks due to the lack of a response.
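To make the failure mode above concrete, here is a rough per-request revalidation sketch. This is purely our own idea of a possible workaround, not something FrankenPHP or the runtime provides, and we have not verified it in production; note also that the SELECT 1 probe itself can block until the kernel gives up on the dead socket:

<?php
// Sketch only: keep the connection across requests, but re-validate it at the
// start of every request and rebuild it when the probe fails.
ignore_user_abort(true);

$connect = static function (): PDO {
    $dsn = sprintf('pgsql:host=%s;port=%s;dbname=%s', getenv('DATABASE_HOST'), '5432', getenv('DATABASE_NAME'));
    return new PDO($dsn, getenv('DATABASE_USERNAME'), getenv('DATABASE_PASSWORD'), [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
};

$pdo = $connect();

$handler = static function () use (&$pdo, $connect) {
    try {
        $pdo->query('SELECT 1'); // cheap liveness probe; throws once the socket errors out
    } catch (PDOException $e) {
        $pdo = $connect();       // connection is dead: rebuild it before handling the request
    }
    // ... handle the actual request with a known-good $pdo ...
};

$maxRequests = (int) ($_SERVER['MAX_REQUESTS'] ?? 0);
for ($nbRequests = 0; !$maxRequests || $nbRequests < $maxRequests; ++$nbRequests) {
    $keepRunning = \frankenphp_handle_request($handler);
    gc_collect_cycles();
    if (!$keepRunning) break;
}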
Verification with FRANKENPHP_LOOP_MAX
To verify our conclusion, we tested the behavior using the environment variable:
FRANKENPHP_LOOP_MAX=1
This forces each worker to restart after every single request, thereby fully resetting the PHP runtime, including its global context and database connection. With this setting in place, the issue no longer occurs, which strongly supports the theory that stale or broken socket connections persist across requests when workers are long-lived.
This also confirms that restarting the worker, as suggested earlier by @alexander-schranz, effectively mitigates the problem by clearing out the global state and reinitializing the connection.
Additional verification using a simple index.php
As per your suggestion, @alexander-schranz, we also tested the following standalone PHP script, which explicitly sets up a new PostgreSQL connection for each request and does not rely on the global PHP runtime:
<?php
// public/index.php

ignore_user_abort(true);

$handler = static function () {
    echo "<html>";
    echo "DATABASE_HOST=" . getenv('DATABASE_HOST') . "<br>";
    echo "DATABASE_USERNAME=" . getenv('DATABASE_USERNAME') . "<br>";
    echo "DATABASE_PASSWORD=" . getenv('DATABASE_PASSWORD') . "<br>";
    echo "DATABASE_DBNAME=" . getenv('DATABASE_DBNAME') . "<br>";
    echo "Hostname: " . gethostbyname(getenv('DATABASE_HOST')) . "<br><br>"; // Kubernetes service IP (stable)

    try {
        $dsn = sprintf("pgsql:host=%s;port=%s;dbname=%s", getenv('DATABASE_HOST'), '5432', getenv('DATABASE_DBNAME'));
        $pdo = new PDO($dsn, getenv('DATABASE_USERNAME'), getenv('DATABASE_PASSWORD'), [
            PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION
        ]);
        echo "Connected successfully to PostgreSQL server at " . gethostbyname(getenv('DATABASE_HOST')) . "<br>";

        $stmt = $pdo->query("SELECT inet_client_addr(), inet_server_addr()");
        $result = $stmt->fetch(PDO::FETCH_ASSOC);
        echo "Server IP: " . $result['inet_server_addr'] . "<br>"; // Actual DB pod IP (changes)

        $stmt = $pdo->query('SELECT version()');
        $version = $stmt->fetchColumn();
        echo "PostgreSQL version: $version<br>";
    } catch (PDOException $e) {
        echo "Connection failed: " . $e->getMessage() . "<br>";
    }

    echo "</html>";
};

$maxRequests = (int) ($_SERVER['MAX_REQUESTS'] ?? 0);
for ($nbRequests = 0; !$maxRequests || $nbRequests < $maxRequests; ++$nbRequests) {
    $keepRunning = \frankenphp_handle_request($handler);
    gc_collect_cycles();
    if (!$keepRunning) break;
}
This script confirms the root issue: when each request creates a fresh connection, everything works correctly, even across DB pod restarts. This reinforces the conclusion that the problem stems from long-lived connections retained in the shared global runtime, which are not revalidated when the backend pod disappears or changes IP.
Request for help
We’d greatly appreciate help to verify whether the above assumptions are accurate and whether this behavior is expected or avoidable in the current architecture. Any confirmation, suggested workarounds, or deeper insight would be very helpful.
Once again, perhaps this is ultimately more of a PHP runtime issue, or related to the php-runtime/frankenphp-symfony integration, but in practice, it manifests as a FrankenPHP issue due to the worker model and global runtime behavior.
This sounds like the runtime and FrankenPHP behave exactly as desired and documented. It's your code that doesn't handle worker mode properly.
Seems related to #290.
Each thread should have their own DB connection. Do you have the same issue if you re-use the connection in the simple index.php case (something like this):
<?php
// public/index.php

ignore_user_abort(true);

$dsn = sprintf("pgsql:host=%s;port=%s;dbname=%s", getenv('DATABASE_HOST'), '5432', getenv('DATABASE_DBNAME'));
$pdo = new PDO($dsn, getenv('DATABASE_USERNAME'), getenv('DATABASE_PASSWORD'), [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION
]);

$handler = static function () use ($pdo) {
    echo "<html>";
    echo "DATABASE_HOST=" . getenv('DATABASE_HOST') . "<br>";
    echo "DATABASE_USERNAME=" . getenv('DATABASE_USERNAME') . "<br>";
    echo "DATABASE_PASSWORD=" . getenv('DATABASE_PASSWORD') . "<br>";
    echo "DATABASE_DBNAME=" . getenv('DATABASE_DBNAME') . "<br>";
    echo "Hostname: " . gethostbyname(getenv('DATABASE_HOST')) . "<br><br>"; // Kubernetes service IP (stable)

    try {
        echo "Connected successfully to PostgreSQL server at " . gethostbyname(getenv('DATABASE_HOST')) . "<br>";

        $stmt = $pdo->query("SELECT inet_client_addr(), inet_server_addr()");
        $result = $stmt->fetch(PDO::FETCH_ASSOC);
        echo "Server IP: " . $result['inet_server_addr'] . "<br>"; // Actual DB pod IP (changes)

        $stmt = $pdo->query('SELECT version()');
        $version = $stmt->fetchColumn();
        echo "PostgreSQL version: $version<br>";
    } catch (PDOException $e) {
        echo "Connection failed: " . $e->getMessage() . "<br>";
    }

    echo "</html>";
};

$maxRequests = (int) ($_SERVER['MAX_REQUESTS'] ?? 0);
for ($nbRequests = 0; !$maxRequests || $nbRequests < $maxRequests; ++$nbRequests) {
    $keepRunning = \frankenphp_handle_request($handler);
    gc_collect_cycles();
    if (!$keepRunning) break;
}
Heh; this sounds like you (re)discovered how network sockets work. There is no notification when the other side of a socket simply disappears; it is up to the application/server to gracefully close connections.
That being said, it sounds like your database configuration in Kubernetes is not ideal. Generally speaking, databases shouldn’t change their IP addresses during operation, yet when you kill/reschedule a database pod, that is exactly what happens. The database can’t gracefully handle this: your packets leave your application and go into a black hole, and there is nothing to tell the application that the socket has been reset once the new pod comes back up.
Ideally, when using a stateful service, like a database, you should use proper IPAM to ensure that your database acts like it is designed to (such as having a stable IP address).
That being said, I was assuming you were using a headless service to get the IP address of the pod that gets rescheduled. If that is the case, another solution could be to use the ClusterIP so that the IP address remains stable. This has a small overhead, but will handle the connection reset when the pod shuts down. If you are using something like Cilium, you can reroute the traffic to the nearest pod and get a direct connection anyway.
@AlliBalliBaba Thanks for your help! You’re right! Using your example results in the same error on our end: a stale connection that isn’t restored until FrankenPHP is restarted.
Script used for testing:
<?php
// public/index.php

ignore_user_abort(true);

$dsn = sprintf("pgsql:host=%s;port=%s;dbname=%s", getenv('DATABASE_HOST'), '5432', getenv('DATABASE_NAME'));
$pdo = new PDO($dsn, getenv('DATABASE_USERNAME'), getenv('DATABASE_PASSWORD'), [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION
]);

$handler = static function () use ($pdo) {
    echo "<html>";
    echo "DATABASE_HOST=" . getenv('DATABASE_HOST') . "<br>";
    echo "DATABASE_USERNAME=" . getenv('DATABASE_USERNAME') . "<br>";
    echo "DATABASE_PASSWORD=" . getenv('DATABASE_PASSWORD') . "<br>";
    echo "DATABASE_NAME=" . getenv('DATABASE_NAME') . "<br>";
    echo "Hostname: " . gethostbyname(getenv('DATABASE_HOST')) . "<br><br>"; // Kubernetes Service / ClusterIP (stable)

    try {
        echo "Connected successfully to PostgreSQL server at " . gethostbyname(getenv('DATABASE_HOST')) . "<br>";

        $stmt = $pdo->query("SELECT inet_client_addr(), inet_server_addr()");
        $result = $stmt->fetch(PDO::FETCH_ASSOC);
        echo "Server IP: " . $result['inet_server_addr'] . "<br>"; // Actual DB pod IP (changes)

        $stmt = $pdo->query('SELECT version()');
        $version = $stmt->fetchColumn();
        echo "PostgreSQL version: $version<br>";
    } catch (PDOException $e) {
        echo "Connection failed: " . $e->getMessage() . "<br>";
    }

    echo "</html>";
};

$maxRequests = (int) ($_SERVER['MAX_REQUESTS'] ?? 0);
for ($nbRequests = 0; !$maxRequests || $nbRequests < $maxRequests; ++$nbRequests) {
    $keepRunning = \frankenphp_handle_request($handler);
    gc_collect_cycles();
    if (!$keepRunning) break;
}
@withinboredom Thanks for your help! We’re indeed connecting to a Kubernetes Service with a ClusterIP. In our case, we always connect to db-cluster-rw, which consistently resolves to 10.96.240.237. The ClusterIP remains unchanged even when the database pods fail and are rescheduled (which is good).
The Kubernetes Service yaml (directly fetched from the cluster):
apiVersion: v1
kind: Service
metadata:
  annotations:
    cnpg.io/operatorVersion: 1.25.0
  creationTimestamp: "2025-04-09T09:37:00Z"
  labels:
    cnpg.io/cluster: db-cluster
  name: db-cluster-rw
  namespace: claim360
  ownerReferences:
  - apiVersion: postgresql.cnpg.io/v1
    controller: true
    kind: Cluster
    name: db-cluster
    uid: a1445632-0b9c-4666-8243-2911880697a8
  resourceVersion: "28524881"
  uid: 9d3fa245-c228-4d9f-aedb-48b97017dd5e
spec:
  clusterIP: 10.96.240.237
  clusterIPs:
  - 10.96.240.237
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: postgres
    port: 5432
    protocol: TCP
    targetPort: 5432
  selector:
    cnpg.io/cluster: db-cluster
    cnpg.io/instanceRole: primary
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
Hmmm. What are you using for CNI? As soon as you send a packet to that IP and nothing is listening, your CNI should be resetting the connection so your application knows the connection is closed. BUT, this also depends on your kernel settings.
Digging up some notes from running postgres way back in the day and double-checking with chatgpt to make sure they're still relevant in 2025:
- net.ipv4.tcp_keepalive_time=15: send the first TCP keepalive after 15 seconds instead of 3 hours.
- net.ipv4.tcp_keepalive_intvl=3: set the interval between keepalives to 3 seconds instead of nearly 3 minutes.
- net.ipv4.tcp_keepalive_probes=3: mark the socket as dead after 3 failures.
- net.ipv4.tcp_retries2=5: give up on sending packets after 5 tries (~5 minutes).
Set these on the client machines. Note that this will probably affect your entire cluster and increase general inter-node traffic (but only when idle); in return, when a socket dies, you won’t have to wait 3+ hours for the OS to kill an idle socket, or ~15 minutes on an active one.
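As an application-level complement, you can also ask libpq itself for aggressive keepalives on the database connection. This is a sketch under the assumption that PDO_PGSQL forwards these libpq parameters through the DSN unchanged (double-check on your build); the values simply mirror the sysctls above:

<?php
// Sketch: per-connection TCP keepalives via libpq connection parameters,
// instead of (or in addition to) cluster-wide sysctls.
$dsn = sprintf(
    'pgsql:host=%s;port=5432;dbname=%s;connect_timeout=3;'
    . 'keepalives=1;keepalives_idle=15;keepalives_interval=3;keepalives_count=3',
    getenv('DATABASE_HOST'),
    getenv('DATABASE_NAME')
);
$pdo = new PDO($dsn, getenv('DATABASE_USERNAME'), getenv('DATABASE_PASSWORD'), [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

Keepalives only cover idle sockets; if your client libpq is recent enough (PostgreSQL 12+), it also accepts a tcp_user_timeout parameter (in milliseconds) that bounds how long an in-flight write waits before the connection is treated as dead.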