Cannot read logs from consumer cluster
Is there an existing issue for this?
- [x] I have searched the existing issues
Version
v1.0.1
What happened?
I peered two clusters:
- Consumer cluster: k3s cluster bootstrapped on Hetzner via https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner
- Provider cluster: private k3s cluster running on a Tailscale network, no public IPs

All nodes (from both the consumer and the provider cluster) are also members of the same Tailscale network.
The clusters were peered with:
# 100.118.118.27 is the Tailscale IP of the control-plane node in the consumer cluster
liqoctl peer \
  --remote-kubeconfig "$KUBECONFIG_CONSUMER" \
  --gw-server-service-location Consumer \
  --gw-server-service-type NodePort \
  --gw-server-service-port 51840 \
  --gw-server-service-nodeport 32050 \
  --gw-client-address 100.118.118.27 \
  --gw-client-port 32050
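For reference, a minimal way to double-check that the peering itself is established (assuming Liqo v1.x, where the peering state is exposed through the ForeignCluster resource; output omitted here):

kubectl --context consumer get foreignclusters
kubectl --context provider get foreignclusters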
I can schedule workloads successfully in the consumer cluster and they run on the provider cluster:
kubectl --context consumer get pods
NAME READY STATUS RESTARTS AGE
nvidia-smi-test-arm64 0/1 Completed 0 9m41s
kubectl --context provider -n default-sparkling-dust get pods
NAME READY STATUS RESTARTS AGE
nvidia-smi-test-arm64 0/1 Completed 0 9m57s
I can also retrieve the logs from the provider cluster:
kubectl --context provider -n default-sparkling-dust logs -f nvidia-smi-test-arm64
Mon Aug 11 10:51:58 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 540.4.0 Driver Version: 540.4.0 CUDA Version: 12.6 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Orin (nvgpu) N/A | N/A N/A | N/A |
| N/A N/A N/A N/A / N/A | Not Supported | N/A N/A |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
But when trying to retrieve the logs via the consumer cluster:
kubectl --context consumer logs -f nvidia-smi-test-arm64
Error from server: Get "https://10.42.1.14:10250/containerLogs/default/nvidia-smi-test-arm64/nvidia-smi-container?follow=true": proxy error from 127.0.0.1:6443 while dialing 10.42.1.14:10250, code 502: 502 Bad Gateway
I searched on Slack and there are at least two other people who have the same issue.
Any help / guidance would be very much appreciated!
How can we reproduce the issue?
Not sure.
Provider or distribution
k3s
CNI version
flannel
Kernel Version
No response
Kubernetes Version
1.31.11+k3s1
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
Hi @maaft, this is a known issue affecting both K3s and RKE2, which share a similar architecture that differs from standard Kubernetes. As I understand it, in the K3s architecture each agent establishes a WebSocket connection to the server via a component called remotedialer. This WebSocket connection is used both to register the agent with the server and to tunnel requests to each agent.
When a kubectl logs or kubectl exec request is issued, it needs to be forwarded to the agent of the node where the pod is running. However, since the Liqo Virtual Kubelet establishes no such WebSocket connection, the request cannot be forwarded: the tunnel toward that node is missing.
As a result, the following error appears in the server logs:
time="2025-08-11T13:40:50Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:57318: failed to find Session for client test02"
E0811 13:40:50.374249 88 status.go:71] apiserver received an error that is not an metav1.Status: &url.Error{Op:"Get", URL:"https://10.42.0.26:10250/containerLogs/demo/server/server", Err:(*errors.errorString)(0xc018cedcb0)}: Get "https://10.42.0.26:10250/containerLogs/demo/server/server": proxy error from 127.0.0.1:6443 while dialing 10.42.0.26:10250, code 502: 502 Bad Gateway
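As a quick check for anyone wondering whether they are hitting the same problem, this symptom can be looked for in the K3s server logs (a sketch assuming a systemd-managed K3s server; the unit name and exact log wording may vary by version):

journalctl -u k3s | grep -i "failed to find session"

A match here points to the missing tunnel described above.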
To fix this issue, I believe some changes to the Liqo Virtual Kubelet are required to explicitly support this mechanism, which I don't think is a trivial change.
Hi @claudiolor, thanks for replying.
To my knowledge, this could also become the case with other Kubernetes implementations starting with 1.31: https://kubernetes.io/blog/2024/08/20/websockets-transition/
It sounds like Liqo should definitely come up with a general solution for this anyway, right?
Hi @maaft, the change you linked is about the migration from SPDY to WebSockets for the data streaming between the API server and the kubectl client.
The problem here is different: in K3s/RKE2 the API server seems to rely on this remotedialer WebSocket tunnel to forward requests to the right node, which is a custom mechanism used by those distributions. So it cannot be addressed with a general solution; instead, I think it requires a K3s/RKE2-specific implementation to support this mechanism in Liqo. Some work is needed here to understand more deeply how this mechanism works and how to integrate this flow with the Virtual Kubelet and Liqo.
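For the record, a rough way to see the mechanism in question on a regular (non-virtual) agent node is to look at the tunnel the agent opens toward the server at startup. This is only an illustration of the K3s behaviour, not Liqo code, and it assumes a systemd-managed k3s agent; unit names and log wording vary by version:

# On a regular agent node: the agent registers a WebSocket tunnel with the server
journalctl -u k3s-agent | grep -i "connecting to proxy"

# The long-lived connection toward the server (port 6443 by default) backing that tunnel
ss -tnp | grep 6443

The Liqo virtual node has no agent performing this registration, which is why the server cannot find a session for it and answers with the 502 shown above.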