Cannot read logs from consumer cluster
Is there an existing issue for this?
- [x] I have searched the existing issues
Version
v1.0.1
What happened?
I peered two clusters:
- Consumer cluster: k3s cluster bootstrapped on Hetzner via https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner
- Provider cluster: private k3s cluster running on a Tailscale network, no public IPs

All nodes (from both the consumer and the provider cluster) are also members of the same Tailscale network.
The clusters were peered with:
# 100.118.118.27 is the Tailscale IP of the control-plane node in the consumer cluster
liqoctl peer \
  --remote-kubeconfig "$KUBECONFIG_CONSUMER" \
  --gw-server-service-location Consumer \
  --gw-server-service-type NodePort \
  --gw-server-service-port 51840 \
  --gw-server-service-nodeport 32050 \
  --gw-client-address 100.118.118.27 \
  --gw-client-port 32050
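For reference, a minimal way to double-check that the peering itself is established (assuming Liqo v1.x, where the peering state is exposed through the ForeignCluster resource; output omitted here):

kubectl --context consumer get foreignclusters
kubectl --context provider get foreignclusters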
I can schedule workloads successfully in the consumer cluster and they run on the provider cluster:
kubectl --context consumer get pods
NAME READY STATUS RESTARTS AGE
nvidia-smi-test-arm64 0/1 Completed 0 9m41s
kubectl --context provider -n default-sparkling-dust get pods
NAME READY STATUS RESTARTS AGE
nvidia-smi-test-arm64 0/1 Completed 0 9m57s
I can also retrieve the logs from the provider cluster:
kubectl --context provider -n default-sparkling-dust logs -f nvidia-smi-test-arm64
Mon Aug 11 10:51:58 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 540.4.0 Driver Version: 540.4.0 CUDA Version: 12.6 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Orin (nvgpu) N/A | N/A N/A | N/A |
| N/A N/A N/A N/A / N/A | Not Supported | N/A N/A |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
But when trying to retrieve the logs via the consumer cluster:
kubectl --context consumer logs -f nvidia-smi-test-arm64
Error from server: Get "https://10.42.1.14:10250/containerLogs/default/nvidia-smi-test-arm64/nvidia-smi-container?follow=true": proxy error from 127.0.0.1:6443 while dialing 10.42.1.14:10250, code 502: 502 Bad Gateway
I searched on Slack and there are at least two other people who have the same issue.
Any help / guidance would be very much appreciated!
How can we reproduce the issue?
Not sure.
Provider or distribution
k3s
CNI version
flannel
Kernel Version
No response
Kubernetes Version
1.31.11+k3s1
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
Hi @maaft, this is a known issue affecting both K3s and RKE2, which share a similar architecture that differs from standard Kubernetes. As I understand it, in the K3s architecture each agent establishes a WebSocket connection to the server via a component called remotedialer. This WebSocket connection is used both to register the agent with the server and to tunnel requests to each agent.
When a kubectl logs or kubectl exec request is issued, it needs to be forwarded to the agent of the node where the pod is running. However, since the Liqo Virtual Kubelet establishes no such WebSocket connection, the request cannot be forwarded: the tunnel toward that node is missing.
As a result, the following error appears in the server logs:
time="2025-08-11T13:40:50Z" level=error msg="Sending HTTP 502 response to 127.0.0.1:57318: failed to find Session for client test02"
E0811 13:40:50.374249 88 status.go:71] apiserver received an error that is not an metav1.Status: &url.Error{Op:"Get", URL:"https://10.42.0.26:10250/containerLogs/demo/server/server", Err:(*errors.errorString)(0xc018cedcb0)}: Get "https://10.42.0.26:10250/containerLogs/demo/server/server": proxy error from 127.0.0.1:6443 while dialing 10.42.0.26:10250, code 502: 502 Bad Gateway
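As a quick check for anyone wondering whether they are hitting the same problem, this symptom can be looked for in the K3s server logs (a sketch assuming a systemd-managed K3s server; the unit name and exact log wording may vary by version):

journalctl -u k3s | grep -i "failed to find session"

A match here points to the missing tunnel described above.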
To fix this issue, I believe some changes to the Liqo Virtual Kubelet are required to explicitly support this mechanism, which I don't think is a trivial change.
Hi @claudiolor, thanks for replying.
To my knowledge, this could also become the case with other Kubernetes implementations starting with 1.31: https://kubernetes.io/blog/2024/08/20/websockets-transition/
It sounds like Liqo should definitely come up with a general solution for this anyway, right?
Hi @maaft, the change you linked is about the migration from SPDY to WebSockets for the data streaming between the API server and the kubectl client.
The problem here is different: in K3s/RKE2 the API server seems to rely on this remotedialer WebSocket tunnel to forward requests to the right node, which is a custom mechanism used by those distributions. So it cannot be addressed with a general solution; instead, I think it requires a K3s/RKE2-specific implementation to support this mechanism in Liqo. Some work is needed here to understand more deeply how this mechanism works and how to integrate this flow with the Virtual Kubelet and Liqo.
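For the record, a rough way to see the mechanism in question on a regular (non-virtual) agent node is to look at the tunnel the agent opens toward the server at startup. This is only an illustration of the K3s behaviour, not Liqo code, and it assumes a systemd-managed k3s agent; unit names and log wording vary by version:

# On a regular agent node: the agent registers a WebSocket tunnel with the server
journalctl -u k3s-agent | grep -i "connecting to proxy"

# The long-lived connection toward the server (port 6443 by default) backing that tunnel
ss -tnp | grep 6443

The Liqo virtual node has no agent performing this registration, which is why the server cannot find a session for it and answers with the 502 shown above.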