Unable to connect to Kubernetes API server after namespace offloading
What happened:
We are unable to connect to the Kubernetes API server of the host cluster from the member cluster using the fabric8 KubernetesClient Java API; we get a read timeout error.
Error:
Kubernetes error during API access check, io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.. Caused by: io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.. Caused by: java.net.SocketTimeoutException: Read timed out.
To test this we are using the following curl command:
curl https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}/openapi/v2 --header "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
This was working earlier but stopped with the latest version; we also tried v0.5.4 and hit the same issue.
What you expected to happen:
To be able to connect to the Kubernetes API server of the host cluster from the member cluster.
How to reproduce it (as minimally and precisely as possible):
- Create two AKS clusters (host and member).
- Offload two namespaces from the host to the member cluster: namespace1 with podOffloadingStrategy set to Local and namespace2 with podOffloadingStrategy set to Remote (see the example commands after this list).
- Deploy a pod on the host cluster in namespace2.
- The pod tries to connect to the Kubernetes API server of the host cluster, which triggers the issue.
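For reference, the offloading step could look roughly like the following with liqoctl. This is only a sketch based on how we recall the Liqo v0.6.x CLI; flag names should be double-checked against the installed version:

liqoctl offload namespace namespace1 --pod-offloading-strategy Local
liqoctl offload namespace namespace2 --pod-offloading-strategy Remote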
Anything else we need to know?:
This works without namespace offloading: if the pod is deployed in another namespace (one that is not offloaded), it can reach the API server.
Environment:
- Liqo version: v0.6.0, v0.5.4
- Kubernetes version (use kubectl version): v1.23
- Cloud provider or hardware configuration: Azure
- Network plugin and version: Kubenet
- Install tools: liqoctl
- Others:
Hi @saushind-tibco, thanks for reporting this issue.
Just to confirm, does it work when you test it using curl or not? I would like to understand whether it is a more general issue or it is related to the Java client.
If it does not work with curl, could you please share the content of the KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT environment variables (of the offloaded pod), the content of its /etc/hosts file, and the output of kubectl get natmappings -o yaml (in the host cluster)?
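For instance, that information could be gathered with commands along these lines (pod and namespace names are placeholders):

# From the host cluster, against the offloaded pod
kubectl exec -n <offloaded-namespace> <offloaded-pod> -- env | grep '^KUBERNETES_SERVICE'
kubectl exec -n <offloaded-namespace> <offloaded-pod> -- cat /etc/hosts
# Still in the host cluster
kubectl get natmappings -o yaml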
Hi @giorio94, nope it didn't work with the curl command.
KUBERNETES_SERVICE_HOST= kubernetes.default or 10.244.0.22
KUBERNETES_SERVICE_PORT=443
Output of kubectl get natmappings -o yaml:
apiVersion: v1
items:
- apiVersion: net.liqo.io/v1alpha1
kind: NatMapping
metadata:
creationTimestamp: 2022-11-28T11:07:49Z
generateName: natmapping-
generation: 4
labels:
clusterID: xxxxxx-xxxx-xxxxx-xxxxxx-xxxx
net.liqo.io/natmapping: "true"
managedFields:
- apiVersion: net.liqo.io/v1alpha1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:generateName: {}
f:labels:
.: {}
f:clusterID: {}
f:net.liqo.io/natmapping: {}
f:spec:
.: {}
f:clusterID: {}
f:clusterMappings:
.: {}
f:10.0.0.1: {}
f:10.224.0.4: {}
f:169.254.0.1: {}
f:externalCIDR: {}
f:podCIDR: {}
manager: liqonet
operation: Update
time: 2022-11-28T12:11:09Z
name: natmapping-q6n2s
namespace: ""
resourceVersion: "107624"
uid: xxxxxx-xxxx-xxxxx-xxxxxx-xxxx
spec:
clusterID: xxxxxx-xxxx-xxxxx-xxxxxx-xxxx
clusterMappings:
10.0.0.1: 10.245.0.3
10.224.0.4: 10.245.0.2
169.254.0.1: 10.245.0.1
externalCIDR: 10.245.0.0/16
podCIDR: 10.241.0.0/16
kind: List
metadata:
resourceVersion: ""
selfLink: ""
This used to work with Liqo v0.3.0 and Kubernetes v1.21, but with the latest versions it is unable to connect.
The feature enabling offloaded pods to interact with the API server of the originating cluster was introduced in Liqo v0.5.0 and was not present in earlier versions (cf. https://github.com/liqotech/liqo/issues/1185 for more info). It involves the synchronization of service account tokens (currently supported only up to Kubernetes 1.23) and the configuration of host aliases and environment variables so that offloaded pods contact the originating API server rather than the one of the cluster they are currently running in.
Could you please also share the YAML of the offloaded pod, as seen in the target cluster (not the originating one)? The IP address set by Liqo for the kubernetes.default host alias should match one present in the natmapping resource (10.245.0.3 in your case, rather than 10.244.0.22).
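For instance, the two values could be compared along these lines (pod and namespace names are placeholders, and the offloaded namespace may carry a Liqo-generated suffix in the target cluster):

# In the target (member) cluster: host aliases injected by Liqo into the offloaded pod
kubectl get pod <offloaded-pod> -n <offloaded-namespace> -o jsonpath='{.spec.hostAliases}'
# In the host (origin) cluster: the remapped addresses, including the API server entry
kubectl get natmappings -o jsonpath='{range .items[*]}{.spec.clusterMappings}{"\n"}{end}'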
Are there any challenges in making this work on Kubernetes 1.24? If you point me in the right direction to debug, I can help.
apiVersion: v1
kind: Pod
metadata:
name: kafka-test1-jx22j-54dd9bb848-vhptr
namespace: testnamespace
uid: u6657-67bf-45b4-b4f7-956e2d4b0218
resourceVersion: '1182821'
creationTimestamp: '2022-11-28T12:13:52Z'
labels:
app: kafka-test1-jx22j
liqo.io/managed-by: shadowpod
pod-template-hash: 54dd9bb848
virtualkubelet.liqo.io/destination: 6e946a7a-fdd0-4ab3-abfb-155db35e9ae6
virtualkubelet.liqo.io/origin: ad6787-443b-4b11-b56b-6bebf37a4ece
ownerReferences:
- apiVersion: virtualkubelet.liqo.io/v1alpha1
kind: ShadowPod
name: kafka-test1-jx22j-54dd9bb848-vhptr
uid: af9e359d-9f88-440c-a331-0823b45ea1d1
controller: true
blockOwnerDeletion: true
managedFields:
- manager: liqo-controller-manager
operation: Update
apiVersion: v1
time: '2022-11-28T12:13:52Z'
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:labels:
.: {}
f:app: {}
f:liqo.io/managed-by: {}
f:pod-template-hash: {}
f:virtualkubelet.liqo.io/destination: {}
f:virtualkubelet.liqo.io/origin: {}
f:ownerReferences:
.: {}
k:{"uid":"af9e359d-9f88-440c-a331-0823b99ea1d1"}: {}
f:spec:
f:automountServiceAccountToken: {}
f:containers:
k:{"name":"kafka-test1-jx22j"}:
.: {}
f:env:
.: {}
k:{"name":"ADMIN_PASSWORD"}:
.: {}
f:name: {}
f:valueFrom:
.: {}
f:secretKeyRef: {}
k:{"name":"APPNAME"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"ELASTICSEARCH_PASSWORD"}:
.: {}
f:name: {}
f:valueFrom:
.: {}
f:secretKeyRef: {}
k:{"name":"KUBERNETES_PORT"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"KUBERNETES_PORT_443_TCP"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"KUBERNETES_PORT_443_TCP_ADDR"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"KUBERNETES_PORT_443_TCP_PORT"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"KUBERNETES_PORT_443_TCP_PROTO"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"KUBERNETES_SERVICE_HOST"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"KUBERNETES_SERVICE_PORT"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"POD_NAME"}:
.: {}
f:name: {}
f:valueFrom:
.: {}
f:fieldRef: {}
k:{"name":"POD_NAMESPACE"}:
.: {}
f:name: {}
f:valueFrom:
.: {}
f:fieldRef: {}
k:{"name":"test2_DEPLOYDIRECTORIES"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"test2_NODENAME"}:
.: {}
f:name: {}
f:value: {}
f:image: {}
f:imagePullPolicy: {}
f:lifecycle:
.: {}
f:preStop:
.: {}
f:exec:
.: {}
f:command: {}
f:livenessProbe:
.: {}
f:failureThreshold: {}
f:httpGet:
.: {}
f:path: {}
f:port: {}
f:scheme: {}
f:initialDelaySeconds: {}
f:periodSeconds: {}
f:successThreshold: {}
f:timeoutSeconds: {}
f:name: {}
f:ports:
.: {}
k:{"containerPort":8008,"protocol":"TCP"}:
.: {}
f:containerPort: {}
f:name: {}
f:protocol: {}
k:{"containerPort":8080,"protocol":"TCP"}:
.: {}
f:containerPort: {}
f:name: {}
f:protocol: {}
f:resources: {}
f:securityContext:
.: {}
f:readOnlyRootFilesystem: {}
f:runAsUser: {}
f:startupProbe:
.: {}
f:exec:
.: {}
f:command: {}
f:failureThreshold: {}
f:initialDelaySeconds: {}
f:periodSeconds: {}
f:successThreshold: {}
f:timeoutSeconds: {}
f:terminationMessagePath: {}
f:terminationMessagePolicy: {}
f:tty: {}
f:volumeMounts:
.: {}
k:{"mountPath":"/home/xxxxx"}:
.: {}
f:mountPath: {}
f:name: {}
k:{"mountPath":"/tmp"}:
.: {}
f:mountPath: {}
f:name: {}
k:{"mountPath":"/var/opt/xxxx/xxxxxx/classpath"}:
.: {}
f:mountPath: {}
f:name: {}
k:{"mountPath":"/var/opt/xxxx/xxxxxx/node"}:
.: {}
f:mountPath: {}
f:name: {}
k:{"mountPath":"/var/opt/xxxx/xxxxxx/abcconfig"}:
.: {}
f:mountPath: {}
f:name: {}
k:{"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}:
.: {}
f:mountPath: {}
f:name: {}
f:readOnly: {}
f:dnsPolicy: {}
f:enableServiceLinks: {}
f:hostAliases:
.: {}
k:{"ip":"10.245.0.3"}:
.: {}
f:hostnames: {}
f:ip: {}
f:restartPolicy: {}
f:schedulerName: {}
f:securityContext: {}
f:terminationGracePeriodSeconds: {}
f:tolerations: {}
f:volumes:
.: {}
k:{"name":"classpath"}:
.: {}
f:configMap:
.: {}
f:defaultMode: {}
f:name: {}
f:name: {}
k:{"name":"home"}:
.: {}
f:emptyDir: {}
f:name: {}
k:{"name":"kube-api-access-7mnqt"}:
.: {}
f:name: {}
f:projected:
.: {}
f:defaultMode: {}
f:sources: {}
k:{"name":"node"}:
.: {}
f:emptyDir: {}
f:name: {}
k:{"name":"abcconfig"}:
.: {}
f:configMap:
.: {}
f:defaultMode: {}
f:name: {}
f:name: {}
k:{"name":"tmp"}:
.: {}
f:emptyDir: {}
f:name: {}
- manager: kubelet
operation: Update
apiVersion: v1
time: '2022-11-28T12:13:54Z'
fieldsType: FieldsV1
fieldsV1:
f:status:
f:conditions:
k:{"type":"ContainersReady"}:
.: {}
f:lastProbeTime: {}
f:lastTransitionTime: {}
f:message: {}
f:reason: {}
f:status: {}
f:type: {}
k:{"type":"Initialized"}:
.: {}
f:lastProbeTime: {}
f:lastTransitionTime: {}
f:status: {}
f:type: {}
k:{"type":"Ready"}:
.: {}
f:lastProbeTime: {}
f:lastTransitionTime: {}
f:message: {}
f:reason: {}
f:status: {}
f:type: {}
f:containerStatuses: {}
f:hostIP: {}
f:phase: {}
f:podIP: {}
f:podIPs:
.: {}
k:{"ip":"10.244.0.22"}:
.: {}
f:ip: {}
f:startTime: {}
subresource: status
selfLink: /api/v1/namespaces/testnamespace/pods/kafka-test1-jx22j-54dd9bb848-vhptr
status:
phase: Running
conditions:
- type: Initialized
status: 'True'
lastProbeTime: null
lastTransitionTime: '2022-11-28T12:13:52Z'
- type: Ready
status: 'False'
lastProbeTime: null
lastTransitionTime: '2022-11-28T12:13:52Z'
reason: ContainersNotReady
message: 'containers with unready status: [kafka-test1-jx22j]'
- type: ContainersReady
status: 'False'
lastProbeTime: null
lastTransitionTime: '2022-11-28T12:13:52Z'
reason: ContainersNotReady
message: 'containers with unready status: [kafka-test1-jx22j]'
- type: PodScheduled
status: 'True'
lastProbeTime: null
lastTransitionTime: '2022-11-28T12:13:52Z'
hostIP: 10.224.0.4
podIP: 10.244.0.22
podIPs:
- ip: 10.244.0.22
startTime: '2022-11-28T12:13:52Z'
containerStatuses:
- name: kafka-test1-jx22j
state:
running:
startedAt: '2022-12-01T13:29:12Z'
lastState:
terminated:
exitCode: 137
reason: Error
startedAt: '2022-12-01T13:18:50Z'
finishedAt: '2022-12-01T13:29:11Z'
containerID: >-
containerd://06e0e80ba76329ddc5aa0363f746d9d2a76c4d1922f579ba
ready: false
restartCount: 424
image: hostcluster.azurecr.io/kafka-test1:1.3.0-SNAPSHOT
imageID: >-
hostcluster.azurecr.io/kafka-test1@sha256:0511819bf0b2e03475c57b93b9792a91947f65378755
containerID: >-
containerd://38a8689e586b438787538ca89478582898898754dc1855c851c887
started: false
qosClass: BestEffort
spec:
volumes:
- name: tmp
emptyDir: {}
- name: home
emptyDir: {}
- name: node
emptyDir: {}
- name: abcconfig
configMap:
name: kafka-test1-jx22j-flow
defaultMode: 420
- name: classpath
configMap:
name: kafka-test1-jx22j-classpath
defaultMode: 420
- name: kube-api-access-7mnqt
projected:
sources:
- secret:
name: default-token-btkq6
items:
- key: token
path: token
- configMap:
name: kube-root-ca.crt.ad6e0
items:
- key: ca.crt
path: ca.crt
- downwardAPI:
items:
- path: namespace
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
defaultMode: 420
containers:
- name: kafka-test1-jx22j
image: hostcluster.azurecr.io/kafka-test1:1.3
ports:
- name: admin-ui
containerPort: 8008
protocol: TCP
- name: ws
containerPort: 8080
protocol: TCP
env:
- name: POD_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
- name: APPNAME
value: kafka-test1-jx22j
- name: test2_NODENAME
value: $(POD_NAME).$(APPNAME)
- name: test2_DEPLOYDIRECTORIES
value: /var/opt/xxxx/xxxxxx/classpath
- name: ELASTICSEARCH_PASSWORD
valueFrom:
secretKeyRef:
name: elasticsearch-es-elastic
key: elastic
- name: ADMIN_PASSWORD
valueFrom:
secretKeyRef:
name: scoring-admin
key: admin
optional: true
- name: KUBERNETES_SERVICE_HOST
value: kubernetes.default
- name: KUBERNETES_SERVICE_PORT
value: '443'
- name: KUBERNETES_PORT
value: tcp://kubernetes.default:443
- name: KUBERNETES_PORT_443_TCP
value: tcp://kubernetes.default:443
- name: KUBERNETES_PORT_443_TCP_PROTO
value: tcp
- name: KUBERNETES_PORT_443_TCP_ADDR
value: kubernetes.default
- name: KUBERNETES_PORT_443_TCP_PORT
value: '443'
resources: {}
volumeMounts:
- name: tmp
mountPath: /tmp
- name: home
mountPath: /home/xxxxx
- name: node
mountPath: /var/opt/xxxx/xxxxxx/node
- name: abcconfig
mountPath: /var/opt/xxxx/xxxxxx/abcconfig
- name: classpath
mountPath: /var/opt/xxxx/xxxxxx/classpath
- name: kube-api-access-7mnqt
readOnly: true
mountPath: /var/run/secrets/kubernetes.io/serviceaccount
livenessProbe:
httpGet:
path: /healthcheck/v1/status
port: admin-ui
scheme: HTTP
initialDelaySeconds: 240
timeoutSeconds: 1
periodSeconds: 10
successThreshold: 1
failureThreshold: 3
startupProbe:
exec:
command:
- cat
- /tmp/healthy
initialDelaySeconds: 20
timeoutSeconds: 1
periodSeconds: 2
successThreshold: 1
failureThreshold: 300
lifecycle:
preStop:
exec:
command:
- /bin/sh
- '-c'
- /opt/xxxx/xxxxxx/stop-node
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
imagePullPolicy: Always
securityContext:
runAsUser: 1000
readOnlyRootFilesystem: true
tty: true
restartPolicy: Always
terminationGracePeriodSeconds: 1
dnsPolicy: ClusterFirst
serviceAccountName: default
serviceAccount: default
automountServiceAccountToken: false
nodeName: aks-nodepool1-14784209-vmss000000
securityContext: {}
schedulerName: default-scheduler
tolerations:
- key: node.kubernetes.io/not-ready
operator: Exists
effect: NoExecute
tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
operator: Exists
effect: NoExecute
tolerationSeconds: 300
hostAliases:
- ip: 10.245.0.3
hostnames:
- kubernetes.default
- kubernetes.default.svc
priority: 0
enableServiceLinks: false
preemptionPolicy: PreemptLowerPriority
@giorio94 here is the YAML of the offloaded pod. Note that we are offloading a namespace.
Are there any challenges in making this work on Kubernetes 1.24? If you point me in the right direction to debug, I can help.
The issue with Kubernetes 1.24 and above is that, by default, service accounts no longer generate the corresponding secret containing the authorization token; the token has to be requested through a dedicated API. We are currently working on supporting that, and it should be ready in the coming weeks.
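For reference, this is the mechanism being referred to: on Kubernetes 1.24 and above a service account token can be requested on demand through the TokenRequest API, for example via kubectl 1.24+:

kubectl create token default --namespace <namespace> --duration 24h

Alternatively, a long-lived token secret can still be created manually by applying a Secret of type kubernetes.io/service-account-token annotated with kubernetes.io/service-account.name.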
@saushind-tibco The pod specification seems correct, and configured to support contacting the home API server.
What is the output of curl -vk https://10.245.0.3 from the offloaded pod?
@giorio94 Here is the output for curl -vk https://10.245.0.3
$ curl -vk https://10.245.0.3
* Trying 10.245.0.3:443...
* Connected to 10.245.0.3 (10.245.0.3) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/ssl/certs/ca-certificates.crt
* CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* Operation timed out after 300497 milliseconds with 0 out of 0 bytes received
* Closing connection 0
curl: (28) Operation timed out after 300497 milliseconds with 0 out of 0 bytes received
Ok, this looks strange, since it seems that the TCP connection is established correctly, but then the TLS handshake does not complete. What if you start a pod in the origin cluster and perform curl -vk https://10.0.0.1?
$ curl -vk https://10.0.0.1
* Trying 10.0.0.1:443...
* Connected to 10.0.0.1 (10.0.0.1) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/ssl/certs/ca-certificates.crt
* CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use h2
* Server certificate:
* subject: CN=apiserver
* start date: Nov 28 05:39:02 2022 GMT
* expire date: Nov 28 05:49:02 2024 GMT
* issuer: CN=ca
* SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55eef256d2c0)
> GET / HTTP/2
> Host: 10.0.0.1
> user-agent: curl/7.74.0
> accept: */*
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
< HTTP/2 401
< audit-id: 7e79ad07-50b9-4179-b053-8a8bc08645867
< cache-control: no-cache, private
< content-type: application/json
< content-length: 157
< date: Fri, 02 Dec 2022 10:14:12 GMT
<
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {},
"status": "Failure",
"message": "Unauthorized",
"reason": "Unauthorized",
"code": 401
* Connection #0 to host 10.0.0.1 left intact
}
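For completeness, the 401 above only means that the unauthenticated request was rejected; the TLS handshake itself completed. An authenticated check from the same pod, reusing the mounted service account token, could look like this:

curl -sk https://10.0.0.1/version --header "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"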
I've recreated the setup on AKS, and I could reproduce the issue. From an initial investigation, the problem seems to be somewhat related to IP fragmentation. I'll try to look into this in more depth in the coming days.
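In case it helps, one rough way to check for fragmentation issues from the offloaded pod is to send pings with the DF bit set and increasing payload sizes. This assumes the container image ships ping from iputils, and ICMP may not be forwarded by the NAT in all setups, so treat the result only as an indication:

ping -c 3 -M do -s 1300 10.245.0.3
ping -c 3 -M do -s 1472 10.245.0.3
# If the larger size fails while the smaller one succeeds, the path MTU is lower than expected
# and the TLS handshake packets may be getting dropped.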
@giorio94 Thank you, please keep us posted
Hi @giorio94, any updates on this issue?
Not yet, sorry. These two weeks have been quite busy on my side. I'll take a detailed look soon.
@giorio94 Do we have any timeline for this fix?
@giorio94 Any update on this? Our future releases are stuck because of this issue. Is it possible to expedite this?
@agulhane-tibco Liqo is an open-source project, mainly funded through European research funds. For urgent feature requests and/or bug fixing, please consider buying a support ticket from our partner company.