
vCluster creating more trouble than helping (due to different causes)

Open MichaelKora opened this issue 1 year ago • 12 comments

What happened?

I honestly don't know if it is supposed to be this hard, but vcluster is creating more trouble than it solves. I've been working on getting a production-ready cluster for over a week and it's not working. Right now the cluster is up and running and I connect to it using a NodePort service. The issues:

  1. When I deploy the cluster, it takes over 60 minutes for the cluster pods to reach a healthy running state, so that I can communicate with the vcluster.
  2. The CoreDNS pod, though in state Running, is full of errors (a few diagnostic commands are sketched below the list):
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:231: Failed to watch *v1.Service: failed to list *v1.Service: Unauthorized

[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:231: failed to list *v1.Service: Unauthorized
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:231: Failed to watch *v1.Service: failed to list *v1.Service: Unauthorized

[INFO] plugin/kubernetes: Trace[944124959]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:231 (24-May-2024 11:55:04.692) (total time: 16726ms):
    Trace[944124959]: ---"Objects listed" error:<nil> 16726ms (11:55:21.419)
    Trace[944124959]: [16.726980037s] [16.726980037s] END

[INFO] plugin/kubernetes: Trace[1216864093]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:231 (24-May-2024 11:54:50.867) (total time: 30559ms):
    Trace[1216864093]: ---"Objects listed" error:<nil> 30559ms (11:55:21.426)
    Trace[1216864093]: [30.55934034s] [30.55934034s] END

[INFO] plugin/kubernetes: Trace[1087029931]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:231 (24-May-2024 11:54:49.127) (total time: 32304ms):
    Trace[1087029931]: ---"Objects listed" error:<nil> 32304ms (11:55:21.432)
    Trace[1087029931]: [32.304771091s] [32.304771091s] END
  3. When connected to the vcluster, the same request delivers different responses each time. E.g., running kubectl get namespaces might show 4 namespaces, then 4, and then 6, etc.

  4. Running Helm against the vcluster is nearly impossible; it times out nearly every single time.

This is all frustrating because I was expecting vCluster to be much easier to use.
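
A few commands that could help narrow down the CoreDNS "Unauthorized" errors from item 2. This is only a diagnostic sketch; the coredns deployment name and namespace are the usual defaults and may differ in a given setup:

$ # connect to the virtual cluster
$ vcluster connect my-vcluster -n vcluster-ns

$ # check whether the CoreDNS service account may list services at all
$ kubectl auth can-i list services --as=system:serviceaccount:kube-system:coredns

$ # inspect the logs, then restart CoreDNS to force a fresh service account token
$ kubectl -n kube-system logs deploy/coredns
$ kubectl -n kube-system rollout restart deploy/coredns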

What did you expect to happen?

I deployed a cluster and expected the CoreDNS pods to be deployed without errors.

How can we reproduce it (as minimally and precisely as possible)?

# vcluster.yaml
exportKubeConfig:
  context: "sharedpool-context"
controlPlane:
  coredns:
    enabled: true
    embedded: false
    deployment:
      replicas: 2
      nodeSelector:
        workload: wk1
  statefulSet:
    highAvailability:
      replicas: 2
    persistence:
      volumeClaim:
        enabled: true
    scheduling:
      nodeSelector:
        workload: wk1
    resources:
      limits:
        ephemeral-storage: 20Gi
        memory: 10Gi
      requests:
        ephemeral-storage: 200Mi
        cpu: 200m
        memory: 256Mi
    
  proxy:
    bindAddress: "0.0.0.0"
    port: 8443
    extraSANs:
      - XX.XX.XX.XXX
      - YY.YY.YY.YYY

$ helm upgrade -i my-vcluster vcluster \
  --repo https://charts.loft.sh \
  --namespace vcluster-ns --create-namespace \
  --repository-config='' \
  -f vcluster.yaml \
  --version 0.20.0-beta.5

Anything else we need to know?

I used a NodePort service to connect to the cluster:

# nodeport.yaml

apiVersion: v1
kind: Service
metadata:
  name: vcluster-nodeport
  namespace: vcluster-ns
spec:
  selector:
    app: vcluster
    release: shared-pool-vcluster
  ports:
    - name: https
      port: 443
      targetPort: 8443
      protocol: TCP
      nodePort: 31222
  type: NodePort
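
To reproduce the connection path, the CLI can also be pointed at that NodePort directly. A minimal sketch, assuming the Service above and one of the extraSANs addresses from vcluster.yaml (replace <node-ip> accordingly):

$ vcluster connect my-vcluster -n vcluster-ns --server=https://<node-ip>:31222

The failing Helm installs mentioned in the description look like this when run against the vcluster:
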
$ helm upgrade -i solr-operator apache-solr/solr-operator --version 0.8.1 -n solr-cloud

Release "solr-operator" does not exist. Installing it now.
Error: failed post-install: 1 error occurred:
        * timed out waiting for the condition


$ helm upgrade -i hz-operator hazelcast/hazelcast-platform-operator -n hz-vc-ns --create-namespace -f operator.yaml

Release "hz-operator" does not exist. Installing it now.
Error: 9 errors occurred:
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Internal error occurred: resource quota evaluation timed out
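
Raising Helm's client-side timeout (default 5 minutes) at least distinguishes a slow API server from a hung one. A sketch reusing the solr-operator install above; the 15m value is an arbitrary assumption:

$ helm upgrade -i solr-operator apache-solr/solr-operator --version 0.8.1 \
    -n solr-cloud --timeout 15m --wait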

Host cluster Kubernetes version

$ kubectl version
Client Version: v1.29.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.11

Host cluster Kubernetes distribution

vcluster version

$ vcluster --version
vcluster version 0.19.5

Vcluster Kubernetes distribution (k3s (default), k8s, k0s)

default (I did not specify a specific distribution)

OS and Arch

OS: talos
Arch: metal-amd64

MichaelKora avatar May 22 '24 09:05 MichaelKora

hey @MichaelKora, it's unfortunate that you have to experience these troubles.

One thing that I'd recommend is to use the latest vcluster CLI together with 0.20.0-beta.5. From the description it appears that you are using the 0.19.5 one instead.

Regarding the other issues: it's a bit hard to say from the outset what might be causing them. You seem to be leveraging Talos. What Kubernetes distro is running on top of it?

heiko-braun avatar Jun 03 '24 07:06 heiko-braun

hey @heiko-braun, 0.19.5 is the latest according to the vcluster CLI:

$ sudo vcluster upgrade
15:55:48 info Current binary is the latest version: 0.19.5

I have Talos running there (the default image); it's based on k3s.

MichaelKora avatar Jun 04 '24 14:06 MichaelKora

Hi @MichaelKora, you can get the latest CLI (the one to be used with the 0.20 vcluster.yaml) here: https://github.com/loft-sh/vcluster/releases/tag/v0.20.0-beta.6
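
For reference, one way to fetch that specific pre-release build; the asset name assumes a Linux amd64 machine and the naming scheme used on the releases page:

$ curl -L -o vcluster "https://github.com/loft-sh/vcluster/releases/download/v0.20.0-beta.6/vcluster-linux-amd64"
$ chmod +x vcluster && sudo mv vcluster /usr/local/bin/vcluster
$ vcluster --version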

heiko-braun avatar Jun 05 '24 09:06 heiko-braun

@MichaelKora regarding the Hazelcast and Solr examples in the description: did you run the commands against the host cluster or the virtual one?

heiko-braun avatar Jun 05 '24 09:06 heiko-braun

@heiko-braun thanks for your response. I ran the commands against the vcluster; when run against the host cluster, I have no issues.

MichaelKora avatar Jun 05 '24 11:06 MichaelKora

@heiko-braun when the cluster is being created, the logs show:

2024-06-05 14:38:37 INFO setup/controller_context.go:196 couldn't retrieve virtual cluster version (Get "https://127.0.0.1:6443/version": dial tcp 127.0.0.1:6443: connect: connection refused), will retry in 1 seconds {"component": "vcluster"}

2024-06-05 14:38:38 INFO setup/controller_context.go:196 couldn't retrieve virtual cluster version (Get "https://127.0.0.1:6443/version": dial tcp 127.0.0.1:6443: connect: connection refused), will retry in 1 seconds {"component": "vcluster"}

2024-06-05 14:38:39 INFO setup/controller_context.go:196 couldn't retrieve virtual cluster version (Get "https://127.0.0.1:6443/version": dial tcp 127.0.0.1:6443: connect: connection refused), will retry in 1 seconds {"component": "vcluster"}

2024-06-05 14:38:40 INFO commandwriter/commandwriter.go:126 error retrieving resource lock kube-system/kube-controller-manager: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 127.0.0.1:6443: connect: connection refused {"component": "vcluster", "component": "controller-manager", "location": "leaderelection.go:332"}

and it takes more than 60 minutes before the cluster reaches a healthy state. It seems very odd to me that it takes that long to create a virtual cluster.
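
A few host-side commands that might show where startup is stuck (image pulls, PVC binding, scheduling); the pod name is an assumption based on the my-vcluster release name:

$ kubectl -n vcluster-ns get pods -o wide
$ kubectl -n vcluster-ns describe pod my-vcluster-0
$ kubectl -n vcluster-ns get pvc
$ kubectl -n vcluster-ns get events --sort-by=.lastTimestamp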

MichaelKora avatar Jun 05 '24 14:06 MichaelKora

@MichaelKora how many nodes does your host cluster have, and what capacity? Do you use network policies?

everflux avatar Jun 10 '24 11:06 everflux

hey @everflux, I dedicated 2 nodes of the host cluster to the vcluster (8 CPU / 32 GB). I am not using any restrictive network policies.

MichaelKora avatar Jun 10 '24 13:06 MichaelKora

This sounds like a setup problem to me, either with the host cluster or vcluster. Did you try to set up one or multiple vclusters? (Check kubectl get all -n vcluster-ns and kubectl get ns.) I am afraid a GitHub issue might not be the right place to discuss this; perhaps the Slack channel would be better suited.

everflux avatar Jun 13 '24 21:06 everflux

@MichaelKora Are you still having issues or were you able to resolve them?

deniseschannon avatar Sep 03 '24 20:09 deniseschannon

hey @deniseschannon, yes, I am still having the issue!

MichaelKora avatar Sep 05 '24 07:09 MichaelKora

> This sounds like a setup problem to me, either with the host cluster or vcluster. Did you try to set up one or multiple vclusters? (Check kubectl get all -n vcluster-ns and kubectl get ns.) I am afraid a GitHub issue might not be the right place to discuss this; perhaps the Slack channel would be better suited.

@everflux I have just one setup.

MichaelKora avatar Sep 06 '24 08:09 MichaelKora

@deniseschannon @heiko-braun @everflux Any update on the origin of that issue and how to fix it?

MichaelKora avatar Nov 27 '24 10:11 MichaelKora

I think the Slack channel or direct consulting is a better place for support than a GitHub issue in this case.

everflux avatar Nov 27 '24 10:11 everflux

@MichaelKora - Using Slack would be better for troubleshooting, and if you could try the latest version of vcluster, that would also be great. When troubleshooting with v0.20+, it would help if you could also provide your vcluster.yaml.

I'm going to close this issue, and if you are still having issues, can you open a new one? Thanks!

deniseschannon avatar Jan 01 '25 16:01 deniseschannon