
vCluster creating more trouble than helping (due to different causes)

Open MichaelKora opened this issue 1 year ago • 12 comments

What happened?

I honestly don't know if it is supposed to be this hard, but vcluster is creating more trouble than it solves. I've been working on getting a production-ready cluster for over a week and it's not working. Right now the cluster is up and running and I connect to it using a NodePort service. The issues:

  1. When I deploy the cluster, it takes over 60 minutes for the cluster pods to reach a healthy running state, so that I can communicate with the vcluster.
  2. The CoreDNS pod, though in state Running, is full of errors (a few diagnostic commands are sketched below the list):
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:231: Failed to watch *v1.Service: failed to list *v1.Service: Unauthorized

[INFO] plugin/kubernetes: pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:231: failed to list *v1.Service: Unauthorized
[ERROR] plugin/kubernetes: pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:231: Failed to watch *v1.Service: failed to list *v1.Service: Unauthorized

[INFO] plugin/kubernetes: Trace[944124959]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:231 (24-May-2024 11:55:04.692) (total time: 16726ms):
    Trace[944124959]: ---"Objects listed" error:<nil> 16726ms (11:55:21.419)
    Trace[944124959]: [16.726980037s] [16.726980037s] END

[INFO] plugin/kubernetes: Trace[1216864093]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:231 (24-May-2024 11:54:50.867) (total time: 30559ms):
    Trace[1216864093]: ---"Objects listed" error:<nil> 30559ms (11:55:21.426)
    Trace[1216864093]: [30.55934034s] [30.55934034s] END

[INFO] plugin/kubernetes: Trace[1087029931]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:231 (24-May-2024 11:54:49.127) (total time: 32304ms):
    Trace[1087029931]: ---"Objects listed" error:<nil> 32304ms (11:55:21.432)
    Trace[1087029931]: [32.304771091s] [32.304771091s] END
  3. When connected to the vcluster, the same request delivers different responses each time. E.g., running kubectl get namespaces might show 4 namespaces, then 4, and then 6, etc.

  4. Running Helm against the vcluster is nearly impossible; it times out nearly every single time.

This is all frustrating because I was expecting vCluster to be much easier to use.
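
A few commands that could help narrow down the CoreDNS "Unauthorized" errors from item 2. This is only a diagnostic sketch; the coredns deployment name and namespace are the usual defaults and may differ in a given setup:

$ # connect to the virtual cluster
$ vcluster connect my-vcluster -n vcluster-ns

$ # check whether the CoreDNS service account may list services at all
$ kubectl auth can-i list services --as=system:serviceaccount:kube-system:coredns

$ # inspect the logs, then restart CoreDNS to force a fresh service account token
$ kubectl -n kube-system logs deploy/coredns
$ kubectl -n kube-system rollout restart deploy/coredns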

What did you expect to happen?

I deployed a cluster and expected the CoreDNS pods to be deployed without errors.

How can we reproduce it (as minimally and precisely as possible)?

# vcluster.yaml
exportKubeConfig:
  context: "sharedpool-context"
controlPlane:
  coredns:
    enabled: true
    embedded: false
    deployment:
      replicas: 2
      nodeSelector:
        workload: wk1
  statefulSet:
    highAvailability:
      replicas: 2
    persistence:
      volumeClaim:
        enabled: true
    scheduling:
      nodeSelector:
        workload: wk1
    resources:
      limits:
        ephemeral-storage: 20Gi
        memory: 10Gi
      requests:
        ephemeral-storage: 200Mi
        cpu: 200m
        memory: 256Mi
    
  proxy:
    bindAddress: "0.0.0.0"
    port: 8443
    extraSANs:
      - XX.XX.XX.XXX
      - YY.YY.YY.YYY

$ helm upgrade -i my-vcluster vcluster \
  --repo https://charts.loft.sh \
  --namespace vcluster-ns --create-namespace \
  --repository-config='' \
  -f vcluster.yaml \
  --version 0.20.0-beta.5

Anything else we need to know?

I used a NodePort service to connect to the cluster:

# nodeport.yaml

apiVersion: v1
kind: Service
metadata:
  name: vcluster-nodeport
  namespace: vcluster-ns
spec:
  selector:
    app: vcluster
    release: shared-pool-vcluster
  ports:
    - name: https
      port: 443
      targetPort: 8443
      protocol: TCP
      nodePort: 31222
  type: NodePort
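
To reproduce the connection path, the CLI can also be pointed at that NodePort directly. A minimal sketch, assuming the Service above and one of the extraSANs addresses from vcluster.yaml (replace <node-ip> accordingly):

$ vcluster connect my-vcluster -n vcluster-ns --server=https://<node-ip>:31222

The failing Helm installs mentioned in the description look like this when run against the vcluster:
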
$ helm upgrade -i solr-operator apache-solr/solr-operator --version 0.8.1 -n solr-cloud

Release "solr-operator" does not exist. Installing it now.
Error: failed post-install: 1 error occurred:
        * timed out waiting for the condition


$ helm upgrade -i hz-operator hazelcast/hazelcast-platform-operator -n hz-vc-ns --create-namespace -f operator.yaml

Release "hz-operator" does not exist. Installing it now.
Error: 9 errors occurred:
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Timeout: request did not complete within requested timeout - context deadline exceeded
        * Internal error occurred: resource quota evaluation timed out
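
Raising Helm's client-side timeout (default 5 minutes) at least distinguishes a slow API server from a hung one. A sketch reusing the solr-operator install above; the 15m value is an arbitrary assumption:

$ helm upgrade -i solr-operator apache-solr/solr-operator --version 0.8.1 \
    -n solr-cloud --timeout 15m --wait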

Host cluster Kubernetes version

$ kubectl version
Client Version: v1.29.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.11

Host cluster Kubernetes distribution

vcluster version

$ vcluster --version
vcluster version 0.19.5

Vcluster Kubernetes distribution (k3s (default), k8s, k0s)

default (I did not specify a specific distribution)

OS and Arch

OS: talos
Arch: metal-amd64

MichaelKora avatar May 22 '24 09:05 MichaelKora

hey @MichaelKora, it's unfortunate that you have to experience these troubles.

One thing that I'd recommend is to use the latest vcluster CLI together with 0.20.0-beta.5. From the description it appears that you are using the 0.19.5 one instead.

Regarding the other issues: it's a bit hard to say from the outset what might be causing them. You seem to be leveraging Talos. What Kubernetes distro is running on top of it?

heiko-braun avatar Jun 03 '24 07:06 heiko-braun

hey @heiko-braun, 0.19.5 is the latest according to the vcluster CLI:

$ sudo vcluster upgrade
15:55:48 info Current binary is the latest version: 0.19.5

I have Talos running there (the default image); it's based on k3s.

MichaelKora avatar Jun 04 '24 14:06 MichaelKora

Hi @MichaelKora, you can get the latest CLI (the one to be used with the 0.20 vcluster.yaml) here: https://github.com/loft-sh/vcluster/releases/tag/v0.20.0-beta.6
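
For reference, one way to fetch that specific pre-release build; the asset name assumes a Linux amd64 machine and the naming scheme used on the releases page:

$ curl -L -o vcluster "https://github.com/loft-sh/vcluster/releases/download/v0.20.0-beta.6/vcluster-linux-amd64"
$ chmod +x vcluster && sudo mv vcluster /usr/local/bin/vcluster
$ vcluster --version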

heiko-braun avatar Jun 05 '24 09:06 heiko-braun

@MichaelKora regarding the Hazelcast and Solr examples in the description: did you run the commands against the host cluster or the virtual one?

heiko-braun avatar Jun 05 '24 09:06 heiko-braun

@heiko-braun thanks for your response. I ran the commands against the vcluster; when run against the host cluster, I have no issues.

MichaelKora avatar Jun 05 '24 11:06 MichaelKora

@heiko-braun when the cluster is being created, the logs show:

2024-06-05 14:38:37 INFO setup/controller_context.go:196 couldn't retrieve virtual cluster version (Get "https://127.0.0.1:6443/version": dial tcp 127.0.0.1:6443: connect: connection refused), will retry in 1 seconds {"component": "vcluster"}

2024-06-05 14:38:38 INFO setup/controller_context.go:196 couldn't retrieve virtual cluster version (Get "https://127.0.0.1:6443/version": dial tcp 127.0.0.1:6443: connect: connection refused), will retry in 1 seconds {"component": "vcluster"}

2024-06-05 14:38:39 INFO setup/controller_context.go:196 couldn't retrieve virtual cluster version (Get "https://127.0.0.1:6443/version": dial tcp 127.0.0.1:6443: connect: connection refused), will retry in 1 seconds {"component": "vcluster"}

2024-06-05 14:38:40 INFO commandwriter/commandwriter.go:126 error retrieving resource lock kube-system/kube-controller-manager: Get "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 127.0.0.1:6443: connect: connection refused {"component": "vcluster", "component": "controller-manager", "location": "leaderelection.go:332"}

and it takes more than 60 minutes before the cluster reaches a healthy state. It seems very odd to me that it takes that long to create a virtual cluster.
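
A few host-side commands that might show where startup is stuck (image pulls, PVC binding, scheduling); the pod name is an assumption based on the my-vcluster release name:

$ kubectl -n vcluster-ns get pods -o wide
$ kubectl -n vcluster-ns describe pod my-vcluster-0
$ kubectl -n vcluster-ns get pvc
$ kubectl -n vcluster-ns get events --sort-by=.lastTimestamp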

MichaelKora avatar Jun 05 '24 14:06 MichaelKora

@MichaelKora how many nodes does your host cluster have, and what capacity? Do you use network policies?

everflux avatar Jun 10 '24 11:06 everflux

hey @everflux, I dedicated 2 nodes of the host cluster to the vcluster (8 CPU / 32 GB). I am not using any restrictive network policies.

MichaelKora avatar Jun 10 '24 13:06 MichaelKora

This sounds like a setup problem to me, either with the host cluster or vcluster. Did you try to set up one or multiple vclusters? (Check kubectl get all -n vcluster-ns and kubectl get ns.) I am afraid a GitHub issue might not be the right place to discuss this; perhaps the Slack channel would be better suited.

everflux avatar Jun 13 '24 21:06 everflux

@MichaelKora Are you still having issues or were you able to resolve them?

deniseschannon avatar Sep 03 '24 20:09 deniseschannon

hey @deniseschannon, yes, I am still having the issue!

MichaelKora avatar Sep 05 '24 07:09 MichaelKora

> This sounds like a setup problem to me, either with the host cluster or vcluster. Did you try to set up one or multiple vclusters? (Check kubectl get all -n vcluster-ns and kubectl get ns.) I am afraid a GitHub issue might not be the right place to discuss this; perhaps the Slack channel would be better suited.

@everflux I have just one setup.

MichaelKora avatar Sep 06 '24 08:09 MichaelKora

@deniseschannon @heiko-braun @everflux Any update on the origin of that issue and how to fix it?

MichaelKora avatar Nov 27 '24 10:11 MichaelKora

I think the Slack channel or direct consulting is a better place for support than a GitHub issue in this case.

everflux avatar Nov 27 '24 10:11 everflux

@MichaelKora - Using Slack would be better for troubleshooting, and if you could try the latest version of vcluster, that would also be great. When troubleshooting with v0.20+, it would help if you could also provide your vcluster.yaml.

I'm going to close this issue, and if you are still having issues, can you open a new one? Thanks!

deniseschannon avatar Jan 01 '25 16:01 deniseschannon