
Large cluster deployment with 8 GiB RAM fails

Open · jefflill opened this issue 1 year ago · 2 comments

This appears to be a cluster advice issue. In this case, the tempo-ingester pods cannot be scheduled:

```
Name:                 tempo-ingester-0
Namespace:            neon-monitor
Priority:             900000000
Priority Class Name:  neon-min
Node:                 <none>
Labels:               app.kubernetes.io/component=ingester
                      app.kubernetes.io/instance=tempo
                      app.kubernetes.io/managed-by=Helm
                      app.kubernetes.io/name=tempo
                      app.kubernetes.io/version=1.3.2
                      controller-revision-hash=tempo-ingester-848cd8689
                      helm.sh/chart=tempo-distributed-0.16.9
                      statefulset.kubernetes.io/pod-name=tempo-ingester-0
                      tempo-gossip-member=true
Annotations:          checksum/config: c764e248482a115a73aaa4678cf3e9a5b9ead286adccfc738cfc4a2e3f314e1c
                      sidecar.istio.io/inject: false
                      traffic.sidecar.istio.io/excludeInboundPorts: 7946
                      traffic.sidecar.istio.io/excludeOutboundPorts: 7946
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        StatefulSet/tempo-ingester
Containers:
  ingester:
    Image:       registry.neon.local/neonkube/grafana-tempo:2.0.0
    Ports:       9095/TCP, 7946/TCP, 3100/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP
    Args:
      -target=ingester
      -config.file=/conf/tempo.yaml
      -mem-ballast-size-mbs=64
      -config.expand-env=true
    Limits:
      memory:  1Gi
    Requests:
      memory:   1Gi
    Readiness:  http-get http://:http/ready delay=30s timeout=1s period=10s #success=1 #failure=3
    Environment:
      ACCESS_KEY_ID:      <set to the key 'accesskey' in secret 'minio'>  Optional: false
      SECRET_ACCESS_KEY:  <set to the key 'secretkey' in secret 'minio'>  Optional: false
      GOGC:               10
    Mounts:
      /conf from tempo-conf (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xknb7 (ro)
      /var/tempo from data (rw)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-tempo-ingester-0
    ReadOnly:   false
  tempo-conf:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      tempo
    Optional:  false
  kube-api-access-xknb7:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              node.neonkube.io/monitor.traces-internal=true
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 30s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 30s
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  12m                default-scheduler  0/6 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 3 Insufficient memory, 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 5 node(s) didn't find available persistent volumes to bind. preemption: 0/6 nodes are available: 1 Insufficient memory, 5 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  11m (x4 over 12m)  default-scheduler  0/6 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 3 Insufficient memory, 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 5 node(s) had volume node affinity conflict. preemption: 0/6 nodes are available: 1 Insufficient memory, 5 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  4m (x5 over 11m)   default-scheduler  0/6 nodes are available: 3 Insufficient memory, 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 5 node(s) had volume node affinity conflict. preemption: 0/6 nodes are available: 1 Insufficient memory, 5 Preemption is not helpful for scheduling.
```
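For anyone reproducing this, the scheduler messages can be cross-checked against what the traces-internal workers can actually hold and whether the ingester's PVC ever bound. A rough sketch of the checks, assuming kubectl access to the affected cluster (the node label and PVC name come from the describe output above):

```
# Allocatable memory on the nodes the pod is allowed to land on
# (node selector from the pod spec above).
kubectl get nodes -l node.neonkube.io/monitor.traces-internal=true \
  -o custom-columns=NAME:.metadata.name,ALLOCATABLE_MEMORY:.status.allocatable.memory

# Memory already requested/limited on those nodes, per the scheduler's view.
kubectl describe nodes -l node.neonkube.io/monitor.traces-internal=true \
  | grep -A 8 "Allocated resources"

# Whether the ingester's PVC bound (relates to the "volume node affinity conflict").
kubectl get pvc -n neon-monitor data-tempo-ingester-0
```

If the allocatable figure is already mostly consumed by existing requests, the ingester's 1Gi request has nowhere to fit, which matches the "Insufficient memory" and "Preemption is not helpful" messages above.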

I've temporarily increased the node RAM for these test clusters from 8 GiB to 16 GiB.

jefflill · Jul 21 '23 19:07

@marcusbooyah looked at this and it's a problem with cluster advice. He hacked around it for clusters with 10 nodes or fewer, but we'll need to put more effort into how cluster advice works.
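As a data point for the cluster advice rework: listing the memory requests that advice assigned across neon-monitor makes it easy to see how quickly they add up on 8 GiB workers. A minimal sketch using plain kubectl (the jsonpath layout is just one way to print it):

```
# Print each neon-monitor pod with its containers' memory requests;
# summing these against an 8 GiB node's allocatable memory shows the squeeze.
kubectl get pods -n neon-monitor \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources.requests.memory}{"\n"}{end}'
```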

jefflill · Jul 23 '23 17:07

For now we should just recommend a 16 GiB minimum.

marcusbooyah · Feb 19 '24 20:02