
housekeeping pod stuck in ContainerCreating state

Atoms opened this issue 3 years ago • 16 comments

When housekeeping and persistence are both enabled, pods that share volumes need to be scheduled on the same node, at least on GCP, because the default storage classes there can only mount a volume on one node at a time. Looking at the CronJob template, I see that it tries to mount those volumes, but according to the https://demo.netbox.dev/static/docs/administration/housekeeping/ documentation, housekeeping only needs database access.

Another possible solution would be to write a podAffinity rule so that the CronJob's pod is started on the same node as the NetBox application.

Atoms avatar Dec 20 '21 12:12 Atoms

We could probably remove some of the volume mounts from the housekeeping job, though I can see how it might be needed in future.

I wouldn't want to second guess about the affinities. You might be running with ReadWriteMany volumes (e.g. NFS, EFS, or similar) which are fine to be mounted across multiple nodes. If you're using a ReadWriteOnce volume then it's up to you to supply the correct affinity settings to allow that to work.
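
For reference, this is roughly what a ReadWriteMany claim looks like; the storage class name below is illustrative and depends on what your cluster actually provides (e.g. an NFS or EFS provisioner):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: netbox-media
spec:
  accessModes:
    - ReadWriteMany            # can be mounted by pods on multiple nodes
  storageClassName: nfs-client # illustrative; must exist in your cluster
  resources:
    requests:
      storage: 30Gi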

I'd happily accept PRs to make the volume mounting in the housekeeping job optional, and also to improve the documentation about this problem and add example affinity settings.
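
As a sketch of what "optional" could look like in templates/cronjob.yaml, guarded by a hypothetical housekeeping.mountMedia value (not an existing chart option; mount paths taken from the current pod spec):

          volumeMounts:
            - name: config
              mountPath: /run/config/netbox
              readOnly: true
            {{- if .Values.housekeeping.mountMedia }}
            # hypothetical flag: only mount media when explicitly requested
            - name: media
              mountPath: /opt/netbox/netbox/media
            {{- end }}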

bootc avatar Dec 20 '21 12:12 bootc

Sure, I will start working on this :)

Atoms avatar Dec 20 '21 12:12 Atoms

Actually, it is the same situation with the worker pod too.

Atoms avatar Dec 20 '21 12:12 Atoms

While working on this, I also found that if ReadWriteOnce is used, housekeeping has to be disabled, because the CronJob does not have a podAffinity definition.

Atoms avatar Dec 22 '21 09:12 Atoms

There's a housekeeping.affinity setting that does what you want, I think?

bootc avatar Jan 02 '22 16:01 bootc

CronJob does not support affinity, as per the spec: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/cron-job-v1/

Atoms avatar Jan 02 '22 20:01 Atoms

CronJob doesn't need to support it, it's not its job. All a CronJob does is create Jobs from the jobTemplate based on the schedule. Similarly, Jobs don't do much more than create Pods based on their template field. That template, though, is a full Pod spec, and that's available in the spec.jobTemplate.spec.template.spec field. I haven't tested it, but I don't see any reason the affinity field we've already got in our housekeeping CronJob template wouldn't work.
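
For illustration, a rendered CronJob with affinity on the pod spec looks like this (trimmed skeleton, not the chart's exact output):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: netbox-housekeeping
spec:
  schedule: "0 0 * * *"
  jobTemplate:
    spec:
      template:
        spec:                  # <- this is a full Pod spec
          affinity:            # <- so affinity is valid here
            podAffinity: {}
          containers:
            - name: netbox-housekeeping
              image: netboxcommunity/netbox:v3.0.11
          restartPolicy: OnFailure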

bootc avatar Jan 02 '22 22:01 bootc

OK, my bad. I saw an error about that affinity when I tried it on the CronJob, but that was caused by wrong indentation in the definition.

Atoms avatar Jan 03 '22 07:01 Atoms

@Atoms is right about the indentation in the affinity definition. It should be nindent 12 instead of nindent 8.

There are 3 lines that need to be fixed. Should I open a PR? https://github.com/bootc/netbox-chart/blob/master/templates/cronjob.yaml#L161 https://github.com/bootc/netbox-chart/blob/master/templates/cronjob.yaml#L165 https://github.com/bootc/netbox-chart/blob/master/templates/cronjob.yaml#L169
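
For context, the fix amounts to rendering the block at pod-spec depth inside the job template, roughly (surrounding template lines paraphrased, not copied from the chart):

      template:
        spec:
          {{- with .Values.housekeeping.affinity }}
          affinity:
            {{- toYaml . | nindent 12 }}
          {{- end }}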

miso231 avatar Jan 11 '22 08:01 miso231

Oh d'oh. Yes please, a PR would be really handy, thanks.

bootc avatar Jan 11 '22 09:01 bootc

I would like to revive this as the issue still exists... this is what I have:

admin@azure-box:~/Netbox/internal$ kubectl get pod -n netbox -o wide
NAME                                 READY   STATUS              RESTARTS   AGE     IP             NODE                                NOMINATED NODE   READINESS GATES
netbox-64d9d994c5-g44rg              1/1     Running             0          2d22h   10.244.1.101   aks-nodepool1-27323697-vmss000000   <none>           <none>
netbox-housekeeping-27450720-ktlgb   0/1     ContainerCreating   0          2d10h   <none>         aks-nodepool1-27323697-vmss00000c   <none>           <none>
netbox-postgresql-0                  1/1     Running             0          2d22h   10.244.9.50    aks-nodepool1-27323697-vmss00000a   <none>           <none>
netbox-redis-master-0                1/1     Running             0          2d22h   10.244.0.29    aks-nodepool1-27323697-vmss00000e   <none>           <none>
netbox-redis-replicas-0              1/1     Running             0          2d22h   10.244.3.24    aks-nodepool1-27323697-vmss00000c   <none>           <none>
netbox-redis-replicas-1              1/1     Running             0          2d22h   10.244.2.79    aks-nodepool1-27323697-vmss000001   <none>           <none>
netbox-redis-replicas-2              1/1     Running             0          2d22h   10.244.1.102   aks-nodepool1-27323697-vmss000000   <none>           <none>
netbox-worker-5b74cfd4-rnxqk         1/1     Running             2          2d22h   10.244.1.100   aks-nodepool1-27323697-vmss000000   <none>           <none>

and this is the describe output:

admin@azure-box:~/Netbox/internal$ kubectl describe pod netbox-housekeeping-27450720-ktlgb -n netbox
Name:           netbox-housekeeping-27450720-ktlgb
Namespace:      netbox
Priority:       0
Node:           aks-nodepool1-27323697-vmss00000c/172.19.128.7
Start Time:     Sat, 12 Mar 2022 00:00:00 +0000
Labels:         app.kubernetes.io/component=housekeeping
                app.kubernetes.io/instance=netbox
                app.kubernetes.io/name=netbox
                controller-uid=7f27bfeb-a813-4065-aa83-4921e34e0b23
                job-name=netbox-housekeeping-27450720
Annotations:    <none>
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  Job/netbox-housekeeping-27450720
Containers:
  netbox-housekeeping:
    Container ID:  
    Image:         netboxcommunity/netbox:v3.0.11
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/netbox/venv/bin/python
      /opt/netbox/netbox/manage.py
      housekeeping
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/netbox/config/configuration.py from config (ro,path="configuration.py")
      /opt/netbox/netbox/media from media (rw)
      /run/config/netbox from config (ro)
      /run/secrets/netbox from secrets (ro)
      /tmp from netbox-tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-chp5t (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      netbox
    Optional:  false
  secrets:
    Type:                Projected (a volume that contains injected data from multiple sources)
    SecretName:          netbox
    SecretOptionalName:  <nil>
    SecretName:          netbox-postgresql
    SecretOptionalName:  <nil>
    SecretName:          netbox-redis
    SecretOptionalName:  <nil>
    SecretName:          netbox-redis
    SecretOptionalName:  <nil>
  netbox-tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  media:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  netbox-media
    ReadOnly:   false
  kube-api-access-chp5t:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age                      From     Message
  ----     ------       ----                     ----     -------
  Warning  FailedMount  58m (x248 over 2d10h)    kubelet  Unable to attach or mount volumes: unmounted volumes=[media], unattached volumes=[secrets netbox-tmp media kube-api-access-chp5t config]: timed out waiting for the condition
  Warning  FailedMount  22m (x483 over 2d10h)    kubelet  Unable to attach or mount volumes: unmounted volumes=[media], unattached volumes=[config secrets netbox-tmp media kube-api-access-chp5t]: timed out waiting for the condition
  Warning  FailedMount  17m (x260 over 2d10h)    kubelet  Unable to attach or mount volumes: unmounted volumes=[media], unattached volumes=[netbox-tmp media kube-api-access-chp5t config secrets]: timed out waiting for the condition
  Warning  FailedMount  12m (x268 over 2d10h)    kubelet  Unable to attach or mount volumes: unmounted volumes=[media], unattached volumes=[kube-api-access-chp5t config secrets netbox-tmp media]: timed out waiting for the condition
  Warning  FailedMount  3m46s (x277 over 2d10h)  kubelet  Unable to attach or mount volumes: unmounted volumes=[media], unattached volumes=[media kube-api-access-chp5t config secrets netbox-tmp]: timed out waiting for the condition

Here are all the PVCs:

admin@azure-box:~/Netbox/internal$ kubectl get pvc -n netbox
NAME                                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-netbox-postgresql-0             Bound    pvc-0cbe8766-8a6d-498d-a0d3-fb695c1cf212   8Gi        RWO            default        2d22h
netbox-media                         Bound    pvc-63a58576-8cc3-415a-9152-96a886ecb611   30Gi       RWO            default        2d22h
redis-data-netbox-redis-master-0     Bound    pvc-ab91656d-0d76-4c26-b24b-d40978830d74   8Gi        RWO            default        2d22h
redis-data-netbox-redis-replicas-0   Bound    pvc-dcc5762e-a67b-4803-8dfe-df268865e8cd   8Gi        RWO            default        2d22h
redis-data-netbox-redis-replicas-1   Bound    pvc-c6a23abf-1367-4f2f-bb96-d0150a1c53a7   8Gi        RWO            default        2d22h
redis-data-netbox-redis-replicas-2   Bound    pvc-2f925079-a52b-49ff-86a5-80a41218ef92   8Gi        RWO            default        2d22h
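
All of these are ReadWriteOnce, and the events above show only the media volume failing to mount: netbox-media is presumably already attached to vmss000000 (the node running the main netbox pod), so it cannot also be attached to vmss00000c where the housekeeping pod was scheduled. One way to confirm which node holds the attachment (PV name taken from the listing above):

kubectl get volumeattachment | grep pvc-63a58576-8cc3-415a-9152-96a886ecb611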

anubisg1 avatar Mar 14 '22 10:03 anubisg1

Additional information: if I add podAffinity under "netbox-housekeeping", it cannot be used.

Here is a values.yml extract:

  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/component
                  operator: In
                  values:
                    - netbox
            topologyKey: "kubernetes.io/hostname"

but that cannot be applied:

Error: UPGRADE FAILED: error validating "": error validating data: ValidationError(CronJob.spec.jobTemplate.spec.template): unknown field "podAffinity" in io.k8s.api.core.v1.PodTemplateSpec
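
The validation error suggests the affinity block is being rendered one level too shallow, i.e. the podAffinity key ends up directly under the pod template instead of inside its spec; roughly:

# what the error implies was rendered (invalid):
jobTemplate:
  spec:
    template:
      podAffinity: ...   # PodTemplateSpec has no such field

# what a valid manifest needs:
jobTemplate:
  spec:
    template:
      spec:
        affinity:
          podAffinity: ...

This looks consistent with the indentation issue discussed above.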

anubisg1 avatar Mar 15 '22 08:03 anubisg1

I had issues with the netbox and netbox-worker pods, but adding these affinity rules to values.yaml ensured they both started on the same node:

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: kubernetes.io/hostname
      labelSelector:
        matchExpressions:
        - key: app.kubernetes.io/component
          operator: In
          values:
          - worker
          - netbox

worker:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/component
            operator: In
            values:
            - worker
            - netbox

adamrushuk avatar Mar 24 '22 17:03 adamrushuk

> I had issues with the netbox and netbox-worker pods, but adding these affinity rules to values.yaml ensured they both started on the same node: [...]

Did you need netbox and the worker to be on the same node? Here we were talking about netbox and netbox-housekeeping.

anubisg1 avatar Mar 25 '22 09:03 anubisg1

> did you need netbox and the worker to be on the same node?

Yes, as they share a common volume. I expect this will be the case with the netbox-housekeeping pod too.

Looking at the code, the component label is housekeeping: https://github.com/bootc/netbox-chart/blob/master/templates/cronjob.yaml#L16

So try this in your values.yaml:

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: kubernetes.io/hostname
      labelSelector:
        matchExpressions:
        - key: app.kubernetes.io/component
          operator: In
          values:
          - housekeeping
          - netbox

housekeeping:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/component
            operator: In
            values:
            - housekeeping
            - netbox
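
If the affinity takes effect, the housekeeping pod should get scheduled next to the main netbox pod; the same command as before should then show them on the same node:

kubectl get pod -n netbox -o wide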

adamrushuk avatar Mar 25 '22 10:03 adamrushuk

PR #89 is a good start at a documentation update to "fix" this issue.

bootc avatar Apr 26 '22 19:04 bootc