
FluentDConfigCheck does not get scheduled due to inherited anti-affinity rules

Open · timbrd opened this issue 3 years ago · 10 comments

I have added the following pod anti-affinity rules to my fluentd config to ensure that the pods are spread across the nodes:

    fluentd:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: "app.kubernetes.io/name"
                  operator: In
                  values:
                    - fluentd
                - key: "app.kubernetes.io/component"
                  operator: In
                  values:
                    - fluentd
            topologyKey: "kubernetes.io/hostname"

Now the fluentd-configcheck pods stay in a Pending state. It seems that they inherit the configuration of the fluentd StatefulSet.

18m         Warning   FailedScheduling    pod/logging-operator-fluentd-configcheck-e97a6f8f     0/8 nodes are available: 2 node(s) didn't match Pod's node affinity/selector, 3 node(s) didn't match pod affinity/anti-affinity rules, 3 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {CriticalAddonsOnly: true}, that the pod didn't tolerate.

Is there a way to make sure that the fluentd pods are distributed, but the configcheck pods are still scheduled?

timbrd avatar Sep 01 '21 17:09 timbrd

Unfortunately, there is currently no way around this other than disabling the configcheck. With a few changes it could be made configurable separately, though.
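
For reference, turning the check off on the Logging resource might look like the minimal sketch below; this assumes the `flowConfigCheckDisabled` field of the Logging spec, so please verify it against your operator version:

    apiVersion: logging.banzaicloud.io/v1beta1
    kind: Logging
    metadata:
      name: logging
    spec:
      controlNamespace: logging
      flowConfigCheckDisabled: true  # assumption: skips creating the configcheck pod
      fluentd: {}    # the anti-affinity rules from the report above can then stay in place
      fluentbit: {}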

pepov avatar Sep 02 '21 05:09 pepov

Thanks, I have removed the affinity rules for now. My other idea, reducing the number of fluentd replicas so that the configcheck can run periodically, only works as long as all the nodes are available and ready.

I think it would be useful to be able to configure the configcheck separately.

timbrd avatar Sep 02 '21 07:09 timbrd

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions!

stale[bot] avatar Apr 12 '23 10:04 stale[bot]

Can't we just filter out the pod anti-affinities altogether for the fluentd configcheck pod here? https://github.com/kube-logging/logging-operator/blob/e0331c4b508ff54e8b0958d29ace7e8d7427674b/pkg/resources/fluentd/appconfigmap.go#L222

aslafy-z avatar Apr 14 '23 20:04 aslafy-z

Yes, I don't think it makes sense to apply the affinity rules to the configcheck pod.

pepov avatar Apr 16 '23 14:04 pepov

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions!

stale[bot] avatar Jun 15 '23 15:06 stale[bot]

I still don't think there is a good reason to apply affinity rules to the configcheck pod, but I also don't want to break this for those who might rely on it (even if only accidentally right now).

The proper solution would be to add override options for the configcheck pod, but the logging resource is already very big and would require restructuring.

So the options I see:

  • we disable inheriting affinity rules (both affinity and anti-affinity) for configcheck pods (backwards incompatible)
  • we add a flag to control whether inheriting the rules is allowed (see the sketch after this list), but then:
    • if the flag defaults to off (inheriting disabled by default), existing setups could break, so users have to opt in
    • if the flag defaults to on (inheriting enabled by default), users can opt out, but that is a poor experience when inheriting is rarely needed
  • we live with this and say that the configcheck pod requires an extra node in this specific case.
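
To illustrate the flag option, an opt-in could look roughly like the sketch below; the `configCheckInheritAffinity` field is purely hypothetical and does not exist in the operator, it is shown only to convey the shape of the idea:

    apiVersion: logging.banzaicloud.io/v1beta1
    kind: Logging
    metadata:
      name: logging
    spec:
      controlNamespace: logging
      # hypothetical flag, only illustrating the opt-in idea from the list above
      configCheckInheritAffinity: false
      fluentd:
        affinity:
          podAntiAffinity: {}   # anti-affinity rules as in the original report
      fluentbit: {}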

What do you think?

also cc @ahma @tarokkk

pepov avatar Jun 27 '23 12:06 pepov

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions!

stale[bot] avatar Sep 02 '23 11:09 stale[bot]

Any updates?

rerime avatar Jul 30 '24 10:07 rerime

After all this time I would simply go with this: https://github.com/kube-logging/logging-operator/pull/1787

pepov avatar Jul 30 '24 13:07 pepov