Host receiving constant inserts will never succeed in the `wait for query finish` check

Open tanner-bruce opened this issue 6 months ago • 6 comments

For observability/telemetry data, a host that is constantly receiving inserts always ends up waiting for the full `wait for query finish` interval before it can roll out and restart.

We use a service template for ingest-<cluster> and insert directly into these pods (no distributed inserts). The feature request is to allow removing a pod from this service before waiting for its queries to finish. That way we do not have to interrupt in-flight inserts, and we also get the control to shut the pods down a bit quicker.

tanner-bruce avatar Jun 19 '25 15:06 tanner-bruce

@tanner-bruce, you can disable waiting for queries globally or at the CHI level.

In operator configuration:

spec:
  reconcile:
    host:
      wait:
        queries: "false"

In CHI:

spec:
  reconciling:
    policy: nowait

alex-zaitsev avatar Jun 19 '25 18:06 alex-zaitsev

@alex-zaitsev my understanding was that this will restart the pod immediately. I would prefer to let currently running inserts finish, since they can be relatively large. I think setting ready: false before waiting would accomplish this.

tanner-bruce avatar Jun 20 '25 14:06 tanner-bruce

Could you share the service template for the service you are using for inserts?

alex-zaitsev avatar Jun 21 '25 06:06 alex-zaitsev

This is set as the clusterServiceTemplate:

    - generateName: chi-{cluster}-cluster
      metadata:
        annotations:
          cloud.google.com/load-balancer-type: Internal
          networking.gke.io/internal-load-balancer-allow-global-access: "true"
      name: core-cluster-service-template
      spec:
        ports:
        - name: http
          port: 8123
          targetPort: 8123
        - name: tcp
          port: 9000
          targetPort: 9000
        - name: https
          port: 8443
          protocol: TCP
          targetPort: 8443
        - name: tcp-tls
          port: 9440
          protocol: TCP
          targetPort: 9440
        type: LoadBalancer

tanner-bruce avatar Jun 21 '25 12:06 tanner-bruce

...will never succeed in the wait for query finish check..

Have you encountered or tested this? The wait is expected to last no longer than time.Duration(config.Reconcile.StatefulSet.Update.Timeout) * time.Second, which is 300 seconds by default.

sunsingerus avatar Oct 16 '25 09:10 sunsingerus

Yes, they always wait the entire timeout and then fail queries anyway, since they continue receiving inserts. This unnecessarily slows down the reconciliation loop for a CHI: 300 seconds × the number of pods is a long time with 20+ pods.

tanner-bruce avatar Oct 20 '25 17:10 tanner-bruce
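
To put a number on the "300 × number of pods" estimate above, here is a minimal sketch of the arithmetic. It assumes hosts are reconciled one at a time and that every host waits out the full timeout, which is the worst case described in the thread; the operator may well process some hosts concurrently, in which case the real figure would be lower. The function name and constant are hypothetical and are not part of the operator.

```python
from datetime import timedelta

# Default quoted in the thread for the StatefulSet update timeout, in seconds.
DEFAULT_UPDATE_TIMEOUT_SECONDS = 300


def worst_case_rollout_wait(pod_count, timeout_seconds=DEFAULT_UPDATE_TIMEOUT_SECONDS):
    """Upper bound on time spent waiting for queries during one rollout.

    Assumption (not confirmed by the thread): hosts are handled strictly
    sequentially and each one waits the full timeout.
    """
    if pod_count < 0 or timeout_seconds < 0:
        raise ValueError("pod_count and timeout_seconds must be non-negative")
    return timedelta(seconds=pod_count * timeout_seconds)


if __name__ == "__main__":
    # Print the bound for a few cluster sizes, including the 20-pod case.
    for pods in (5, 20, 40):
        print(f"{pods} pods: {worst_case_rollout_wait(pods)}")
```

Under these assumptions, 20 pods at the default timeout spend 6000 seconds, or 100 minutes, just waiting for queries that never drain.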