Host receiving constant inserts will never succeed in the `wait for query finish` check
For observability/telemetry data, a host that is constantly receiving inserts always ends up waiting the full `wait for query finish` interval before it can roll out and restart.
We use a service template for `ingest-<cluster>` and insert directly into these pods (no distributed inserts). The feature request is to allow removing these pods from this service before waiting for queries to finish, so that we do not have to interrupt in-flight inserts but still have the control to shut the pods down a bit quicker.
@tanner-bruce, you can disable waiting for queries globally or at the CHI level.
In operator configuration:
spec:
  reconcile:
    host:
      wait:
        queries: "false"
In CHI:
spec:
  reconciling:
    policy: nowait
@alex-zaitsev my understanding was that this will restart the pod immediately. I would prefer to let currently running inserts finish, since they can be relatively large. I think setting `ready: false` before waiting would accomplish this.
Could you share service template for the service you are using for inserts?
This is set as the `clusterServiceTemplate`:
- generateName: chi-{cluster}-cluster
  metadata:
    annotations:
      cloud.google.com/load-balancer-type: Internal
      networking.gke.io/internal-load-balancer-allow-global-access: "true"
  name: core-cluster-service-template
  spec:
    ports:
    - name: http
      port: 8123
      targetPort: 8123
    - name: tcp
      port: 9000
      targetPort: 9000
    - name: https
      port: 8443
      protocol: TCP
      targetPort: 8443
    - name: tcp-tls
      port: 9440
      protocol: TCP
      targetPort: 9440
    type: LoadBalancer
> ...will never succeed in the wait for query finish check..

Have you encountered or tested this? The wait is expected to last no longer than `time.Duration(config.Reconcile.StatefulSet.Update.Timeout) * time.Second`, which is 300 seconds by default.
Yes, they always wait the entire timeout and then fail queries anyway, since they continue receiving inserts. It slows down the reconciliation loop for a CHI unnecessarily: 300 seconds times the number of pods is a long time with 20+ pods (100 minutes or more).