ClickHouse replica cannot reach other replicas and crashes with a DNSResolver error
I have a ClickHouse cluster consisting of 1 shard and 3 replicas. Everything is deployed to Kubernetes using the Altinity ClickHouse Operator.
My environment:
- Docker image: clickhouse/clickhouse-server:23.3.8.21-alpine
- Kubernetes 1.19.2
- Altinity ClickHouse Operator 0.19.0
Situation: all three replicas crash for various reasons (node failure or HDD/SSD failure). After some time, one of the replicas tries to come up (the other two are still unavailable at this moment) and crashes with a DNSResolver error (all ZooKeeper nodes are available at this moment).
Question: can I achieve behavior in which one replica comes up while ignoring the unavailability of the others?
In the documentation I found the skip_unavailable_shards and dns_max_consecutive_failures parameters. Maybe a combination of these parameters would solve my case?
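If it helps, this is roughly how I understand those parameters would be set through the CHI manifest (a sketch; the values and the placement of skip_unavailable_shards under the default profile are my assumptions):
spec:
  configuration:
    settings:
      dns_max_consecutive_failures: 10
    profiles:
      default/skip_unavailable_shards: 1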
Could you show the clickhouse or clickhouse-pod container log with the error from the chi-{chi-name}-{cluster-name}-{shard}-{replica}-0 pod?
Try changing the CHI:
spec:
  configuration:
    settings:
      disable_internal_dns_cache: 1
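disable_internal_dns_cache=1 makes ClickHouse resolve host names on every lookup instead of relying on its internal DNS cache. After editing the CHI, re-apply it so the operator can roll the change out, for example (namespace and file name are placeholders):
kubectl -n <namespace> apply -f chi.yaml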
I enabled disable_internal_dns_cache in my cluster config, turned off all replicas, then tried to bring one of them up (all ZooKeeper nodes were available at that moment). Result:
Settings:
Also tried adding the skip_unavailable_shards setting:
Result: the replica also crashes with DNS_ERROR.
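For reference, the combination I tried corresponds to roughly this fragment of the CHI spec (a sketch rather than the literal manifest; the placement of skip_unavailable_shards under the default profile is an assumption):
spec:
  configuration:
    settings:
      disable_internal_dns_cache: 1
    profiles:
      default/skip_unavailable_shards: 1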
skip_unavailable_shards is unrelated.
Could you share the output of
kubectl get chi --all-namespaces
All CHI in my test k8s cluster:
NAMESPACE    NAME          CLUSTERS   HOSTS   STATUS      AGE
infra-test   grif-chi-ss   1          3       Completed   386d
Could you share
kubectl get chi -n infra-test grif-chi-ss -o yaml
without sensitive credentials?
@Programmeris, what is the reason for using outdated versions of Kubernetes and the operator? Please use the latest operator version, and also consider a Kubernetes upgrade. Kubernetes 1.19 reached EOL in 2021.
@alex-zaitsev At the moment there is no way to update this particular k8s cluster, for various reasons. An update is planned for the future. However, I'm not sure my problem is related to an outdated version of k8s or the operator.
I'm still waiting for the CHI resource: https://github.com/Altinity/clickhouse-operator/issues/1212#issuecomment-1660246872
@Slach CHI YAML definition in my test k8s cluster (managed fields and credentials were removed): chi.txt
At the moment the replicas are working fine. The problem is reproduced if you shut down all the replicas at once and then bring them back up one by one.
Could you provide the commands that perform exactly this sequence?
Was the shared chi.txt edited manually?
Do you use kind: ClickHouseInstallationTemplate manifests?
Could you share the output of
kubectl get chit --all-namespaces
Maybe you edited the default clickhouse-operator config?
Could you share the output of
kubectl get cm -n infra-test etc-clickhouse-operator-files -o yaml
I see
status:
  fqdns:
  - chi-grif-chi-ss-test-facts-sgl-0-0.infra-test.svc.cluster.local
  - chi-grif-chi-ss-test-facts-sgl-0-1.infra-test.svc.cluster.local
  - chi-grif-chi-ss-test-facts-sgl-0-2.infra-test.svc.cluster.local
which does not correlate with
kind: ClickHouseInstallation
metadata:
  name: chi-grif-chi-ss
...
spec:
  configuration:
    clusters:
    - layout:
        replicasCount: 3
        shardsCount: 1
      name: grif-facts-sgl
  
Service names follow the default naming convention chi-{chi_name}-{cluster_name}-{shard_index}-{replica_index},
so it should be chi-chi-grif-chi-ss-grif-facts-sgl-0-0,
which is different from chi-grif-chi-ss-test-facts-sgl-0-0.
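Applying that convention to the shared manifest (chi name chi-grif-chi-ss, cluster name grif-facts-sgl), I would expect the status FQDNs to look like this instead:
status:
  fqdns:
  - chi-chi-grif-chi-ss-grif-facts-sgl-0-0.infra-test.svc.cluster.local
  - chi-chi-grif-chi-ss-grif-facts-sgl-0-1.infra-test.svc.cluster.local
  - chi-chi-grif-chi-ss-grif-facts-sgl-0-2.infra-test.svc.cluster.local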