
ClickHouse replica cannot reach other replicas and crashes with a DNSResolver error

Programmeris opened this issue 2 years ago · 11 comments

I have a ClickHouse cluster consisting of 1 shard and 3 replicas. Everything is deployed to Kubernetes using the Altinity ClickHouse Operator.

My Environment:

  • Docker image: clickhouse/clickhouse-server:23.3.8.21-alpine
  • Kubernetes 1.19.2
  • Altinity ClickHouse Operator 0.19.0

Situation: ALL three replicas crash for various reasons (node failure or HDD/SSD failure). After some time, one of the replicas tries to start (the other two are still unavailable at this moment) and crashes with a DNSResolver error (all ZooKeeper nodes are available at this moment).

Question: can I achieve behavior in which one replica starts up, ignoring the unavailability of the others?

In the documentation, I found the skip_unavailable_shards and dns_max_consecutive_failures parameters. Maybe a combination of these parameters will solve my case?
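
In case it helps, this is roughly where I understand those two settings would go in the CHI manifest (skip_unavailable_shards is a per-query/profile setting, dns_max_consecutive_failures a server setting); the placement and values below are my assumption, not a tested config:

spec:
  configuration:
    settings:
      # server-level: how many failed resolutions are tolerated
      # before a host is dropped from the DNS cache (value illustrative)
      dns_max_consecutive_failures: 10
    profiles:
      # profile-level: let distributed queries skip shards
      # that have no reachable replicas
      default/skip_unavailable_shards: 1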

Programmeris avatar Jul 31 '23 11:07 Programmeris

Could you show the clickhouse or clickhouse-pod container log with the error from the chi-{chi-name}-{cluster-name}-{shard}-{replica}-0 pod?
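
For example, something like this (pod name follows the operator's template; --previous shows the log of the last crashed container instance):

# container name is clickhouse or clickhouse-pod, depending on your template
kubectl logs -n {namespace} chi-{chi-name}-{cluster-name}-{shard}-{replica}-0 -c clickhouse --previous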

Try changing the CHI:

spec:
  configuration:
    settings:
      disable_internal_dns_cache: 1
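
(For context: settings under spec.configuration.settings are rendered by the operator into a generated config.d fragment; the effect should be roughly the server config below, though the exact generated file is operator-internal.)

<clickhouse>
    <!-- resolve hostnames on every use instead of caching them -->
    <disable_internal_dns_cache>1</disable_internal_dns_cache>
</clickhouse>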

Slach avatar Jul 31 '23 12:07 Slach

Enabled disable_internal_dns_cache in my cluster config. Turned off all replicas, then tried to start one of them (all ZooKeeper nodes were available at that moment). Result:

[screenshot from 2023-08-01 11-24-29]

Settings:

[screenshot from 2023-08-01 11-32-54]

Programmeris avatar Aug 01 '23 08:08 Programmeris

Also tried adding the skip_unavailable_shards setting [screenshot from 2023-08-01 12-49-00]. Result: the replica still crashes with DNS_ERROR.

Programmeris avatar Aug 01 '23 09:08 Programmeris

skip_unavailable_shards is unrelated.

Could you share the output of kubectl get chi --all-namespaces?
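
Also, to check name resolution from inside the failing pod, something like this can help (assuming the image ships nslookup; substitute your own names):

# {peer-replica-fqdn} is the replica hostname that fails to resolve
kubectl exec -n {namespace} {pod-name} -c clickhouse -- nslookup {peer-replica-fqdn}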

Slach avatar Aug 01 '23 10:08 Slach

All CHI in my test k8s cluster:

NAMESPACE    NAME          CLUSTERS   HOSTS   STATUS      AGE
infra-test   grif-chi-ss   1          3       Completed   386d

Programmeris avatar Aug 01 '23 12:08 Programmeris

Could you share kubectl get chi -n infra-test grif-chi-ss -o yaml, without sensitive credentials?

Slach avatar Aug 01 '23 12:08 Slach

@Programmeris, what is the reason for using outdated versions of Kubernetes and the operator? Please use the latest operator version, and also consider a Kubernetes upgrade. 1.19 reached EOL in 2021.

alex-zaitsev avatar Aug 02 '23 19:08 alex-zaitsev

@alex-zaitsev At the moment there is no way to update this particular Kubernetes cluster, for various reasons. An upgrade is planned for the future. However, I'm not sure my problem is related to an outdated version of k8s or the operator.

Programmeris avatar Aug 03 '23 12:08 Programmeris

I'm still waiting for the CHI resource: https://github.com/Altinity/clickhouse-operator/issues/1212#issuecomment-1660246872

Slach avatar Aug 03 '23 12:08 Slach

@Slach CHI yaml definition in my test k8s cluster (managed fields and credentials removed): chi.txt

At the moment the replicas are working fine. The problem is reproduced if you shut down all the replicas at once and then bring them up one by one, roughly as sketched below.
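
A hedged sketch of the sequence (the StatefulSet names and label selector follow the operator's defaults; the exact names in my cluster may differ):

# stop every replica at once (operator-managed StatefulSets carry this label)
kubectl -n infra-test scale sts -l clickhouse.altinity.com/chi=grif-chi-ss --replicas=0
# then bring the replicas back one at a time; the first one crash-loops
kubectl -n infra-test scale sts chi-grif-chi-ss-grif-facts-sgl-0-0 --replicas=1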

Programmeris avatar Aug 03 '23 12:08 Programmeris

The problem is reproduced if you shut down all the replicas at once and then bring them up one by one

Could you provide the commands that perform exactly this sequence?

Was the shared chi.txt edited manually?

Do you use kind: ClickHouseInstallationTemplate manifests?

Could you share kubectl get chit --all-namespaces?

Maybe you edited the default clickhouse-operator config?

Could you share kubectl get cm -n infra-test etc-clickhouse-operator-files -o yaml?

I see

status:
  fqdns:
  - chi-grif-chi-ss-test-facts-sgl-0-0.infra-test.svc.cluster.local
  - chi-grif-chi-ss-test-facts-sgl-0-1.infra-test.svc.cluster.local
  - chi-grif-chi-ss-test-facts-sgl-0-2.infra-test.svc.cluster.local

which does not correlate with

kind: ClickHouseInstallation
metadata:
  name: chi-grif-chi-ss
...
spec:
  configuration:
    clusters:
    - layout:
        replicasCount: 3
        shardsCount: 1
      name: grif-facts-sgl

Service names follow the default naming convention chi-{chi_name}-{cluster_name}-{shard_index}-{replica_index},

so it should be chi-chi-grif-chi-ss-grif-facts-sgl-0-0, which is different from chi-grif-chi-ss-test-facts-sgl-0-0.
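
One way to verify which service names actually exist (the label selector follows the operator's labeling convention; not verified against your cluster):

kubectl get svc -n infra-test -l clickhouse.altinity.com/chi=grif-chi-ss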

Slach avatar Aug 03 '23 14:08 Slach