
Adding an invalid label causes whole cluster to be removed

Open alexvanolst opened this issue 1 year ago • 4 comments

Operator version: 0.23.5

Adding an invalid label to a pod template eventually causes the operator to delete all StatefulSets during reconciliation, regardless of the settings.

I have the following settings:

    runtime:
        reconcileCHIsThreadsNumber: 10
        reconcileShardsThreadsNumber: 5
        reconcileShardsMaxConcurrencyPercent: 50
        threadsNumber: 0
    statefulSet:
        create:
            onFailure: abort
        update:
            timeout: 300
            pollInterval: 5
            onFailure: rollback
    host:
        wait:
            exclude: "true"
            queries: "true"
            include: "false"

After adding an invalid label under spec.templates.podTemplates[0].metadata.labels, e.g. some_bad_label: '/metrics', the operator tries to recreate the StatefulSets but encounters the following error:

E0508 13:36:35.377188       1 creator.go:46] createStatefulSet():StatefulSet create failed. err: StatefulSet.apps "chi-clickhouse-store-0-0" is invalid: spec.template.labels: Invalid value: "/metrics": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')
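A pre-flight check along the following lines could catch such a value before the manifest is applied. This is only a minimal sketch: the regex is copied from the error message above, the 63-character limit is the standard Kubernetes limit for label values, and the function names `is_valid_label_value` and `invalid_labels` are hypothetical helpers, not part of the operator. It checks label values only, not label keys.

```python
import re

# Regex copied verbatim from the Kubernetes validation error quoted above.
LABEL_VALUE_RE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")
# Kubernetes limits label values to 63 characters.
MAX_LABEL_VALUE_LENGTH = 63


def is_valid_label_value(value: str) -> bool:
    """Return True if `value` matches the label-value rules from the error message."""
    if len(value) > MAX_LABEL_VALUE_LENGTH:
        return False
    # fullmatch: the whole string must match, as in API server validation.
    return LABEL_VALUE_RE.fullmatch(value) is not None


def invalid_labels(labels: dict[str, str]) -> dict[str, str]:
    """Return the subset of `labels` whose values would be rejected."""
    return {k: v for k, v in labels.items() if not is_valid_label_value(v)}


if __name__ == "__main__":
    # Hypothetical labels taken from a pod template's metadata.
    labels = {"some_bad_label": "/metrics", "app": "clickhouse"}
    for key, value in invalid_labels(labels).items():
        print(f"invalid label value for {key!r}: {value!r}")
```

Running such a check in CI against the pod template labels would flag '/metrics' before the operator ever starts reconciling.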

Expected behavior: after failing to create the StatefulSet, the operator either aborts or rolls back.

Actual behavior: after some time, the operator moves on to the next StatefulSet, until all of them are deleted (and none are recreated, because of the error).

alexvanolst, May 17 '24 10:05

Please check these behaviors: https://github.com/Altinity/clickhouse-operator/blob/bbbf66a8e0fbbcf36b787a63eceeaca37e0ec272/config/config.yaml#L256

sunsingerus, May 17 '24 15:05

Try to modify

    update:
        onFailure: rollback

to

    update:
        onFailure: abort

sunsingerus, May 17 '24 15:05

rollback needs to be checked

sunsingerus, May 17 '24 15:05

@sunsingerus

I checked this with

    update:
        onFailure: abort

and I still get exactly the same behavior. After about 15 minutes it continues.

alexvanolst, May 20 '24 15:05