KOperator is stuck in a rebalance disks loop

Open ilievladiulian opened this issue 1 year ago • 4 comments

Describe the bug: Koperator is stuck in a rebalance disks loop, which fails with the following message:

{
    "level":"error",
    "ts":"2022-08-25T08:20:33.243Z",
    "msg":"re-balancing disk(s) in Kafka cluster via Cruise Control failed",
    "controller":"CruiseControl",
    "controllerGroup":"kafka.banzaicloud.io",
    "controllerKind":"KafkaCluster",
    "kafkaCluster": {
        "name":"kafka-test",
        "namespace":"ns-kafka-test"
    },
    "namespace":"ns-kafka-test",
    "name":"kafka-test",
    "reconcileID":"2ef740c0-19ba-4091-bfc9-45f0a161a830",
    "operation":"rebalance disks",
    "brokers":["1","2","3"],
    "error":"json: cannot unmarshal number 1.0 into Go struct field BrokerLoadStats.brokers.NumCore of type int32",
    "stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234"
}

Steps to reproduce the issue:

  1. Use a Cruise Control version higher than 2.5.94
  2. Create a kafkacluster CR with one log dir per broker and apply it.
  3. Modify kafkacluster to have 2 log dirs per broker and apply it.
  4. Modify kafkacluster to have 1 log dir per broker and apply it.
  5. Koperator gets stuck in the rebalance disks operation with the error above.

Expected behavior: Koperator applies the changes without errors.

Additional context: The error above is caused by the rebalance disks call that Koperator makes to Cruise Control. After release 2.5.94, Cruise Control uses a double field instead of an int for the number of cores in host load and broker load responses (PR-1839). The go-cruise-control client still uses an int field (see here).
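The mismatch described above can be reproduced in isolation. The following is a minimal sketch, not the go-cruise-control code: the struct names `intStats` and `floatStats` and the JSON key `NumCore` are assumptions made for illustration, loosely following the field name in the error message. It shows that Go's `encoding/json` rejects `1.0` for an `int32` field, while a `float64` field accepts both `1.0` and `1`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// intStats is a hypothetical stand-in for a client struct that still
// declares the core count as an integer.
type intStats struct {
	NumCore int32 `json:"NumCore"`
}

// floatStats is the same hypothetical struct with the core count
// declared as a floating-point number.
type floatStats struct {
	NumCore float64 `json:"NumCore"`
}

// decodeInt parses the payload into the integer-typed struct.
func decodeInt(payload string) (int32, error) {
	var s intStats
	err := json.Unmarshal([]byte(payload), &s)
	return s.NumCore, err
}

// decodeFloat parses the payload into the float-typed struct.
func decodeFloat(payload string) (float64, error) {
	var s floatStats
	err := json.Unmarshal([]byte(payload), &s)
	return s.NumCore, err
}

func main() {
	newer := `{"NumCore": 1.0}` // shape reported for newer Cruise Control
	older := `{"NumCore": 1}`   // shape assumed for older Cruise Control

	// The integer field cannot hold the literal 1.0, so this returns an error.
	_, err := decodeInt(newer)
	fmt.Println("int32 field, newer payload:", err)

	// The float field decodes both number forms without error.
	for _, p := range []string{newer, older} {
		v, err := decodeFloat(p)
		fmt.Println("float64 field:", p, "->", v, err)
	}
}
```

Under these assumptions, switching the field to `float64` would also keep older responses decodable, since a JSON integer is a valid value for a Go float field.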

Update: added the full error message.

ilievladiulian, Aug 31 '22 12:08

Hi @ilievladiulian! Thanks for the bug report! If you are sure this is what's causing the issue, it seems like a pretty easy fix. Would you be interested in creating a PR addressing this?

Kuvesz, Aug 31 '22 13:08

Hi, @Kuvesz! As far as I can tell, the change in Cruise Control breaks compatibility with previous versions. Should the change in Koperator ensure backwards compatibility, or break it as well?

ilievladiulian, Aug 31 '22 13:08

Well, right now we don't support the Cruise Control version you mentioned (check supported versions here), but sooner or later we will have to. In that case it would be best to keep backwards compatibility by checking the Cruise Control version and branching on it. If that's not possible, we'll just have to follow suit and break compatibility.

Kuvesz, Aug 31 '22 13:08
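To make the version-based branching suggested above concrete, here is a rough sketch of a dotted-version comparison. It is not taken from Koperator or go-cruise-control: the helpers `parseVersion` and `versionAfter` are hypothetical, versions are assumed to be purely numeric and dot-separated, and the `2.5.94` cutoff is taken from the report in this issue rather than verified independently.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits a dotted version such as "2.5.101" into integer parts.
// Hypothetical helper; it rejects any non-numeric component.
func parseVersion(v string) ([]int, error) {
	fields := strings.Split(v, ".")
	parts := make([]int, 0, len(fields))
	for _, f := range fields {
		n, err := strconv.Atoi(f)
		if err != nil {
			return nil, fmt.Errorf("invalid version %q: %w", v, err)
		}
		parts = append(parts, n)
	}
	return parts, nil
}

// versionAfter reports whether version a is strictly newer than version b.
// Missing trailing components are treated as zero.
func versionAfter(a, b string) (bool, error) {
	pa, err := parseVersion(a)
	if err != nil {
		return false, err
	}
	pb, err := parseVersion(b)
	if err != nil {
		return false, err
	}
	for i := 0; i < len(pa) || i < len(pb); i++ {
		var x, y int
		if i < len(pa) {
			x = pa[i]
		}
		if i < len(pb) {
			y = pb[i]
		}
		if x != y {
			return x > y, nil
		}
	}
	return false, nil
}

func main() {
	// Cutoff as described in the issue; treat it as an assumption.
	const cutoff = "2.5.94"
	for _, v := range []string{"2.5.94", "2.5.101"} {
		newer, err := versionAfter(v, cutoff)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("Cruise Control %s expects float core count: %v\n", v, newer)
	}
}
```

Comparing components numerically matters here: a plain string comparison would rank `2.5.101` below `2.5.94`.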

I have made a PR for this (https://github.com/banzaicloud/go-cruise-control/pull/11) and tested it with CC 2.5.101; it works. The next Koperator version will contain this fix and many more.

bartam1, Sep 07 '22 21:09