kafka-operator icon indicating copy to clipboard operation
kafka-operator copied to clipboard

Migrate kafka-operator to use lb-operator for exposing brokers

Open nightkr opened this issue 3 years ago • 8 comments

Description

This is far from final (for example, we'd need to expose a configurable LoadBalancerClass, for example), but I wanted to see whether lb-op would be a good fit for this operator first.

Review Checklist

  • [ ] Code contains useful comments
  • [ ] CRD change approved (or not applicable)
  • [ ] (Integration-)Test cases added (or not applicable)
  • [ ] Documentation added (or not applicable)
  • [ ] Changelog updated (or not applicable)
  • [ ] Cargo.toml only contains references to git tags (not specific commits or branches)
  • [ ] Helm chart can be installed and deployed operator works (or not applicable)

Once the review is done, comment bors r+ (or bors merge) to merge. Further information

nightkr avatar Aug 04 '22 14:08 nightkr

Blocked on https://github.com/stackabletech/lb-operator/pull/1

nightkr avatar Aug 24 '22 12:08 nightkr

Tested and works!

maltesander avatar Aug 30 '22 10:08 maltesander

I've moved this back into "Ready for Development" because the Listener operator is now merged. AFAIU this should now be ready to be worked on again?

lfrancke avatar Oct 24 '22 20:10 lfrancke

I've moved this back into "Ready for Development" because the Listener operator is now merged. AFAIU this should now be ready to be worked on again?

This pull request is updated as requested by stackabletech/listener-operator#5. When stackabletech/operator-rs#496 is merged and operator-rs and the listener-operator are released then this pull request can be made ready for review.

siegfriedweber avatar Oct 26 '22 11:10 siegfriedweber

Tests currently fail due to https://github.com/stackabletech/listener-operator/issues/110.

nightkr avatar Sep 12 '23 14:09 nightkr

The tests now pass!

nightkr avatar Sep 20 '23 00:09 nightkr

@nightkr do you plan to merge this or add the WIP functionality of secret-op beforehand, so that the kafka broker certificates will include the listener addresses? It'd be awesome if we get it right in the first run :)

sbernauer avatar Sep 20 '23 12:09 sbernauer

Preferably not, since we're still working out a few API kinks there (truststore passwords), and I'd rather keep this PR focused. ^^

nightkr avatar Sep 20 '23 13:09 nightkr

@nightkr given you opened https://github.com/stackabletech/operator-rs/pull/858 (reviewed it), I would be very in favor of putting the listenerVolume scope on the secret-op mount, so that we have valid certificates. This would be great for a customer of us ;)

sbernauer avatar Sep 06 '24 06:09 sbernauer

Yes, that was why I opened that PR ^^

nightkr avatar Sep 06 '24 12:09 nightkr

Current (known) issues:

  1. Need to cut an op-rs release
  2. Only online brokers are listed in discovery ConfigMap when using a nodeport listenerclass
    • Brokers that are down will be missed by clients that don't update their ConfigMap regularly
    • Clients that restart on configmap change (like our CM restarter service) will be restarted whenever a broker goes down or up
  3. Discovery ConfigMap when using a ClusterIP listenerclass only exposes K8s-managed LB endpoint.. means more resilience to cluster rescaling but clients are stuck using K8s' native load balancing rather than client side LB/failover.
  4. Bootstrap listener is currently global for the whole cluster, so can't apply rolegroup specific config.

2 and 3 come from currently using a single shared bootstrap listener.. simpler for kafka-op and arguably where we want the UX to go, but means we'll probably want to fix it.

Arguably 3 is desirable, since this only affects initial bootstrap anyway.

nightkr avatar Sep 09 '24 14:09 nightkr

https://github.com/stackabletech/operator-rs/issues/861 is also a pretty blocking issue for this.

nightkr avatar Sep 11 '24 10:09 nightkr

This isn't ready to merge yet (there are still open blockers), but the gist should be reviewable now.

nightkr avatar Sep 11 '24 12:09 nightkr

Clients should only configure the bootstrap address(es), the Kafka client is then responsible for resolving that to the relevant replicas.

nightkr avatar Oct 02 '24 10:10 nightkr

Agreed. I'm a bit torn between "we should label everything" and "I don't want everything to show in stacklet list". I'm happy to merge as-is, but than please add a comment above Labels::new() why we choose to not label the broker Listener objects.

sbernauer avatar Oct 02 '24 10:10 sbernauer

Just for the record: I think there is value in having the metrics endpoint for every individual broker, but as most customers will probably use Prometheus ServiceMonitor this is rather low prio

sbernauer avatar Oct 02 '24 10:10 sbernauer

Somewhat related: Is it intentional that the simple-kafka-broker-default-bootstrap Listener contains the ingressAddress three times in the status? Do we want to de-duplicate that list in listener-op?

apiVersion: listeners.stackable.tech/v1alpha1
kind: Listener
metadata:
  name: simple-kafka-broker-default-bootstrap
  namespace: default
# ...
spec:
  className: external-stable
  extraPodSelectorLabels: {}
  ports:
  - name: metrics
    port: 9606
    protocol: TCP
  - name: kafka
    port: 9092
    protocol: TCP
  publishNotReadyAddresses: null
status:
  ingressAddresses:
  - address: 172.18.0.2
    addressType: IP
    ports:
      kafka: 32639
      metrics: 30178
  - address: 172.18.0.2
    addressType: IP
    ports:
      kafka: 32639
      metrics: 30178
  - address: 172.18.0.2
    addressType: IP
    ports:
      kafka: 32639
      metrics: 30178
  nodePorts:
    kafka: 32639
    metrics: 30178
  serviceName: simple-kafka-broker-default-bootstrap

sbernauer avatar Oct 02 '24 10:10 sbernauer

Somewhat related: Is it intentional that the simple-kafka-broker-default-bootstrap Listener contains the ingressAddress three times in the status?

That smells like a minor bug that I would suggest filing against list-op.

nightkr avatar Oct 02 '24 13:10 nightkr

That smells like a minor bug that I would suggest filing against list-op.

I was not sure if this was a bug. Created https://github.com/stackabletech/listener-operator/issues/229

sbernauer avatar Oct 02 '24 13:10 sbernauer

Tests passed for 291308109e40448f238742ec7a7b0541f608d11d: https://testing.stackable.tech/job/kafka-operator-it-custom/8/, https://testing.stackable.tech/job/kafka-operator-it-custom/9/

nightkr avatar Oct 04 '24 10:10 nightkr

Thanks, LGTM! Hopefully last remaining question: We do we have a bootstrap service per rolegroup? Don't all bootstrap services return all brokers (from all rolegroups) regardless? Why aren't we using a single bootstrap for the entire KafkaCluster?

sbernauer avatar Oct 07 '24 09:10 sbernauer

Once connected to the bootstrap service, yes, you'll get all brokers. It's mostly a matter of consistency and "maaaybe this will be useful in some odd networking context, somewhere?".

nightkr avatar Oct 07 '24 09:10 nightkr