KAFKA-13504: Retry connect internal topics' creation in case of InvalidReplicationFactorException
More detailed description of your change, if necessary. The PR title and PR message become the squashed commit message, so use a separate comment to ping reviewers.
Summary of testing strategy (including rationale) for the feature or bug fix. Unit and/or integration tests are expected for any behaviour change and system tests should be considered for larger changes.
Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)
Thanks @akatona84 for the PR. In order to add new configurations, we need to have a KIP
Hey, thanks for checking. Any ideas/suggestions for avoiding (too many) new configs because of the retry?
Sorry for the delay, but this finally made it back to the top of my review queue.
I'm not sure we necessarily need to do something here. Starting Connect from scratch when not enough brokers are running must be a pretty rare scenario (maybe more common when developing). Connect has no idea when, or even if, enough brokers will join the cluster, so it's hard to decide whether to keep retrying or to just stop and let the environment (for example k8s) handle the restarts.
If you definitely want Connect to do some retries, maybe we could retry for default.api.timeout.ms. I typically find setting a bound as a duration more intuitive than setting a number of retries. Also, having separate retries settings for each topic may be a bit overkill :)
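For illustration, a minimal sketch of what duration-bound retries could look like. This is not the actual patch; the helper name `createWithRetry`, its parameters, and the retry policy are assumptions made up here for discussion:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.errors.InvalidReplicationFactorException;
import org.apache.kafka.common.errors.TopicExistsException;

public class TopicCreationRetry {

    /**
     * Keep retrying topic creation until the deadline passes, sleeping backoffMs
     * between attempts. InvalidReplicationFactorException is treated as retriable
     * because it usually means not all brokers have joined the cluster yet.
     */
    public static void createWithRetry(Admin admin, NewTopic topic,
                                       Duration timeout, long backoffMs)
            throws InterruptedException, ExecutionException {
        long deadline = System.currentTimeMillis() + timeout.toMillis();
        while (true) {
            try {
                admin.createTopics(Collections.singleton(topic)).all().get();
                return;
            } catch (ExecutionException e) {
                if (e.getCause() instanceof TopicExistsException) {
                    return; // someone else created it in the meantime
                }
                boolean retriable = e.getCause() instanceof InvalidReplicationFactorException;
                if (!retriable || System.currentTimeMillis() >= deadline) {
                    throw e; // give up and surface the original failure
                }
                Thread.sleep(backoffMs);
            }
        }
    }
}
```

The point of the sketch is that the loop is bounded by a single deadline rather than per-topic retry counts, which keeps the configuration surface small.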
I totally agree that I have too many configs :D
Yeah, it's happening quite often when testing: just starting the brokers and the Connect cluster, and the health check of Connect alerts due to its unexpected death. I know it should be the orchestrator's duty to start Connect after the brokers are fine, yet I thought it would be nice to make it more robust. I tried to rebase and saw the KIP-618 efforts on trunk; I'll try again to rebase and rewrite it to use time-bound retries. It would still need two configs, the timeout and the backoff, as I see default.api.timeout.ms is not defined for the worker/distributed config.
tbh I'm not 100% sure this robustness is worth the effort, I'll think about it.
@urbandan , @lhunyady I've reworked the change, now it has the timeout exception and timeout based retries (instead of retry counts). Could you guys take a look please?
@akatona84 Thanks for taking another look and simplifying the configuration, I think it looks much better. This will still need a KIP.
@akatona84 would you consider using default.api.timeout.ms and a backoff based on retry.backoff.ms? I can't really justify having these new configs introduced specifically for this until we realize we really need them. At that point someone can create a KIP.
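As an illustration of reusing the existing admin client settings instead of introducing new worker configs, here is a hypothetical wiring that feeds default.api.timeout.ms and retry.backoff.ms into the `createWithRetry` sketch above; the topic name, partition count, and replication factor are only examples:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateConnectTopics {
    public static void main(String[] args) throws InterruptedException, ExecutionException {
        // Settings the admin client already understands; no new configs required.
        Map<String, Object> adminProps = new HashMap<>();
        adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        adminProps.put(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, 60_000);
        adminProps.put(AdminClientConfig.RETRY_BACKOFF_MS_CONFIG, 500L);

        // Bound the retry loop with the same values the admin client uses.
        Duration timeout = Duration.ofMillis(
                (Integer) adminProps.get(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG));
        long backoffMs = (Long) adminProps.get(AdminClientConfig.RETRY_BACKOFF_MS_CONFIG);

        try (Admin admin = Admin.create(adminProps)) {
            // Example internal topic; the real worker derives name/partitions/replication
            // from its own configuration.
            NewTopic configTopic = new NewTopic("connect-configs", 1, (short) 3);
            TopicCreationRetry.createWithRetry(admin, configTopic, timeout, backoffMs);
        }
    }
}
```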
@mimaison, @viktorsomogyi I modified the change as you suggested, I'm curious how you like it this way. Sorry for the long delay.
test this please
Does that KIP label still apply?
rebased
@akatona84 seems like there is a compile issue. Please fix it.
Thanks. Rebased again and did some adjustments. (This time I built it and ran my modified unit tests as well :) )
Thank you @akatona84 for the contribution!