KAFKA-13504: Retry connect internal topics' creation in case of InvalidReplicationFactorException
More detailed description of your change, if necessary. The PR title and PR message become the squashed commit message, so use a separate comment to ping reviewers.
Summary of testing strategy (including rationale) for the feature or bug fix. Unit and/or integration tests are expected for any behaviour change and system tests should be considered for larger changes.
Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)
Thanks @akatona84 for the PR. In order to add new configurations, we need to have a KIP
Hey, thanks for checking. Any ideas/suggestions for avoiding (too many) new configs because of the retry?
Sorry for the delay, but this finally made it back to the top of my review queue.
I'm not sure we necessarily need to do something here. Starting Connect from scratch when not enough brokers are running must be a pretty rare scenario (maybe more common when developing). Connect has no idea when, or even if, enough brokers will join the cluster, so it's hard to decide whether to keep retrying or to just stop and let the environment (for example k8s) handle the restarts.
If you definitely want Connect to do some retries, maybe we could retry for default.api.timeout.ms. I typically find setting a bound as a duration more intuitive than setting a number of retries. Also, having separate retries settings for each topic may be a bit overkill :)
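For illustration, a minimal sketch of what duration-bound retries could look like. This is not the actual patch; the helper name `createWithRetry`, its parameters, and the retry policy are assumptions made up here for discussion:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.errors.InvalidReplicationFactorException;
import org.apache.kafka.common.errors.TopicExistsException;

public class TopicCreationRetry {

    /**
     * Keep retrying topic creation until the deadline passes, sleeping backoffMs
     * between attempts. InvalidReplicationFactorException is treated as retriable
     * because it usually means not all brokers have joined the cluster yet.
     */
    public static void createWithRetry(Admin admin, NewTopic topic,
                                       Duration timeout, long backoffMs)
            throws InterruptedException, ExecutionException {
        long deadline = System.currentTimeMillis() + timeout.toMillis();
        while (true) {
            try {
                admin.createTopics(Collections.singleton(topic)).all().get();
                return;
            } catch (ExecutionException e) {
                if (e.getCause() instanceof TopicExistsException) {
                    return; // someone else created it in the meantime
                }
                boolean retriable = e.getCause() instanceof InvalidReplicationFactorException;
                if (!retriable || System.currentTimeMillis() >= deadline) {
                    throw e; // give up and surface the original failure
                }
                Thread.sleep(backoffMs);
            }
        }
    }
}
```

The point of the sketch is that the loop is bounded by a single deadline rather than per-topic retry counts, which keeps the configuration surface small.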
I totally agree that I have too many configs :D
Yeah, it's happening quite often when testing: just starting the brokers and the Connect cluster, and the health check of Connect alerts due to its unexpected death. I know it should be the orchestrator's duty to start Connect after the brokers are fine, yet I thought it would be nice to make it more robust. I tried to rebase and saw the KIP-618 efforts on trunk; I'll try again to rebase and rewrite it to use time-bound retries. It would still need two configs, the timeout and the backoff, as I see default.api.timeout.ms is not defined for the worker/distributed config.
tbh I'm not 100% sure this robustness is worth the effort, I'll think about it.
@urbandan , @lhunyady I've reworked the change, now it has the timeout exception and timeout based retries (instead of retry counts). Could you guys take a look please?
@akatona84 Thanks for taking another look and simplifying the configuration, I think it looks much better. This will still need a KIP.
@akatona84 would you consider using default.api.timeout.ms and a backoff based on retry.backoff.ms? I can't really justify having these new configs introduced specifically for this until we realize we really need them. At that point someone can create a KIP.
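As an illustration of reusing the existing admin client settings instead of introducing new worker configs, here is a hypothetical wiring that feeds default.api.timeout.ms and retry.backoff.ms into the `createWithRetry` sketch above; the topic name, partition count, and replication factor are only examples:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateConnectTopics {
    public static void main(String[] args) throws InterruptedException, ExecutionException {
        // Settings the admin client already understands; no new configs required.
        Map<String, Object> adminProps = new HashMap<>();
        adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        adminProps.put(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, 60_000);
        adminProps.put(AdminClientConfig.RETRY_BACKOFF_MS_CONFIG, 500L);

        // Bound the retry loop with the same values the admin client uses.
        Duration timeout = Duration.ofMillis(
                (Integer) adminProps.get(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG));
        long backoffMs = (Long) adminProps.get(AdminClientConfig.RETRY_BACKOFF_MS_CONFIG);

        try (Admin admin = Admin.create(adminProps)) {
            // Example internal topic; the real worker derives name/partitions/replication
            // from its own configuration.
            NewTopic configTopic = new NewTopic("connect-configs", 1, (short) 3);
            TopicCreationRetry.createWithRetry(admin, configTopic, timeout, backoffMs);
        }
    }
}
```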
@mimaison, @viktorsomogyi I modified the change as you suggested, I'm curious how you like it this way. Sorry for the long delay.
test this please
Does that KIP label still apply?
rebased
@akatona84 seems like there is a compile issue. Please fix it.
Thanks. Rebased again and did some adjustments. (This time I built it and ran my modified unit tests as well :) )
Thank you @akatona84 for the contribution!