venice
venice copied to clipboard
[controller][admin-tool] Streamline controller configs and harden update-store workflow
Streamline controller configs and harden update-store workflow
This PR unifies the UpdateStore
logic between child controller and parent controllers. We've faced many issues due to the structure of the code (especially VeniceParentHelixAdmin
) where a set of store-update operations puts it into a non-healthy state. These are all the changes in this PR:
- Add a new config in controllers to specify if the Venice deployment is in multi-region mode or single-region mode.
- Add a new class
UpdateStoreUtils
that is called by both parent and child controllers to apply the store update and perform the necessary validations. - Perform store config validation after applying all configs. This helps check for config compatibility. The following validations are performed currently.
- Batch-only:
- Inc push is not allowed
- WC is not allowed
- Hybrid:
- Rewind time in seconds cannot be negative
- Both offset lag threshold and producer time lag threshold cannot be negative
- In multi-region mode, inc push is not allowed for
NON_AGGREGATE
data replication policy -
ACTIVE_ACTIVE
data replication policy is only supported when Active-Active replication is enabled
- Storage quota cannot be less than 0
- Read quota cannot be less than 0
- Store cannot be active-active replication enabled if it is not also native-replication enabled
- Active-Active and write-compute are not supported when amplification factor is more than 1
- Store must have a partitioner-config set
- Validate if the partitioner can be built with the specified partitioner configs
- Latest superset value schema id must map to a valid value schema id
- Max compaction lag < min compaction lag
- ETL Proxy user must be set. (We might want to deprecate this and remove the corresponding tests, but we have plans to get KIF-based ETL that could benefit from the ETLConfig in store configs)
- Batch-only:
- Incremental push is not explicitly configurable anymore. It is inferred through other configs. Inc push is supported in the following cases:
- In a single-region setup, all hybrid stores are allowed to do incremental pushes
- In a multi-region setup, the store must be hybrid and DataReplicationPolicy is either of
ACTIVE_ACTIVE
,AGGREGATE
orNONE
- Added a new concept of "primary controller". This is the controller that is allowed to perform inferred updates. In a multi-region mode, this is the parent controller. In a single-region mode, this is a child controller.
- Previously each "feature" update incurred an update to the store metadata on Zk - and the corresponding watchers would get invoked. This change replaces all those updates with a single Zk update, which is done after the resulting configs are validated.
- The
ordinal
fromBackupStrategy
enum was being used to write to the admin channel. This is problematic when the enums evolve. Added agetValue
function and used that instead. - Use
multi.region
config to control certain features instead of controlling them via explicit configs.- Native-replication - Enabled in multi-region mode. Disabled in single-region mode.
- Admin channel consumption - Enabled in multi-region mode. Disabled in single-region mode.
- This is still allowed to be disabled via
LiveConfig
for store migration purposes.
- This is still allowed to be disabled via
- Controller allowing BATCH push via API - Disabled in multi-region mode. Enabled in single-region mode.
- Active-active replication enabled on a controller - Enabled in multi-region mode. Disabled in single-region mode.
- The codes for enabling/disabling native replication for cluster have been removed since they are no-longer honored.
- The following controller configs have also been removed:
-
enable.native.replication.for.batch.only
-
enable.native.replication.for.hybrid
-
enable.native.replication.as.default.for.batch.only
-
enable.native.replication.as.default.for.hybrid
-
enable.active.active.replication.as.default.for.batch.only.store
-
admin.topic.remote.consumption.enabled
-
enable.incremental.push.for.hybrid.active.active.user.stores
-
active.active.enabled.on.controller
-
controller.enable.batch.push.from.admin.in.child
-
- A new config:
controller.external.superset.schema.generation.enabled
has been added to replacecontroller.parent.external.superset.schema.generation.enabled
. since external superset schema generation must be allowed in single-region mode also.controller.parent.external.superset.schema.generation.enabled
has been marked deprecated, but it has not been completely removed yet. - Due to the changes to inferred configs, a large chunk of tests started failing as they had incorrect setups. Fixed most of such test setups. Some examples are:
- Parent and child regions sharing same Kafka cluster
- Setting up multi-region mode by explicitly creating parent Zk, controllers and Kafka clusters instead of using
VeniceTwoLayerMultiRegionMultiClusterWrapper
- Creating stores and doing pushes directly in child regions
-
TestFabricBuildout
does this and tests for admin execution id. This is incorrect as that is not how a new region will get added. We need the test setup to support adding blank regions so that we can simulate the true fabric buildout process. Because of this, I've disabled these tests for now.
-
Some side-effects of this change are:
- The new workflow now checks if the store schema can support WC at the time of enabling the WC config. There were a few tests that also used unsupported value schemas for WC.
- Persona validation was only present in parent controller in multi-region mode. This now works in single-region mode as well
-
SupersetSchemaGeneratorWithCustomProp
had a bug where if the first schema has a custom prop, the future superset schema generation would fail as Avro doesn't allow overriding custom props. This got caught as the update store logic now also tries creating superset schema if a store enabled RC or WC, or if it previously had a superset schema.
There are a few other changes that we should do, but are not done in this PR: All major operations should only be allowed on the parent controller - Create a store, delete a store, add schemas. We should exclude some system stores from this check like we have for the check allowing batch push to admin in child
Recommendation for Reviewers
I recommend going through the changes in at least two passes. In the first pass, look through all the stuff that has been purely deleted. (Skip ParentControllerConfigUpdateUtils
as that has been renamed to PrimaryControllerConfigUpdateUtils
and it's contents partially moved to UpdateStoreUtils
. So, GitHub doesn't detect is as a renaming). While doing this pass, you can also glance through various small changes that do not need much pondering over.
In the second pass, follow this review order:
-
VeniceControllerClusterConfig
,VeniceControllerConfig
,VeniceControllerMultiClusterConfig
-
VeniceControllerService
-
Admin
-
UpdateStoreUtils
-
PrimaryControllerConfigUpdateUtils
6.AdminUtils
-
UpdateStoreWrapper
-
VeniceHelixAdmin
-
VeniceParentHelixAdmin
- All the test modifications
- All newly added tests
- Scripts and doc file changes:
-
clients/venice-pulsar/readme.md
-
docker/venice-client/create-store.sh
-
How was this PR tested?
GH CI
Does this PR introduce any user-facing changes?
- [X] No. You can skip the rest of this section.
- [ ] Yes. Make sure to explain your proposed changes and call out the behavior change.
Thank you for the recommendations to the reviewer. Is it still possible to break down this PR? Even with the recommendations, it seems quite complex to go about the review unless there is some context I am missing which forces the way this PR is structured.
Thank you for the recommendations to the reviewer. Is it still possible to break down this PR? Even with the recommendations, it seems quite complex to go about the review unless there is some context I am missing which forces the way this PR is structured.
Can you add a section in the PR description that talks about compatibility characteristics of this PR? Looks like we are deprecating lot of features and also introducing new ways or inferring to accomplish some of the existing feature (incremental push etc)
Would be good to explicitly callout what are some of the considerations w.r.t backward compatibility and forward compatibility in the context of the PR and release notes that describe these to end users (where applicable)
@nisargthakkar I wonder whether it is still possible to split the PR into multiple smaller ones. Reviewing a large (giant) PR can easily miss some important logics as there are just too many changes altogether. Since this is the critical code path, I think it is worth spending some effort to improve the review experience, so that we can catch the potential regressions.
I wonder whether it is still possible to split the PR into multiple smaller ones. Reviewing a large (giant) PR can easily miss some important logics as there are just too many changes altogether. Since this is the critical code path, I think it is worth spending some effort to improve the review experience, so that we can catch the potential regressions.
Yes, I've been trying to see how I can break this down. Even if I break it down, I expect at least one PR to be very big. Let me see how that goes
Yes, I've been trying to see how I can break this down. Even if I break it down, I expect at least one PR to be very big. Let me see how that goes
I've merged two PRs related to this: https://github.com/linkedin/venice/pull/1067, https://github.com/linkedin/venice/pull/1070 I have opened another one that should be the biggest of the lot: https://github.com/linkedin/venice/pull/1076
These three would cover the "Streamline controller configs" part of this PR.
After this, I can create a final PR for the "harden update-store workflow"
So we just need to review https://github.com/linkedin/venice/pull/1091 and this PR is no longer needed, correct?