venice icon indicating copy to clipboard operation
venice copied to clipboard

[controller][admin-tool] Streamline controller configs and harden update-store workflow

Open nisargthakkar opened this issue 10 months ago • 6 comments

Streamline controller configs and harden update-store workflow

This PR unifies the UpdateStore logic between child controller and parent controllers. We've faced many issues due to the structure of the code (especially VeniceParentHelixAdmin) where a set of store-update operations puts it into a non-healthy state. These are all the changes in this PR:

  1. Add a new config in controllers to specify if the Venice deployment is in multi-region mode or single-region mode.
  2. Add a new class UpdateStoreUtils that is called by both parent and child controllers to apply the store update and perform the necessary validations.
  3. Perform store config validation after applying all configs. This helps check for config compatibility. The following validations are performed currently.
    1. Batch-only:
      1. Inc push is not allowed
      2. WC is not allowed
    2. Hybrid:
      1. Rewind time in seconds cannot be negative
      2. Both offset lag threshold and producer time lag threshold cannot be negative
      3. In multi-region mode, inc push is not allowed for NON_AGGREGATE data replication policy
      4. ACTIVE_ACTIVE data replication policy is only supported when Active-Active replication is enabled
    3. Storage quota cannot be less than 0
    4. Read quota cannot be less than 0
    5. Store cannot be active-active replication enabled if it is not also native-replication enabled
    6. Active-Active and write-compute are not supported when amplification factor is more than 1
    7. Store must have a partitioner-config set
    8. Validate if the partitioner can be built with the specified partitioner configs
    9. Latest superset value schema id must map to a valid value schema id
    10. Max compaction lag < min compaction lag
    11. ETL Proxy user must be set. (We might want to deprecate this and remove the corresponding tests, but we have plans to get KIF-based ETL that could benefit from the ETLConfig in store configs)
  4. Incremental push is not explicitly configurable anymore. It is inferred through other configs. Inc push is supported in the following cases:
    1. In a single-region setup, all hybrid stores are allowed to do incremental pushes
    2. In a multi-region setup, the store must be hybrid and DataReplicationPolicy is either of ACTIVE_ACTIVE, AGGREGATE or NONE
  5. Added a new concept of "primary controller". This is the controller that is allowed to perform inferred updates. In a multi-region mode, this is the parent controller. In a single-region mode, this is a child controller.
  6. Previously each "feature" update incurred an update to the store metadata on Zk - and the corresponding watchers would get invoked. This change replaces all those updates with a single Zk update, which is done after the resulting configs are validated.
  7. The ordinal from BackupStrategy enum was being used to write to the admin channel. This is problematic when the enums evolve. Added a getValue function and used that instead.
  8. Use multi.region config to control certain features instead of controlling them via explicit configs.
    1. Native-replication - Enabled in multi-region mode. Disabled in single-region mode.
    2. Admin channel consumption - Enabled in multi-region mode. Disabled in single-region mode.
      1. This is still allowed to be disabled via LiveConfig for store migration purposes.
    3. Controller allowing BATCH push via API - Disabled in multi-region mode. Enabled in single-region mode.
    4. Active-active replication enabled on a controller - Enabled in multi-region mode. Disabled in single-region mode.
  9. The codes for enabling/disabling native replication for cluster have been removed since they are no-longer honored.
  10. The following controller configs have also been removed:
    1. enable.native.replication.for.batch.only
    2. enable.native.replication.for.hybrid
    3. enable.native.replication.as.default.for.batch.only
    4. enable.native.replication.as.default.for.hybrid
    5. enable.active.active.replication.as.default.for.batch.only.store
    6. admin.topic.remote.consumption.enabled
    7. enable.incremental.push.for.hybrid.active.active.user.stores
    8. active.active.enabled.on.controller
    9. controller.enable.batch.push.from.admin.in.child
  11. A new config: controller.external.superset.schema.generation.enabled has been added to replace controller.parent.external.superset.schema.generation.enabled. since external superset schema generation must be allowed in single-region mode also. controller.parent.external.superset.schema.generation.enabled has been marked deprecated, but it has not been completely removed yet.
  12. Due to the changes to inferred configs, a large chunk of tests started failing as they had incorrect setups. Fixed most of such test setups. Some examples are:
    1. Parent and child regions sharing same Kafka cluster
    2. Setting up multi-region mode by explicitly creating parent Zk, controllers and Kafka clusters instead of using VeniceTwoLayerMultiRegionMultiClusterWrapper
    3. Creating stores and doing pushes directly in child regions
      1. TestFabricBuildout does this and tests for admin execution id. This is incorrect as that is not how a new region will get added. We need the test setup to support adding blank regions so that we can simulate the true fabric buildout process. Because of this, I've disabled these tests for now.

Some side-effects of this change are:

  1. The new workflow now checks if the store schema can support WC at the time of enabling the WC config. There were a few tests that also used unsupported value schemas for WC.
  2. Persona validation was only present in parent controller in multi-region mode. This now works in single-region mode as well
  3. SupersetSchemaGeneratorWithCustomProp had a bug where if the first schema has a custom prop, the future superset schema generation would fail as Avro doesn't allow overriding custom props. This got caught as the update store logic now also tries creating superset schema if a store enabled RC or WC, or if it previously had a superset schema.

There are a few other changes that we should do, but are not done in this PR: All major operations should only be allowed on the parent controller - Create a store, delete a store, add schemas. We should exclude some system stores from this check like we have for the check allowing batch push to admin in child

Recommendation for Reviewers

I recommend going through the changes in at least two passes. In the first pass, look through all the stuff that has been purely deleted. (Skip ParentControllerConfigUpdateUtils as that has been renamed to PrimaryControllerConfigUpdateUtils and it's contents partially moved to UpdateStoreUtils. So, GitHub doesn't detect is as a renaming). While doing this pass, you can also glance through various small changes that do not need much pondering over.

In the second pass, follow this review order:

  1. VeniceControllerClusterConfig, VeniceControllerConfig, VeniceControllerMultiClusterConfig
  2. VeniceControllerService
  3. Admin
  4. UpdateStoreUtils
  5. PrimaryControllerConfigUpdateUtils 6.AdminUtils
  6. UpdateStoreWrapper
  7. VeniceHelixAdmin
  8. VeniceParentHelixAdmin
  9. All the test modifications
  10. All newly added tests
  11. Scripts and doc file changes:
    1. clients/venice-pulsar/readme.md
    2. docker/venice-client/create-store.sh

How was this PR tested?

GH CI

Does this PR introduce any user-facing changes?

  • [X] No. You can skip the rest of this section.
  • [ ] Yes. Make sure to explain your proposed changes and call out the behavior change.

nisargthakkar avatar Apr 16 '24 13:04 nisargthakkar

Thank you for the recommendations to the reviewer. Is it still possible to break down this PR? Even with the recommendations, it seems quite complex to go about the review unless there is some context I am missing which forces the way this PR is structured.

mynameborat avatar Apr 18 '24 16:04 mynameborat

Thank you for the recommendations to the reviewer. Is it still possible to break down this PR? Even with the recommendations, it seems quite complex to go about the review unless there is some context I am missing which forces the way this PR is structured.

Can you add a section in the PR description that talks about compatibility characteristics of this PR? Looks like we are deprecating lot of features and also introducing new ways or inferring to accomplish some of the existing feature (incremental push etc)

Would be good to explicitly callout what are some of the considerations w.r.t backward compatibility and forward compatibility in the context of the PR and release notes that describe these to end users (where applicable)

mynameborat avatar Apr 18 '24 17:04 mynameborat

@nisargthakkar I wonder whether it is still possible to split the PR into multiple smaller ones. Reviewing a large (giant) PR can easily miss some important logics as there are just too many changes altogether. Since this is the critical code path, I think it is worth spending some effort to improve the review experience, so that we can catch the potential regressions.

gaojieliu avatar Apr 24 '24 18:04 gaojieliu

I wonder whether it is still possible to split the PR into multiple smaller ones. Reviewing a large (giant) PR can easily miss some important logics as there are just too many changes altogether. Since this is the critical code path, I think it is worth spending some effort to improve the review experience, so that we can catch the potential regressions.

Yes, I've been trying to see how I can break this down. Even if I break it down, I expect at least one PR to be very big. Let me see how that goes

nisargthakkar avatar Apr 24 '24 18:04 nisargthakkar

Yes, I've been trying to see how I can break this down. Even if I break it down, I expect at least one PR to be very big. Let me see how that goes

I've merged two PRs related to this: https://github.com/linkedin/venice/pull/1067, https://github.com/linkedin/venice/pull/1070 I have opened another one that should be the biggest of the lot: https://github.com/linkedin/venice/pull/1076

These three would cover the "Streamline controller configs" part of this PR.

After this, I can create a final PR for the "harden update-store workflow"

nisargthakkar avatar Jul 24 '24 01:07 nisargthakkar

So we just need to review https://github.com/linkedin/venice/pull/1091 and this PR is no longer needed, correct?

adamxchen avatar Aug 08 '24 16:08 adamxchen