
Support an instance count/percentage-based grayscale strategy for containerized scenarios

Open whl12345 opened this issue 3 years ago • 20 comments

Grayscale rules currently have two implementations, by IP and by label, but neither is friendly to containerized scenarios. The main concerns are:

  1. The IP approach does not fit containerized environments: container restarts cause IP drift, so a stable grayscale cannot be guaranteed, and the gray configuration is lost whenever a single container restarts (unstable grayscale).
  2. The label approach tightly couples containers to configuration: every new configuration release requires changing the label and restarting the application before the gray rule can be configured, which raises the maintenance cost of configuration changes (expensive releases).

Could a grayscale rule based on a fixed number or percentage of instances be implemented? For example, with 20 container instances, a 10% rule would always keep 2 containers on the gray configuration. The impact of container drift also has to be considered: a non-gray container that restarts should still get the original configuration, and a gray container that restarts should still get the gray configuration.
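
A minimal sketch of the requested semantics, with hypothetical names (this is not an existing Apollo API): a percentage resolves to a fixed gray instance count, e.g. 20 instances at 10% always keep 2 containers on the gray configuration.

```java
public class GrayPercentageSketch {

  /** e.g. grayInstanceCount(20, 10) == 2; rounding up is an assumption of this sketch. */
  static int grayInstanceCount(int totalInstances, int percent) {
    return (totalInstances * percent + 99) / 100;
  }

  public static void main(String[] args) {
    System.out.println(grayInstanceCount(20, 10)); // prints 2
  }
}
```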

whl12345 avatar Oct 26 '21 08:10 whl12345

I think we could have a third grayscale rule to support the percentage scenario, but the implementation details need more discussion, e.g. how to keep a container fetching the grayscale configuration after a reboot.

nobodyiam avatar Oct 27 '21 00:10 nobodyiam

2 simple ideas:

  1. The ConfigService fetches the full instance list from the current main release, selects a subset as gray instances, and keeps the selection up to date with a scheduled task.
  2. Alternatively, decide at instance registration time whether the instance should receive the application's current gray configuration (a rough sketch follows this list).
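
A very rough sketch of idea 2, with hypothetical names (not real Apollo internals): when an instance registers, it is routed to the gray release only if it was already gray or a gray slot is still vacant.

```java
import java.util.Set;

public class RegistrationTimeGrayAssigner {

  /** Minimal persistence interface assumed for this sketch. */
  interface GrayInstanceStore {
    Set<String> currentGrayIps(String appId);
    void addGrayIp(String appId, String instanceIp);
  }

  private final GrayInstanceStore store;

  RegistrationTimeGrayAssigner(GrayInstanceStore store) {
    this.store = store;
  }

  /** Returns true if the newly registered instance should fetch the gray configuration. */
  synchronized boolean assignOnRegister(String appId, String instanceIp, int targetGrayAmount) {
    Set<String> gray = store.currentGrayIps(appId);
    if (gray.contains(instanceIp)) {
      return true;                        // a gray container that restarted stays gray
    }
    if (gray.size() < targetGrayAmount) {
      store.addGrayIp(appId, instanceIp); // fill a vacant gray slot with the new instance
      return true;
    }
    return false;                         // otherwise serve the main release
  }
}
```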

whl12345 avatar Oct 27 '21 06:10 whl12345

If the instance list is static, the percentage grayscale rule is easy to implement. However, if the instance list is changing, which is quite common considering deployment/rebooting/scaling scenarios, then it's quite challenging to keep the grayscale percentage and might bring unexpected behaviors. So I suggest we propose a detailed design before actually working on this.

nobodyiam avatar Oct 28 '21 00:10 nobodyiam

[Design diagram attachment: grayscale by instance count in the config center]

whl12345 avatar Nov 02 '21 08:11 whl12345

I sketched a simple grayscale-by-instance-count design: the user specifies on the portal how many instances of the application should be gray, and a background scheduled task periodically computes and updates the gray instance list.

whl12345 avatar Nov 02 '21 08:11 whl12345

This approach has a drawback compared with label-based releases: the gray instance IP list is not guaranteed to be stable. After a gray IP restarts, the rule may route to another instance, turning an originally normal instance into a gray one.

whl12345 avatar Nov 02 '21 08:11 whl12345

Does it mean we need to re-calculate the grayscale instance list every time the instance list changes (step 2)? That would make the implementation complicated. However, how about we provide this percentage rule only as a feature to help users maintain the grayscale IP list? e.g. when the user chooses to grayscale 10%, we look up the current instance list and pick 10% of the IPs; the user can accept the randomly chosen 10% or modify the list. The rule that actually gets persisted is simply that 10% IP list.
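
A minimal sketch of such a helper, with hypothetical names: fetch the current active IP list, randomly pick the requested percentage, let the user review or edit the result, and persist it as an ordinary IP-based gray rule.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class GrayIpListSuggester {

  /** e.g. 10% of 20 active IPs -> a random 2-IP suggestion the user can still modify. */
  static List<String> suggestGrayIps(List<String> activeIps, int percent) {
    List<String> shuffled = new ArrayList<>(activeIps);
    Collections.shuffle(shuffled);
    int count = (activeIps.size() * percent + 99) / 100; // round up; an assumption
    return shuffled.subList(0, Math.min(count, shuffled.size()));
  }
}
```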

nobodyiam avatar Nov 03 '21 00:11 nobodyiam

Does it mean we need to re-calculate the grayscale instance list every time the instance list changes (step 2)?

No, we use "grayscale amount" instead of "grayscale percent".

  1. When the user confirms the "grayscale amount", the gray instance set is fixed, and the gray instance list is shown on the web page, so the user knows which nodes are gray and which are normal.

  2. When a gray instance is shut down, we randomly choose a new gray instance and update the gray instance list on the web page (a rough sketch follows this list).

  3. When normal instances are shut down, nothing happens.
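
A rough sketch of these three rules, with hypothetical names (not actual Apollo code): the gray set only changes when a gray instance disappears, in which case a replacement is drawn at random from the remaining active instances.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class FixedAmountGraySet {

  private final Random random = new Random();

  Set<String> reconcile(Set<String> currentGray, List<String> activeIps, int grayAmount) {
    Set<String> next = new LinkedHashSet<>();
    for (String ip : currentGray) {
      if (activeIps.contains(ip)) {
        next.add(ip); // rule 1: confirmed gray instances stay gray while they are alive
      }
    }
    // rule 3: a normal instance shutting down never touches the set built above
    List<String> candidates = new ArrayList<>(activeIps);
    candidates.removeAll(next);
    while (next.size() < grayAmount && !candidates.isEmpty()) {
      // rule 2: replace a shut-down gray instance with a randomly chosen active instance
      next.add(candidates.remove(random.nextInt(candidates.size())));
    }
    return next;
  }
}
```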

yu-hailong avatar Nov 11 '21 05:11 yu-hailong

Also, in a k8s environment every node is the same. Users just want to know which nodes are gray instances and to check whether they work well under the grayscale configuration.

So we let the user set a "grayscale instance amount" and show them the gray instance list. We think this is enough for most users.

yu-hailong avatar Nov 11 '21 05:11 yu-hailong

Thanks for the clarification, the grayscale amount is an interesting idea. However, I still have some concerns about the dynamic grayscale lists, but I'm open to discussions. BTW, I think for k8s scenarios, the grayscale label is a better solution as it is more stable?

nobodyiam avatar Nov 12 '21 01:11 nobodyiam

For the grayscale amount rule, I have started development and testing; the design is as follows.

  1. A scheduled task on the portal selects, via the admin API, all gray rules whose type is grayscale amount.
  2. The task compares the current gray IPs with the active IPs. Gray IPs that are no longer active are removed, and the algorithm adds new active gray IPs so that the gray IP list size always equals the grayscale amount; new active IPs are chosen by the instance's dataChangeCreatedTime (see the sketch after this list).
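
A sketch of step 2, with hypothetical types (ordering newest-first by dataChangeCreatedTime is an assumption of this sketch): drop gray IPs that are no longer active, then top up the list from the active instances until it reaches the configured grayscale amount.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class GrayAmountReconciler {

  /** Minimal view of an active instance, assumed for this sketch. */
  static class ActiveInstance {
    final String ip;
    final long dataChangeCreatedTime; // creation time of the instance record, epoch millis

    ActiveInstance(String ip, long dataChangeCreatedTime) {
      this.ip = ip;
      this.dataChangeCreatedTime = dataChangeCreatedTime;
    }
  }

  static Set<String> reconcile(Set<String> grayIps, List<ActiveInstance> active, int grayAmount) {
    Set<String> next = new LinkedHashSet<>();
    for (ActiveInstance instance : active) {
      if (grayIps.contains(instance.ip)) {
        next.add(instance.ip); // keep gray IPs that are still active
      }
    }
    // fill up to grayAmount with the remaining active instances, newest created first
    List<ActiveInstance> candidates = new ArrayList<>(active);
    candidates.sort(Comparator.comparingLong((ActiveInstance i) -> i.dataChangeCreatedTime).reversed());
    for (ActiveInstance instance : candidates) {
      if (next.size() >= grayAmount) {
        break;
      }
      next.add(instance.ip); // adding an IP that is already gray is a no-op
    }
    return next;
  }
}
```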

One problem: when I use the /instances/by-release API to get the active IPs, the returned IPs are not stable, which makes it hard to choose new gray IPs. Why does the active instance list change so often? I suspect it has something to do with the Apollo design that treats instances reported within the last 2 minutes as active, while the instance report period is half a minute. It is difficult to understand why the list is unstable.

I don't know why the active instance list changes even though no reboot or scaling was triggered. Please give me some suggestions.

whl12345 avatar Nov 16 '21 02:11 whl12345

Thanks for the clarification, the grayscale amount is an interesting idea. However, I still have some concerns about the dynamic grayscale lists, but I'm open to discussions. BTW, I think for k8s scenarios, the grayscale label is a better solution as it is more stable?

Yes, but when the gray configuration changes, the grayscale label rule requires the user to restart the k8s deployment, which may roll all pods, making it hard to control the rollout step by step. The grayscale amount rule does not require restarting the deployment, so the steps are easier to control.

whl12345 avatar Nov 16 '21 02:11 whl12345

it may have something to do with the Apollo design that treats instances reported within the last 2 minutes as active

the current active instance definition is 25 hours, see

https://github.com/apolloconfig/apollo/blob/5ad5b410f0ff43111855d7bd76e047d08ab477b4/apollo-biz/src/main/java/com/ctrip/framework/apollo/biz/service/InstanceService.java#L155-L164
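
Paraphrasing the linked InstanceService logic (not a verbatim copy): an instance counts as active when its InstanceConfig record was touched within the last 25 hours, i.e. one day plus one extra hour of slack on top of the client's periodic report.

```java
import java.util.Calendar;
import java.util.Date;

public class ActiveInstanceWindow {

  /** Instances reported after this date are considered active. */
  static Date validInstanceDate() {
    Calendar cal = Calendar.getInstance();
    cal.add(Calendar.DATE, -1); // one day back
    cal.add(Calendar.HOUR, -1); // plus one more hour -> a 25-hour window in total
    return cal.getTime();
  }
}
```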

nobodyiam avatar Nov 17 '21 00:11 nobodyiam

Yes, but when the gray configuration changes, the grayscale label rule requires the user to restart the k8s deployment

I suppose the grayscale period should be temporary. So we may always keep some amount of pods with the grayscale label and then we could apply the grayscale label rule if we want to test some new configurations and then remove the grayscale label rule if the test is finished. In this case, there is no need to reboot the k8s deployments and we could fully reuse k8s to keep the grayscale amount for us.

nobodyiam avatar Nov 17 '21 00:11 nobodyiam

Yes, but when the gray configuration changes, the grayscale label rule requires the user to restart the k8s deployment

I suppose the grayscale period should be temporary. So we may always keep some amount of pods with the grayscale label and then we could apply the grayscale label rule if we want to test some new configurations and then remove the grayscale label rule if the test is finished. In this case, there is no need to reboot the k8s deployments and we could fully reuse k8s to keep the grayscale amount for us.

Sometimes we don't change our code and just want to change the configuration via gray rules. In that case we still need to reboot k8s.

whl12345 avatar Nov 17 '21 01:11 whl12345

it may have something to do with the Apollo design that treats instances reported within the last 2 minutes as active

the current active instance definition is 25 hours, see

https://github.com/apolloconfig/apollo/blob/5ad5b410f0ff43111855d7bd76e047d08ab477b4/apollo-biz/src/main/java/com/ctrip/framework/apollo/biz/service/InstanceService.java#L155-L164

Sorry, I found that someone had changed the rule from 25 hours to 2 minutes.

whl12345 avatar Nov 17 '21 02:11 whl12345

Yes, but when the gray configuration changes, the grayscale label rule requires the user to restart the k8s deployment

I suppose the grayscale period should be temporary. So we may always keep some amount of pods with the grayscale label and then we could apply the grayscale label rule if we want to test some new configurations and then remove the grayscale label rule if the test is finished. In this case, there is no need to reboot the k8s deployments and we could fully reuse k8s to keep the grayscale amount for us.

Sometimes we don't change our code and just want to change the configuration via gray rules. In that case we still need to reboot k8s.

I think we could always keep some amount of pods with a grayscale label and only apply the rules when we want to test the new configurations? In this case, we don't need to always reboot k8s? e.g. If some application has 10 pods, we always label 3 of them with a grayscale label. If there is no grayscale rule, then all the 10 pods receive the same configurations. If there is some new configuration we need to test, then we apply the grayscale rule to make 3 of them updated with the new configurations.

nobodyiam avatar Nov 19 '21 00:11 nobodyiam

Yes, but when the gray configuration changes, the grayscale label rule requires the user to restart the k8s deployment

I suppose the grayscale period should be temporary. So we may always keep some amount of pods with the grayscale label and then we could apply the grayscale label rule if we want to test some new configurations and then remove the grayscale label rule if the test is finished. In this case, there is no need to reboot the k8s deployments and we could fully reuse k8s to keep the grayscale amount for us.

Sometimes we don't change our code and just want to change the configuration via gray rules. In that case we still need to reboot k8s.

I think we could always keep some amount of pods with a grayscale label and only apply the rules when we want to test the new configurations? In this case, we don't need to always reboot k8s? e.g. If some application has 10 pods, we always label 3 of them with a grayscale label. If there is no grayscale rule, then all the 10 pods receive the same configurations. If there is some new configuration we need to test, then we apply the grayscale rule to make 3 of them updated with the new configurations.

Sometimes we need to expand the number of gray instances dynamically. In that case we would need to restart k8s to add new labels. Your scenario only suits a static gray instance count.

whl12345 avatar Nov 19 '21 10:11 whl12345

I think we could always keep some amount of pods with a grayscale label and only apply the rules when we want to test the new configurations? In this case, we don't need to always reboot k8s? e.g. If some application has 10 pods, we always label 3 of them with a grayscale label. If there is no grayscale rule, then all the 10 pods receive the same configurations. If there is some new configuration we need to test, then we apply the grayscale rule to make 3 of them updated with the new configurations.

We agree that the grayscale label is a good idea when the application and its configuration are grayscaled at the same time in a k8s environment.

But in most cases, users rely on the config center precisely because they want to modify configuration rather than code while running tests. Do you agree?

Also, in our company most applications are deployed on k8s and all containers are identical; users don't care which node is normal and which is gray. They just want some nodes' configuration to be grayscaled so they can verify it.

But the gray IP list should be shown on the portal, so they can check the grayscale effect.

yu-hailong avatar Nov 19 '21 11:11 yu-hailong

Well, then the grayscale amount rule sounds like a reasonable feature to me. However, we need to make clear to users what this feature does, e.g. that it dynamically updates the gray instance list. I'm looking forward to the implementation.

nobodyiam avatar Nov 20 '21 05:11 nobodyiam