Jesse Hu comments

Results: 76 comments of Jesse Hu

🐛 Delete out of date machines with unhealthy control plane component conditions when rolling out KCP

Thanks @sbueringer @neolit123. +1 on this if it works well. Should the PR title be the final squashed commit message? The commits and review history are also preserved....

🐛 Delete out of date machines with unhealthy control plane component conditions when rolling out KCP

Thanks a lot @sbueringer @fabriziopandini for your review and patience! The auto cherry-pick failed for release-1.6 and 1.5. My team member @Levi080513 can create a new PR for release-1.6 separately if...

Improve patch helper

We hit bugs in the scenario described by @sbueringer when using the CAPI patchHelper in our controllers, because optimistic locking is not used to write Spec & Status, only...

Improve patch helper

This is the case in the CAPI controller, as [PatchHelper](https://github.com/kubernetes-sigs/cluster-api/blob/2c0771782941d624e6281c953ffb33413ce9106a/util/patch/patch.go#L133-L143) patches CR.Status.Conditions -> CR.Spec & CR.Metadata -> CR.Status in sequence. > Optimistic locking is not used to write Spec &...

Improve patch helper

We hit another problem caused by the CAPI patchHelper not setting resourceVersion. When creating two ClusterResourceSets for a Cluster at the same time, CAPI starts [reconciling ClusterResourceSets](https://github.com/kubernetes-sigs/cluster-api/blob/8b2541151f049ae975591cb0921c72cc6b022326/exp/addons/internal/controllers/clusterresourceset_controller.go#L266) and both reconciles use...

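What is the granularity of the partitioning?

Hi @archlitchi, does the following compute-control behavior exist: the utilization of the GPU compute units exceeds the configured value (for example, a single card is split into 2 virtual cards and memory is capped at 50%, but the compute utilization of one virtual card exceeds 50% during some short time intervals)?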

MD.Status.ReadyReplicas changes from 3 to 0 when machineset_controller updateStatus() hits "Unable to retrieve Node status" error

Thanks @fabriziopandini. The ErrClusterLocked error should go away within a short time, so marking the Node as notReady or as an unknown replica immediately after hitting ErrClusterLocked might be overly reactive....

MD.Status.ReadyReplicas changes from 3 to 0 when machineset_controller updateStatus() hits "Unable to retrieve Node status" error

BTW this could also be impacted by https://github.com/kubernetes-sigs/cluster-api/pull/9810, discussed in https://github.com/kubernetes-sigs/cluster-api/issues/10165#issuecomment-1952727622

MD.Status.ReadyReplicas changes from 3 to 0 when machineset_controller updateStatus() hits "Unable to retrieve Node status" error

I made a PR to fix this bug with a simple approach (*not* implementing unknownReplicas). Please kindly take a look. Thanks!

MD.Status.ReadyReplicas changes from 3 to 0 when machineset_controller updateStatus() hits "Unable to retrieve Node status" error

/area machineset