Daniel Kłobuszewski comments

Results: 124 comments of Daniel Kłobuszewski

Race condition leading to undesired scale nodes of non-empty nodes (without pod eviction)

/reopen /remove-lifecycle rotten

Race condition leading to undesired scale nodes of non-empty nodes (without pod eviction)

I wonder if using the NoExecute taint effect instead of NoSchedule would be sufficient to fix this, perhaps with some configurable delay between tainting and actually removing the node.

Race condition leading to undesired scale nodes of non-empty nodes (without pod eviction)

Today it applies a NoSchedule taint, manually evicts pods using the eviction API and then deletes the node. In case of empty nodes, it just applies the taint and deletes the node...
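
The sequence described in this comment can be sketched roughly as follows. This is only an illustration in plain Go, not the Cluster Autoscaler code: the `Node` type, the `ScaleDown` function and the step strings are all hypothetical names made up for this sketch, and no real Kubernetes API is called.

```go
package main

import "fmt"

// Node is a hypothetical, simplified stand-in for a cluster node.
// It is not a Kubernetes API type.
type Node struct {
	Name   string
	Taints []string // taint effects applied so far, e.g. "NoSchedule"
	Pods   []string // names of pods currently on the node
}

// ScaleDown mimics the order of operations described in the comment:
// taint first, evict pods only if there are any, then delete the node.
// It returns the steps taken so the order can be inspected.
func ScaleDown(n *Node) []string {
	steps := []string{}

	// Step 1: the taint is applied to empty and non-empty nodes alike.
	n.Taints = append(n.Taints, "NoSchedule")
	steps = append(steps, "taint:NoSchedule")

	// Step 2: eviction happens only for nodes that had pods
	// at the time of the check (assumption of this sketch).
	for _, p := range n.Pods {
		steps = append(steps, "evict:"+p)
	}
	n.Pods = nil

	// Step 3: the node itself is deleted.
	steps = append(steps, "delete:"+n.Name)
	return steps
}

func main() {
	fmt.Println(ScaleDown(&Node{Name: "busy", Pods: []string{"web-0"}}))
	fmt.Println(ScaleDown(&Node{Name: "idle"}))
}
```

For an empty node the sketch goes straight from tainting to deletion, which is the path the comment singles out.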

Race condition leading to undesired scale nodes of non-empty nodes (without pod eviction)

Ok, I think with that this becomes a fairly well-defined task, let's see if someone would be able to pick it up. Hopefully the change can be confined just to...

Race condition leading to undesired scale nodes of non-empty nodes (without pod eviction)

/help

Race condition leading to undesired scale nodes of non-empty nodes (without pod eviction)

Hi @jan-skarupa, thanks for looking into this! Interesting, so it looks like VM deletion is just causing the OS to send SIGTERM to kubelet, which then [initiates graceful shutdown in 1.21+...

Race condition leading to undesired scale nodes of non-empty nodes (without pod eviction)

To actually avoid the race condition at all, we would have to separate tainting from drain&deletion. That would require changes to ScaleDown interface so that Actuator would have two separate...

Race condition leading to undesired scale nodes of non-empty nodes (without pod eviction)

@jan-skarupa are you up for this? It is definitely a bigger change than just adding `NoExecute` taint.

Race condition leading to undesired scale nodes of non-empty nodes (without pod eviction)

I talked offline about this with @MaciekPytel. The conclusion we came up with was that it should be both simpler and less risky if the Actuator treats an empty node becoming non-empty...

Identifying cloud provider deleted nodes

I think extending the API would be much cleaner, but the need to implement it for all cloud providers calls for a broader discussion. I added this topic to SIG...