cloud-pipeline Issue 880 wait for lost node

Open SilinPavel opened this issue 5 years ago • 0 comments

This PR implements #880. It changes the kube configuration as discussed in #880, and it implements deferred failing of a pipeline that runs on a lost node. Under this approach:

A new system preference is introduced: cluster.node.lost.tolerant.seconds
AutoscaleManager stores the timestamp at which a node was first detected as lost
On each subsequent check, it verifies that the timeout (cluster.node.lost.tolerant.seconds) has not been exhausted
If the timeout is exhausted, the pipeline running on this node is failed
If the lost node recovers before the timeout is exhausted, nothing happens: the pipeline continues to run and the timestamp of the last loss is reset
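
The steps above can be sketched roughly as follows. This is an illustrative sketch only, not the code from this PR: the class name LostNodeTracker, its check method, the "ok" / "wait" / "fail" return values and the injectable clock are all hypothetical, and treating the timeout as exhausted once the elapsed time reaches the preference value is an assumption.

```python
import time
from typing import Callable, Dict


class LostNodeTracker:
    """Hypothetical helper that defers failing a pipeline on a lost node."""

    def __init__(self, tolerant_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        # tolerant_seconds stands in for cluster.node.lost.tolerant.seconds.
        self.tolerant_seconds = tolerant_seconds
        self._clock = clock
        # Node name -> time at which the node was first seen as lost.
        self._first_lost_at: Dict[str, float] = {}

    def check(self, node: str, is_lost: bool) -> str:
        """Return "ok", "wait" or "fail" for one autoscaler pass."""
        now = self._clock()
        if not is_lost:
            # Node recovered (or was never lost): reset the stored timestamp.
            self._first_lost_at.pop(node, None)
            return "ok"
        # Remember only the first moment the node was seen as lost.
        first_lost = self._first_lost_at.setdefault(node, now)
        if now - first_lost >= self.tolerant_seconds:
            # Timeout exhausted (assumed inclusive): the pipeline should be failed.
            del self._first_lost_at[node]
            return "fail"
        # Still within the tolerance window: keep waiting.
        return "wait"
```

A caller would invoke check once per node on every autoscaler pass and fail the node's pipeline only when it returns "fail". Passing the clock in as a parameter keeps the timeout logic testable without real waiting.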

Jan 22 '20 12:01 SilinPavel
