
Issue 880 wait for lost node

Open SilinPavel opened this issue 5 years ago • 0 comments

This PR implements #880. It changes the kube configuration as discussed in #880, and it implements deferred failing of a pipeline that runs on a lost node. With this approach:

  • A new system preference is introduced: cluster.node.lost.tolerant.seconds
  • AutoscaleManager stores the timestamp at which a node was first detected as lost
  • On each subsequent check, it verifies that the timeout (cluster.node.lost.tolerant.seconds) has not been exhausted
  • If the timeout is exhausted, the pipeline running on this node is failed
  • If the lost node recovers before the timeout is exhausted, nothing happens: the pipeline continues to work, and the timestamp of the last loss is reset
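
The steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual AutoscaleManager code: the class `LostNodeTracker`, the method `should_fail`, and the choice to treat the timeout as exhausted once the elapsed time is greater than or equal to the preference value are all assumptions made for the example.

```python
from typing import Dict


class LostNodeTracker:
    """Hypothetical helper illustrating deferred failing; not the real AutoscaleManager."""

    def __init__(self, tolerant_seconds: float) -> None:
        # Stands in for the cluster.node.lost.tolerant.seconds preference.
        self.tolerant_seconds = tolerant_seconds
        # Node name -> timestamp (seconds) when the node was first seen as lost.
        self._first_lost_at: Dict[str, float] = {}

    def should_fail(self, node: str, is_lost: bool, now: float) -> bool:
        """Return True if the pipeline on `node` should be failed at time `now`."""
        if not is_lost:
            # Node recovered in time: reset the stored timestamp, pipeline keeps running.
            self._first_lost_at.pop(node, None)
            return False
        # Remember only the first moment the node was detected as lost.
        first_lost = self._first_lost_at.setdefault(node, now)
        if now - first_lost >= self.tolerant_seconds:  # assumed comparison
            # Timeout exhausted: forget the node and report that the run must fail.
            del self._first_lost_at[node]
            return True
        return False
```

Passing `now` explicitly keeps the sketch deterministic; a periodic check would typically call `should_fail(node, is_lost, time.time())` for each node on every autoscaler cycle.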

SilinPavel, Jan 22 '20 12:01