cloud-pipeline
Issue 880 wait for lost node
This PR implements #880. It makes some changes to the kube configuration, as discussed in #880, and implements deferred failing of a pipeline that runs on a lost node. The approach is as follows:
- A new system preference was introduced: `cluster.node.lost.tolerant.seconds`
- `AutoscaleManager` stores a `timestamp` when a node is lost for the first time
- On each subsequent check it verifies that the timeout (`cluster.node.lost.tolerant.seconds`) has not been exhausted
- If the timeout is exhausted, the pipeline running on this node is failed
- If the lost node recovers before the timeout is exhausted, nothing happens: the pipeline continues to work and the timestamp of the last loss is reset
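
The steps above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the `LostNodeTracker` class, its `check` method, and the injectable `clock` are hypothetical names, and the sketch assumes the timeout counts as exhausted once the elapsed time reaches the configured value.

```python
import time
from typing import Callable, Dict


class LostNodeTracker:
    """Hypothetical helper illustrating deferred failing; not the actual AutoscaleManager."""

    def __init__(self, tolerant_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        # Stands in for the value of cluster.node.lost.tolerant.seconds.
        self.tolerant_seconds = tolerant_seconds
        self._clock = clock
        # Node name -> time at which the node was first seen as lost.
        self._first_lost_at: Dict[str, float] = {}

    def check(self, node: str, is_lost: bool) -> bool:
        """Return True when the pipeline on `node` should be failed."""
        if not is_lost:
            # The node recovered in time: reset the stored timestamp.
            self._first_lost_at.pop(node, None)
            return False
        now = self._clock()
        # Store the timestamp only the first time the node is seen as lost.
        first_lost = self._first_lost_at.setdefault(node, now)
        # Assumption: "exhausted" means elapsed time >= the tolerance.
        return now - first_lost >= self.tolerant_seconds
```

A periodic check would call `tracker.check(node_name, is_lost)` for each node and fail the pipeline only when it returns `True`. Passing the clock in as a parameter keeps the timing logic easy to test without sleeping.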