
Consider supporting SuccessPolicy and FailurePolicy

Open terrytangyuan opened this issue 4 years ago • 4 comments

We recently added SuccessPolicy to tf-operator in https://github.com/kubeflow/tf-operator/pull/1165 and are considering adding FailurePolicy to handle failure cases in https://github.com/kubeflow/tf-operator/issues/1170. Once these are mature, and if we see a common pattern in other operators, we should consider moving them to kubeflow/common.

cc @gaocegege @Jeffwan @johnugeorge @ChanYiLin @pingsutw

terrytangyuan, Jun 17 '20 14:06

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/feature 0.77
area/operator 0.85


issue-label-bot[bot], Jun 17 '20 14:06

Issue-Label Bot is automatically applying the labels:

Label Probability
feature 0.77


kf-label-bot-dev[bot], Jun 17 '20 14:06

Having success/failure policies would be great. It would make it easier for different frameworks to handle errors, and it would help make the reconciler logic extensible.

Jeffwan, Jun 22 '20 16:06

With fault-tolerant and elastic distributed training spreading to more frameworks, a universal definition of failure and success for a distributed training job would help developers clarify the logic for handling pods that have failed or recently joined.

zw0610, Aug 11 '20 07:08
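
To make the idea of a shared success/failure definition more concrete, here is a minimal sketch of how a reconciler might derive a job phase from its worker pod phases. Everything in it is an assumption for illustration: the enum members, the `job_phase` helper, and the exact semantics of each policy are hypothetical and are not the tf-operator or kubeflow/common API. It is written in Python only to keep it short; the operators themselves are written in Go.

```python
from enum import Enum
from typing import List


class SuccessPolicy(Enum):
    # Hypothetical policies, for illustration only.
    DEFAULT = "Default"          # assumed: job succeeds once worker 0 succeeds
    ALL_WORKERS = "AllWorkers"   # assumed: job succeeds only when every worker succeeds


class FailurePolicy(Enum):
    # Hypothetical policies, for illustration only.
    ANY_WORKER = "AnyWorker"     # assumed: job fails as soon as any worker fails
    ALL_WORKERS = "AllWorkers"   # assumed: job fails only when every worker has failed


def job_phase(worker_phases: List[str],
              success: SuccessPolicy,
              failure: FailurePolicy) -> str:
    """Return "Succeeded", "Failed", or "Running" for the whole job.

    worker_phases holds one pod phase per worker, ordered by worker index.
    Failure is checked before success, so a failed worker takes priority.
    """
    if not worker_phases:
        # No workers observed yet (e.g. pods still being created).
        return "Running"

    total = len(worker_phases)
    failed = sum(1 for p in worker_phases if p == "Failed")
    succeeded = sum(1 for p in worker_phases if p == "Succeeded")

    if failure is FailurePolicy.ANY_WORKER and failed > 0:
        return "Failed"
    if failure is FailurePolicy.ALL_WORKERS and failed == total:
        return "Failed"

    if success is SuccessPolicy.DEFAULT and worker_phases[0] == "Succeeded":
        return "Succeeded"
    if success is SuccessPolicy.ALL_WORKERS and succeeded == total:
        return "Succeeded"

    return "Running"
```

With a more tolerant failure policy such as the hypothetical `FailurePolicy.ALL_WORKERS`, a single failed pod leaves the job in "Running", which is the kind of behavior an elastic training job might want. Keeping the decision in one function like this is one way the same logic could be shared across operators instead of being reimplemented in each reconciler.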