chainer-operator
Move to bare pod model like TFJob
Currently chainer-operator expands a ChainerJob into:
- one master (kind: Job)
- several worker sets (kind: StatefulSets)
A kind: Job keeps retrying even when the failure is caused by a bug in user code. activeDeadlineSeconds can mitigate this; however, it doesn't work in practice because real jobs often run for a very long time, so no single deadline fits them.
So my idea is to drop Job and StatefulSets and move to a bare Pod model like TFJob. Users could then use restartPolicy: ExitCode to control retry behavior.
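To make the idea concrete, here is a minimal sketch of the decision an operator could make per Pod under an exit-code-based policy. It assumes a convention where exit codes 1-127 mean a permanent error (for example a user code bug) and 128-255 mean a retryable error (for example the container was killed by a signal). The names `Action` and `decide_action` are hypothetical and not part of chainer-operator or TFJob.

```python
from enum import Enum


class Action(Enum):
    SUCCEED = "succeed"   # container finished normally
    RESTART = "restart"   # retryable failure: recreate the Pod
    FAIL = "fail"         # permanent failure: mark the job failed, do not retry


def decide_action(exit_code: int) -> Action:
    """Map a container exit code to an operator action.

    Assumed convention (not taken from chainer-operator):
      0        -> success
      1-127    -> permanent error, e.g. a bug in user code
      128-255  -> retryable error, e.g. killed by SIGKILL (137) or SIGTERM (143)
    """
    if exit_code == 0:
        return Action.SUCCEED
    if 1 <= exit_code <= 127:
        return Action.FAIL
    if 128 <= exit_code <= 255:
        return Action.RESTART
    raise ValueError(f"exit code out of range: {exit_code}")


if __name__ == "__main__":
    for code in (0, 1, 137):
        print(code, decide_action(code).value)
```

With bare Pods the operator owns this decision itself, so a user code bug that exits with 1 fails the job once instead of being retried indefinitely by kind: Job.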
/area 0.4.0
/priority p1
/priority p2
@everpeace do we need any other proposals/design docs?
/remove-priority p1
@chrisheecho I would like to update the current design docs. I'd also like to follow the PyTorch/TensorFlow operator API model.