chainer-operator

Move to bare pod model like TFJob

Open everpeace opened this issue 7 years ago • 4 comments

Currently chainer-operator expands ChainerJob to

  • one master, as a kind: Job
  • several worker sets, as kind: StatefulSet

A kind: Job keeps retrying even when the failure was caused by a bug in user code. activeDeadlineSeconds can mitigate this, but it doesn't work in practice because real jobs often run for a very long time, so any deadline long enough for a legitimate run also lets a buggy job retry for just as long.

So, my idea is to drop Job and StatefulSets and move to a bare Pod model like TFJob, where the operator creates and watches the Pods directly. Then users can use retryPolicy: ErrorCode to control retry behavior.
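To make the intent concrete, here is a minimal sketch of the decision an operator could make per Pod under an exit-code based policy. It assumes the convention TFJob documents for its exit-code policy (0 is success, 1-127 is a permanent error, 128-255 is retryable). The names `RetryDecision` and `decide_retry` are hypothetical and not part of chainer-operator or TFJob.

```python
from enum import Enum


class RetryDecision(Enum):
    """Hypothetical outcomes for a finished container under an exit-code policy."""
    SUCCEEDED = "succeeded"
    FAILED = "failed"   # permanent error, e.g. a bug in user code: do not retry
    RETRY = "retry"     # transient error, e.g. killed by a signal: recreate the Pod


def decide_retry(exit_code: int) -> RetryDecision:
    """Map a container exit code to a retry decision.

    Assumed convention (borrowed from TFJob's exit-code policy):
      0        -> success
      1..127   -> permanent failure, no retry
      128..255 -> retryable failure
    """
    if exit_code == 0:
        return RetryDecision.SUCCEEDED
    if 1 <= exit_code <= 127:
        return RetryDecision.FAILED
    if 128 <= exit_code <= 255:
        return RetryDecision.RETRY
    raise ValueError(f"exit code out of range: {exit_code}")


if __name__ == "__main__":
    # An uncaught Python exception exits with 1; SIGKILL yields 128 + 9 = 137.
    for code in (0, 1, 137):
        print(code, decide_retry(code).value)
```

With this kind of rule, a crash from a user code bug stops the job immediately, while a Pod killed by the node (for example, exit code 137) is retried. A kind: Job cannot make that distinction.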

/area 0.4.0 /priority p1

everpeace Oct 10 '18 00:10

/priority p2

jbottum Nov 04 '18 00:11

@everpeace Do we need any other proposals or design docs?

chrisheecho Nov 05 '18 19:11

/remove-priority p1

chrisheecho Nov 05 '18 19:11

@chrisheecho I would like to update the current design docs. I'd also like to follow the PyTorch/TensorFlow API model.

everpeace Nov 10 '18 01:11