mxnet-operator
The status of worker-0 is Error, but the status of the MXJob is Succeeded
kubeflow version: 0.5.0
mxnet-operator version: v1beta1
kubernetes dashboard display:
worker-0 log:
INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, data_dir='/admin/public/model/mxnet_distributed/data', disp_batches=10, dtype='float32', gc_threshold=0.5, gc_type='none', gpus='0', image_shape='1, 28, 28', initializer='default', kv_store='dist_device_sync', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=2, num_examples=6000, num_layers=2, optimizer='sgd', profile_server_suffix='', profile_worker_suffix='', save_period=1, test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
Traceback (most recent call last):
File "/admin/public/model/mxnet_model/mxnet_distributed/train_mnist.py", line 99, in
mxjob status:
{
  "status": {
    "completionTime": "2019-05-21T08:37:24Z",
    "conditions": [
      {
        "lastTransitionTime": "2019-05-21T08:36:41Z",
        "lastUpdateTime": "2019-05-21T08:36:41Z",
        "message": "MXJob mxnet-8d1f211e is created.",
        "reason": "MXJobCreated",
        "status": "True",
        "type": "Created"
      },
      {
        "lastTransitionTime": "2019-05-21T08:36:41Z",
        "lastUpdateTime": "2019-05-21T08:36:46Z",
        "message": "MXJob mxnet-8d1f211e is running.",
        "reason": "MXJobRunning",
        "status": "False",
        "type": "Running"
      },
      {
        "lastTransitionTime": "2019-05-21T08:36:41Z",
        "lastUpdateTime": "2019-05-21T08:37:24Z",
        "message": "MXJob mxnet-8d1f211e is successfully completed.",
        "reason": "MXJobSucceeded",
        "status": "True",
        "type": "Succeeded"
      }
    ],
    "mxReplicaStatuses": {
      "Scheduler": {},
      "Server": {},
      "Worker": {}
    },
    "startTime": "2019-05-21T08:36:44Z"
  }
}
Does this error occur randomly, or does it always appear after a particular operation? I ran some tests and found that the status goes wrong only when a worker breaks down and the scheduler completes at the same time. Does your scheduler stop together with the worker?
Completion of the scheduler leads to the MXJob being marked as succeeded. It's possible that when the scheduler completed, the worker was still running rather than in an error state, so the MXJob status was set to Succeeded.
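To make the race described above concrete, here is a minimal sketch in Python. It is a simplified model for illustration only, not the operator's actual code (the operator is written in Go). `ReplicaStatus`, `condition_scheduler_only`, and `condition_strict` are hypothetical names, and the stricter rule is just one possible alternative, not a confirmed fix.

```python
from dataclasses import dataclass


@dataclass
class ReplicaStatus:
    """Hypothetical per-replica-type pod counts observed at one reconcile."""
    active: int = 0
    succeeded: int = 0
    failed: int = 0


def condition_scheduler_only(statuses, scheduler_replicas=1):
    """Model of the behavior described in the thread:
    the job succeeds as soon as the scheduler has completed."""
    sched = statuses["Scheduler"]
    if sched.succeeded >= scheduler_replicas:
        return "Succeeded"
    if sched.failed > 0:
        return "Failed"
    return "Running"


def condition_strict(statuses, scheduler_replicas=1):
    """Assumed stricter rule: any failed replica fails the job, and
    success also requires that no worker is still active."""
    if any(s.failed > 0 for s in statuses.values()):
        return "Failed"
    sched = statuses["Scheduler"]
    if sched.succeeded >= scheduler_replicas and statuses["Worker"].active == 0:
        return "Succeeded"
    return "Running"


# Snapshot 1: scheduler has completed, worker-0 is still counted as running.
racing = {
    "Scheduler": ReplicaStatus(succeeded=1),
    "Server": ReplicaStatus(succeeded=1),
    "Worker": ReplicaStatus(active=1),
}

# Snapshot 2: a moment later, worker-0 is observed as failed.
after_crash = {
    "Scheduler": ReplicaStatus(succeeded=1),
    "Server": ReplicaStatus(succeeded=1),
    "Worker": ReplicaStatus(failed=1),
}

for name, snapshot in [("racing", racing), ("after_crash", after_crash)]:
    print(name, condition_scheduler_only(snapshot), condition_strict(snapshot))
```

With the scheduler-only rule, the first snapshot already yields a terminal Succeeded state, so the later worker failure is never reflected in the job status. Under the assumed stricter rule, the job stays Running at the first snapshot and becomes Failed at the second.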