mxnet-operator

the status of worker-0 is error, but the status of mxjob is Succeeded

jokerwenxiao opened this issue 5 years ago · 2 comments

kubeflow version: 0.5.0
mxnet-operator version: v1beta1

kubernetes dashboard display: (screenshot)

worker-0 log:

```
INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, data_dir='/admin/public/model/mxnet_distributed/data', disp_batches=10, dtype='float32', gc_threshold=0.5, gc_type='none', gpus='0', image_shape='1, 28, 28', initializer='default', kv_store='dist_device_sync', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=2, num_examples=6000, num_layers=2, optimizer='sgd', profile_server_suffix='', profile_worker_suffix='', save_period=1, test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
Traceback (most recent call last):
  File "/admin/public/model/mxnet_model/mxnet_distributed/train_mnist.py", line 99, in <module>
    fit.fit(args, sym, get_mnist_iter)
  File "/admin/public/model/mxnet_model/mxnet_distributed/common/fit.py", line 180, in fit
    (train, val) = data_loader(args, kv)
  File "/admin/public/model/mxnet_model/mxnet_distributed/train_mnist.py", line 57, in get_mnist_iter
    'train-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz')
  File "/admin/public/model/mxnet_model/mxnet_distributed/train_mnist.py", line 37, in read_data
    with gzip.open(os.path.join(args.data_dir, label)) as flbl:
  File "/opt/conda/lib/python3.6/gzip.py", line 53, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/opt/conda/lib/python3.6/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/admin/public/model/mxnet_distributed/data/train-labels-idx1-ubyte.gz'
```
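The traceback shows the worker crashing because the MNIST archives are missing from `data_dir`. As a hedged sketch (the helper name and file list are illustrative, not part of the original `train_mnist.py`), a pre-flight check like this would let the worker fail fast with an actionable message instead of dying inside `gzip.open`:

```python
import os

# Standard MNIST archive names expected under data_dir
# (assumed here; adjust to whatever the training script actually reads).
REQUIRED_MNIST_FILES = [
    "train-labels-idx1-ubyte.gz",
    "train-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
]

def check_mnist_files(data_dir):
    """Return the list of missing archive paths (empty if all are present)."""
    return [
        os.path.join(data_dir, name)
        for name in REQUIRED_MNIST_FILES
        if not os.path.exists(os.path.join(data_dir, name))
    ]
```

Calling this at startup and exiting non-zero when the list is non-empty would also make the pod's failure reason obvious in `kubectl logs`.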

mxjob status:

```json
{
    "status": {
        "completionTime": "2019-05-21T08:37:24Z",
        "conditions": [
            {
                "lastTransitionTime": "2019-05-21T08:36:41Z",
                "lastUpdateTime": "2019-05-21T08:36:41Z",
                "message": "MXJob mxnet-8d1f211e is created.",
                "reason": "MXJobCreated",
                "status": "True",
                "type": "Created"
            },
            {
                "lastTransitionTime": "2019-05-21T08:36:41Z",
                "lastUpdateTime": "2019-05-21T08:36:46Z",
                "message": "MXJob mxnet-8d1f211e is running.",
                "reason": "MXJobRunning",
                "status": "False",
                "type": "Running"
            },
            {
                "lastTransitionTime": "2019-05-21T08:36:41Z",
                "lastUpdateTime": "2019-05-21T08:37:24Z",
                "message": "MXJob mxnet-8d1f211e is successfully completed.",
                "reason": "MXJobSucceeded",
                "status": "True",
                "type": "Succeeded"
            }
        ],
        "mxReplicaStatuses": {
            "Scheduler": {},
            "Server": {},
            "Worker": {}
        },
        "startTime": "2019-05-21T08:36:44Z"
    }
}
```

jokerwenxiao · May 21 '19 09:05

Does this error occur only occasionally, or does it always appear after some specific operation? I ran some tests and found that the status only goes wrong when a worker breaks down and the scheduler completes at the same time. Does your scheduler stop together with the worker?

KingOnTheStar · Jun 13 '19 12:06

Scheduler completion leads the mxjob to be marked as succeeded. It's possible that when the scheduler completed, the worker was still running rather than in an error state, so the mxjob status was set to Succeeded.
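The race being described can be sketched as follows. This is a minimal illustration, not the operator's actual Go controller code; the function name and the count fields (mirroring the `mxReplicaStatuses` structure above) are assumptions. A controller that requires every worker to finish, and fails the job on any failed replica, would avoid marking a job Succeeded while a worker later errors out:

```python
def job_condition(replica_statuses, num_workers):
    """Derive a job-level condition from per-replica counts.

    replica_statuses maps replica type to counts, e.g.
    {"Scheduler": {"succeeded": 1}, "Worker": {"active": 0, "failed": 1}}.
    """
    # Any failed replica fails the whole job, even if the
    # scheduler already exited cleanly.
    for counts in replica_statuses.values():
        if counts.get("failed", 0) > 0:
            return "Failed"
    # Key success off all workers succeeding, not off
    # scheduler completion alone.
    workers = replica_statuses.get("Worker", {})
    if workers.get("succeeded", 0) == num_workers:
        return "Succeeded"
    return "Running"
```

Under this logic, the reported scenario (scheduler done, worker-0 in error) would yield "Failed" instead of "Succeeded".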

KingOnTheStar · Jun 13 '19 12:06