fast-autoaugment
Stuck after iteration
After the iterative search over the parameter space finishes, the process hangs and prints no error message (399 is the last iteration). The last lines of output before the hang:
iter 397 ma=0.509 OrderedDict([('RUNNING', 1), ('TERMINATED', 198), ('PENDING', 1), ('PAUSED', 0), ('ERROR', 0)]
2021-05-07 16:49:31,787 WARNING logger.py:126 -- Couldn't import TensorFlow - disabling TensorBoard logging.
2021-05-07 16:49:31,787 WARNING logger.py:220 -- Could not instantiate <class 'ray.tune.logger.TFLogger'> - skipping.
iter 398 ma=0.509 OrderedDict([('RUNNING', 2), ('TERMINATED', 198), ('PENDING', 0), ('PAUSED', 0), ('ERROR', 0)]
2021-05-07 16:49:48,651 INFO ray_trial_executor.py:178 -- Destroying actor for trial search_par_resnet50_fold1_ratio0.4_200_cv_fold=1,cv_ratio_test=0.4,dataroot=_home_ccf_project_SB_PAR_data_rapv2_,level_0_0=0.77372,level_0_1=0.45162,level_1_0=0.00049368,level_1_1=0.39083,level_2_0=0.46218,level_2_1=0.69141,level_3_0=0.0028208,level_3_1=0.27047,level_4_0=0.65674,level_4_1=0.84919,num_op=2,num_policy=5,policy_0_0=3,policy_0_1=7,policy_1_0=0,policy_1_1=10,policy_2_0=13,policy_2_1=7,policy_3_0=5,policy_3_1=12,policy_4_0=11,policy_4_1=1,prob_0_0=0.40213,prob_0_1=0.39349,prob_1_0=0.47788,prob_1_1=0.63856,prob_2_0=0.6497,prob_2_1=0.50779,prob_3_0=0.58183,prob_3_1=0.30122,prob_4_0=0.62576,prob_4_1=0.92233,save_path=_home_ccf_project_fastautoaugment_models_par_resnet50_ratio0.4_fold1.model. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
iter 399 ma=0.509 OrderedDict([('RUNNING', 1), ('TERMINATED', 199), ('PENDING', 0), ('PAUSED', 0), ('ERROR', 0)]
I found that the hang only happens when the following error is reported at some point before the iterations finish. When the error does not appear, the search continues to the next stage normally.
iter 364 ma=0.509 OrderedDict([('RUNNING', 2), ('TERMINATED', 181), ('PENDING', 17), ('PAUSED', 0), ('ERROR', 0)
(pid=45772) WARNING: Logging before InitGoogleLogging() is written to STDERR
(pid=45772) E0507 16:44:34.685359 45846 raylet_client.cc:345] IOError: [RayletClient] Connection closed unexpectedly. [RayletClient] Failed to push profile events.
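To make the symptom easier to check, here is a minimal sketch that parses the `iter ...` status lines above and counts trials that have not finished. It uses only the Python standard library and does not call any Ray API; the helper names `parse_status` and `unfinished` are hypothetical and only for illustration. On the last line of my log it reports one trial still in RUNNING, which is what I would expect if the trial whose worker lost its raylet connection never reports back (that cause is my assumption, not something I have confirmed).

```python
import re

# Matches lines such as:
# iter 399 ma=0.509 OrderedDict([('RUNNING', 1), ('TERMINATED', 199), ...
STATUS_RE = re.compile(r"iter (\d+) ma=([\d.]+) OrderedDict\(\[(.*)")
PAIR_RE = re.compile(r"\('(\w+)', (\d+)\)")


def parse_status(line):
    """Return (iteration, moving_average, {state: count}), or None if the
    line is not a status line."""
    m = STATUS_RE.match(line.strip())
    if m is None:
        return None
    counts = {state: int(n) for state, n in PAIR_RE.findall(m.group(3))}
    return int(m.group(1)), float(m.group(2)), counts


def unfinished(counts):
    """Number of trials that have reached neither TERMINATED nor ERROR."""
    return sum(n for state, n in counts.items()
               if state not in ("TERMINATED", "ERROR"))


if __name__ == "__main__":
    last = ("iter 399 ma=0.509 OrderedDict([('RUNNING', 1), "
            "('TERMINATED', 199), ('PENDING', 0), ('PAUSED', 0), ('ERROR', 0)]")
    iteration, ma, counts = parse_status(last)
    print(iteration, ma, unfinished(counts))
```

Running this over the full log shows whether the unfinished count ever drops to zero; in the stuck run it stays at 1 after iteration 399.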
Environment:
ray==0.6.5
python==3.6.9
tensorflow: not installed
OS: CentOS 7
Any suggestions for resolving the IOError?