
Question about output value

Open sweetdream33 opened this issue 6 years ago • 11 comments

Thank you for providing us with the code. I'm running cifar10_macro_search.sh. The code, data, and hyperparameters were taken as-is and not modified. The tail of the cifar10_macro_search.sh run (in the stdout file) looks like:


[2] [3 0] [0 1 0] [3 1 0 1] [0 0 0 0 0] [0 1 0 1 0 1] [2 1 1 0 1 0 0] [1 1 0 0 1 0 0 0] [1 0 0 0 0 1 1 0 0] [2 0 0 1 0 0 0 0 0 0] [1 0 0 0 0 1 0 0 1 1 0] [5 0 1 0 0 0 1 0 0 0 1 0] val_acc=0.7734

[2] [1 0] [1 0 0] [5 0 0 1] [0 0 1 0 0] [2 0 1 0 0 0] [2 1 1 0 0 0 1] [3 0 0 0 1 0 1 1] [3 0 1 0 0 0 1 0 1] [5 1 0 0 0 0 0 0 0 0] [0 1 0 1 1 0 0 0 1 0 1] [1 1 0 0 0 1 1 1 0 1 1 1] val_acc=0.8672

[4] [4 0] [2 0 1] [0 0 0 1] [0 0 0 0 0] [0 0 0 0 1 1] [0 0 0 0 0 0 0] [5 1 0 0 0 0 1 0] [5 1 0 0 0 0 0 0 1] [5 0 0 1 0 1 1 0 0 0] [1 1 1 1 0 1 0 0 0 0 0] [0 0 0 1 1 0 0 0 1 0 0 1] val_acc=0.7422

[0] [4 1] [1 1 0] [1 0 0 0] [5 0 1 1 0] [2 1 0 0 1 0] [0 0 0 1 1 1 1] [5 0 1 1 1 0 0 0] [4 0 0 0 0 0 0 1 0] [4 0 0 0 1 0 0 0 0 1] [5 1 0 0 0 1 0 0 1 0 0] [0 0 0 0 0 0 1 0 0 0 0 0] val_acc=0.8516

Epoch 310: Eval
Eval at 109120 valid_accuracy: 0.8068
Eval at 109120 test_accuracy: 0.7946

Now I think I'm supposed to take the architecture with the best validation accuracy, which is: [2] [1 0] [1 0 0] [5 0 0 1] [0 0 1 0 0] [2 0 1 0 0 0] [2 1 1 0 0 0 1] [3 0 0 0 1 0 1 1] [3 0 1 0 0 0 1 0 1] [5 1 0 0 0 0 0 0 0 0] [0 1 0 1 1 0 0 0 1 0 1] [1 1 0 0 0 1 1 1 0 1 1 1] val_acc=0.8672

Is that right? Is this result the final discovered (optimal) network from macro search? The validation accuracy of the selected architecture is 86.72%, which is lower than the 96.1% reported. What's the reason? Do I have to modify parameters (num_epochs, batch_size, ...)?
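For reference, the bracketed groups in those stdout lines can be decoded mechanically. A sketch based on my reading of the paper's macro search space (not repo code): in each group, the first number picks one of 6 ops and the remaining binary entries mark skip connections from earlier layers.

```python
import re

# Decode a macro-search architecture line from the ENAS stdout.
# Assumed interpretation: group i holds one op id followed by i skip bits.
def parse_arch(line):
    groups = re.findall(r"\[([^\]]+)\]", line)
    layers = []
    for i, g in enumerate(groups):
        nums = [int(x) for x in g.split()]
        op, skips = nums[0], nums[1:]
        assert len(skips) == i, "layer %d should have %d skip bits" % (i, i)
        layers.append((op, skips))
    return layers

arch = parse_arch(
    "[2] [1 0] [1 0 0] [5 0 0 1] [0 0 1 0 0] [2 0 1 0 0 0] "
    "[2 1 1 0 0 0 1] [3 0 0 0 1 0 1 1] [3 0 1 0 0 0 1 0 1] "
    "[5 1 0 0 0 0 0 0 0 0] [0 1 0 1 1 0 0 0 1 0 1] "
    "[1 1 0 0 0 1 1 1 0 1 1 1]"
)
print(len(arch))   # 12 layers
print(arch[3])     # (5, [0, 0, 1]) -> op 5, skip connection from layer 2
```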

sweetdream33 avatar Apr 05 '18 08:04 sweetdream33

Yes, your output looks correct. However, you need to take the architecture recommended by ENAS and retrain it from scratch in order to achieve good accuracy. Details are discussed in our paper, and also in this issue.

hyhieu avatar Apr 05 '18 14:04 hyhieu

I'm confused about cifar10_macro_final.sh. In the original paper, the DAG size of the macro search space is set to 12, but $fixed_arc in cifar10_macro_final.sh seems to have more than 12 layers?

zeus7777777 avatar Apr 08 '18 07:04 zeus7777777

We just reran different experiments, which resulted in different architectures. You can just set the DAG size to 12 as in the paper.

hyhieu avatar Apr 08 '18 12:04 hyhieu

Hi, when I set the DAG size to 12, I ran into an error. The fixed_arc I used is:

fixed_arc="1"
fixed_arc="$fixed_arc 3 0"
fixed_arc="$fixed_arc 0 0 1"
fixed_arc="$fixed_arc 4 0 1 0"
fixed_arc="$fixed_arc 0 0 0 1 0"
fixed_arc="$fixed_arc 5 0 0 0 0 0"
fixed_arc="$fixed_arc 0 0 0 0 1 0 0"
fixed_arc="$fixed_arc 2 0 1 1 1 0 1 0"
fixed_arc="$fixed_arc 4 0 0 1 0 0 0 1 0"
fixed_arc="$fixed_arc 0 0 0 0 0 1 1 0 0 0"
fixed_arc="$fixed_arc 2 1 1 1 0 0 0 1 0 0 0"
fixed_arc="$fixed_arc 4 0 1 1 1 0 0 0 0 1 0 0"

(which has val_acc=0.8672), and I modified --child_num_layers=12 --child_out_filters=36,

but it raised an UnboundLocalError:

Traceback (most recent call last):
  File "src/cifar10/main.py", line 359, in <module>
    tf.app.run()
  File "/home/rongshenghai/anaconda3/envs/tf/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "src/cifar10/main.py", line 355, in main
    train()
  File "src/cifar10/main.py", line 223, in train
    ops = get_ops(images, labels)
  File "src/cifar10/main.py", line 190, in get_ops
    child_model.connect_controller(None)
  File "/home/rongshenghai/tensorflow/enas/src/cifar10/general_child.py", line 705, in connect_controller
    self._build_train()
  File "/home/rongshenghai/tensorflow/enas/src/cifar10/general_child.py", line 595, in _build_train
    logits = self._model(self.x_train, is_training=True)
  File "/home/rongshenghai/tensorflow/enas/src/cifar10/general_child.py", line 212, in _model
    x = self._fixed_layer(layer_id, layers, start_idx, out_filters, is_training)
  File "/home/rongshenghai/tensorflow/enas/src/cifar10/general_child.py", line 465, in _fixed_layer
    prev = res_layers + [out]
UnboundLocalError: local variable 'out' referenced before assignment

But when I used the original cifar10_macro_final.sh, whose DAG size is 24, it ran well.
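As a side note, one quick structural sanity check on the fixed_arc string (my own helper, not part of the repo) can rule out a malformed flag: layer i contributes one op id plus i skip bits, so a fixed_arc for N layers must flatten to N + N*(N-1)/2 integers.

```python
# Sanity-check the flattened --child_fixed_arc string for N layers.
# The arc below is the 12-layer one from the comment above.
fixed_arc = (
    "1  3 0  0 0 1  4 0 1 0  0 0 0 1 0  5 0 0 0 0 0  0 0 0 0 1 0 0  "
    "2 0 1 1 1 0 1 0  4 0 0 1 0 0 0 1 0  0 0 0 0 0 1 1 0 0 0  "
    "2 1 1 1 0 0 0 1 0 0 0  4 0 1 1 1 0 0 0 0 1 0 0"
)
n_layers = 12
# Each layer i (0-based) contributes 1 op id + i skip bits.
expected = n_layers + n_layers * (n_layers - 1) // 2
assert len(fixed_arc.split()) == expected
print(len(fixed_arc.split()))  # 78
```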

ShenghaiRong avatar Apr 08 '18 12:04 ShenghaiRong

Thank you for your answer. I'm running cifar10_macro_final.sh. The tail of the run (in the stdout file) looks like:

epoch=306 ch_step=153050 loss=0.001949 lr=0.0011 |g|=0.2311 tr_acc=100/100 mins=2119.42
epoch=306 ch_step=153100 loss=0.001135 lr=0.0011 |g|=0.0480 tr_acc=100/100 mins=2120.06
epoch=306 ch_step=153150 loss=0.001102 lr=0.0011 |g|=0.0725 tr_acc=100/100 mins=2120.71
epoch=306 ch_step=153200 loss=0.001061 lr=0.0011 |g|=0.0932 tr_acc=100/100 mins=2121.35
epoch=306 ch_step=153250 loss=0.001357 lr=0.0011 |g|=0.0557 tr_acc=100/100 mins=2122.00
epoch=306 ch_step=153300 loss=0.001437 lr=0.0011 |g|=0.1014 tr_acc=100/100 mins=2122.64
epoch=306 ch_step=153350 loss=0.001134 lr=0.0011 |g|=0.0374 tr_acc=100/100 mins=2123.29
epoch=306 ch_step=153400 loss=0.001238 lr=0.0011 |g|=0.0612 tr_acc=100/100 mins=2123.93
epoch=306 ch_step=153450 loss=0.001358 lr=0.0011 |g|=0.0592 tr_acc=100/100 mins=2124.58
epoch=307 ch_step=153500 loss=0.000936 lr=0.0011 |g|=0.0476 tr_acc=100/100 mins=2125.22
Epoch 307: Eval
Eval at 153500 test_accuracy: 0.9601
epoch=307 ch_step=153550 loss=0.001563 lr=0.0010 |g|=0.1156 tr_acc=100/100 mins=2126.32
epoch=307 ch_step=153600 loss=0.002924 lr=0.0010 |g|=0.4933 tr_acc=100/100 mins=2126.97
epoch=307 ch_step=153650 loss=0.001787 lr=0.0010 |g|=0.1428 tr_acc=100/100 mins=2127.61
epoch=307 ch_step=153700 loss=0.001629 lr=0.0010 |g|=0.1128 tr_acc=100/100 mins=2128.26
epoch=307 ch_step=153750 loss=0.001239 lr=0.0010 |g|=0.1101 tr_acc=100/100 mins=2128.90
epoch=307 ch_step=153800 loss=0.001421 lr=0.0010 |g|=0.0812 tr_acc=100/100 mins=2129.55
epoch=307 ch_step=153850 loss=0.001244 lr=0.0010 |g|=0.0784 tr_acc=100/100 mins=2130.19
epoch=307 ch_step=153900 loss=0.001778 lr=0.0010 |g|=0.1270 tr_acc=100/100 mins=2130.84
epoch=307 ch_step=153950 loss=0.001900 lr=0.0010 |g|=0.1715 tr_acc=100/100 mins=2131.48
epoch=308 ch_step=154000 loss=0.001303 lr=0.0010 |g|=0.0811 tr_acc=100/100 mins=2132.13
Epoch 308: Eval
Eval at 154000 test_accuracy: 0.9605
epoch=308 ch_step=154050 loss=0.008866 lr=0.0010 |g|=1.3467 tr_acc=100/100 mins=2133.23
epoch=308 ch_step=154100 loss=0.001046 lr=0.0010 |g|=0.0328 tr_acc=100/100 mins=2133.88
epoch=308 ch_step=154150 loss=0.001344 lr=0.0010 |g|=0.0558 tr_acc=100/100 mins=2134.52
epoch=308 ch_step=154200 loss=0.001324 lr=0.0010 |g|=0.0485 tr_acc=100/100 mins=2135.17
epoch=308 ch_step=154250 loss=0.001197 lr=0.0010 |g|=0.0587 tr_acc=100/100 mins=2135.81
epoch=308 ch_step=154300 loss=0.001323 lr=0.0010 |g|=0.0478 tr_acc=100/100 mins=2136.46
epoch=308 ch_step=154350 loss=0.000928 lr=0.0010 |g|=0.0559 tr_acc=100/100 mins=2137.10
epoch=308 ch_step=154400 loss=0.000729 lr=0.0010 |g|=0.0184 tr_acc=100/100 mins=2137.75
epoch=308 ch_step=154450 loss=0.001168 lr=0.0010 |g|=0.0731 tr_acc=100/100 mins=2138.39
epoch=309 ch_step=154500 loss=0.000755 lr=0.0010 |g|=0.0169 tr_acc=100/100 mins=2139.04
Epoch 309: Eval
Eval at 154500 test_accuracy: 0.9589
epoch=309 ch_step=154550 loss=0.000859 lr=0.0010 |g|=0.0329 tr_acc=100/100 mins=2140.14
epoch=309 ch_step=154600 loss=0.003031 lr=0.0010 |g|=1.0015 tr_acc=100/100 mins=2140.79
epoch=309 ch_step=154650 loss=0.001678 lr=0.0010 |g|=0.2013 tr_acc=100/100 mins=2141.44
epoch=309 ch_step=154700 loss=0.000810 lr=0.0010 |g|=0.0335 tr_acc=100/100 mins=2142.08
epoch=309 ch_step=154750 loss=0.001312 lr=0.0010 |g|=0.1542 tr_acc=100/100 mins=2142.73
epoch=309 ch_step=154800 loss=0.001046 lr=0.0010 |g|=0.0383 tr_acc=100/100 mins=2143.37
epoch=309 ch_step=154850 loss=0.001397 lr=0.0010 |g|=0.1267 tr_acc=100/100 mins=2144.02
epoch=309 ch_step=154900 loss=0.001565 lr=0.0010 |g|=0.0715 tr_acc=100/100 mins=2144.66
epoch=309 ch_step=154950 loss=0.001706 lr=0.0010 |g|=0.1621 tr_acc=100/100 mins=2145.31
epoch=310 ch_step=155000 loss=0.001614 lr=0.0010 |g|=0.0683 tr_acc=100/100 mins=2145.95
Epoch 310: Eval
Eval at 155000 test_accuracy: 0.9602


Now I think I'm supposed to take the output with the highest test accuracy, at epoch 308, which is:

Epoch 308: Eval
Eval at 154000 test_accuracy: 0.9605

Is this the test accuracy obtained by applying the test data to the architecture you found, using the optimal parameters (set in macro_final.sh)?

Where are the classification results stored for the test data?
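Picking the best eval epoch out of the stdout file can at least be mechanized; a small sketch (the line format is copied from the log excerpt above, and the regex is my own, not repo code):

```python
import re

# Pick the epoch with the highest test accuracy from an ENAS stdout dump.
log = """\
Epoch 307: Eval Eval at 153500 test_accuracy: 0.9601
Epoch 308: Eval Eval at 154000 test_accuracy: 0.9605
Epoch 309: Eval Eval at 154500 test_accuracy: 0.9589
Epoch 310: Eval Eval at 155000 test_accuracy: 0.9602
"""
evals = re.findall(r"Epoch (\d+):.*?test_accuracy: ([\d.]+)", log)
best_epoch, best_acc = max(evals, key=lambda t: float(t[1]))
print(best_epoch, best_acc)  # 308 0.9605
```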

sweetdream33 avatar Apr 09 '18 00:04 sweetdream33

Same issue as @ShenghaiRong

Besides his modification, I changed --child_num_branches=4 to --child_num_branches=6 to match the training setting in cifar10_macro_search.sh.

BTW, the architectures in cifar10_macro_search.sh and cifar10_macro_final.sh are different; does that need to be documented?

zeus7777777 avatar Apr 10 '18 12:04 zeus7777777

@zeus7777777 Hi, when I changed --child_num_branches=4 to --child_num_branches=6, it still raised the same error. Did you run cifar10_macro_final.sh with a DAG size of 12 without any error?

ShenghaiRong avatar Apr 10 '18 14:04 ShenghaiRong

No, I got the same error. Maybe it's due to https://github.com/melodyguan/enas/blob/master/src/cifar10/general_child.py#L406, which doesn't implement the pooling operations.
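A minimal sketch of the suspected failure mode (my own hypothetical simplification, not the actual general_child.py code): if the op dispatch in _fixed_layer only assigns `out` for the convolution branches, then a pooling op id leaves `out` unbound and the later use raises exactly this UnboundLocalError.

```python
# Hypothetical simplification of _fixed_layer's op dispatch: only conv
# branches assign `out`; pooling op ids (assumed here to be 4/5) fall through.
def fixed_layer(op_id, x):
    if op_id == 0:        # conv 3x3
        out = ("conv3", x)
    elif op_id == 1:      # conv 5x5
        out = ("conv5", x)
    # no branch for op_id 4/5 (avg/max pool) -> `out` is never assigned
    return out            # UnboundLocalError for pooling op ids

try:
    fixed_layer(4, "input")
except UnboundLocalError as e:
    err = str(e)
print(err)
```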

zeus7777777 avatar Apr 10 '18 15:04 zeus7777777

@sweetdream33 @hyhieu Thanks for your work. I have successfully run all the .sh files, but I'm confused by all the parameters. Would you please give an exact explanation of them? Thanks very much.

DEFINE_integer("batch_size", 32, "")
DEFINE_integer("num_epochs", 300, "")
DEFINE_integer("child_lr_dec_every", 100, "")
DEFINE_integer("child_num_layers", 5, "")
DEFINE_integer("child_num_cells", 5, "")
DEFINE_integer("child_filter_size", 5, "")
DEFINE_integer("child_out_filters", 48, "")
DEFINE_integer("child_out_filters_scale", 1, "")
DEFINE_integer("child_num_branches", 4, "")
DEFINE_integer("child_num_aggregate", None, "")
DEFINE_integer("child_num_replicas", 1, "")
DEFINE_integer("child_block_size", 3, "")
DEFINE_integer("child_lr_T_0", None, "for lr schedule")
DEFINE_integer("child_lr_T_mul", None, "for lr schedule")
DEFINE_integer("child_cutout_size", None, "CutOut size")
DEFINE_float("child_grad_bound", 5.0, "Gradient clipping")
DEFINE_float("child_lr", 0.1, "")
DEFINE_float("child_lr_dec_rate", 0.1, "")
DEFINE_float("child_keep_prob", 0.5, "")
DEFINE_float("child_drop_path_keep_prob", 1.0, "minimum drop_path_keep_prob")
DEFINE_float("child_l2_reg", 1e-4, "")
DEFINE_float("child_lr_max", None, "for lr schedule")
DEFINE_float("child_lr_min", None, "for lr schedule")
DEFINE_string("child_skip_pattern", None, "Must be ['dense', None]")
DEFINE_string("child_fixed_arc", None, "")
DEFINE_boolean("child_use_aux_heads", False, "Should we use an aux head")
DEFINE_boolean("child_sync_replicas", False, "To sync or not to sync.")
DEFINE_boolean("child_lr_cosine", False, "Use cosine lr schedule")

DEFINE_float("controller_lr", 1e-3, "")
DEFINE_float("controller_lr_dec_rate", 1.0, "")
DEFINE_float("controller_keep_prob", 0.5, "")
DEFINE_float("controller_l2_reg", 0.0, "")
DEFINE_float("controller_bl_dec", 0.99, "")
DEFINE_float("controller_tanh_constant", None, "")
DEFINE_float("controller_op_tanh_reduce", 1.0, "")
DEFINE_float("controller_temperature", None, "")
DEFINE_float("controller_entropy_weight", None, "")
DEFINE_float("controller_skip_target", 0.8, "")
DEFINE_float("controller_skip_weight", 0.0, "")
DEFINE_integer("controller_num_aggregate", 1, "")
DEFINE_integer("controller_num_replicas", 1, "")
DEFINE_integer("controller_train_steps", 50, "")
DEFINE_integer("controller_forwards_limit", 2, "")
DEFINE_integer("controller_train_every", 2, "train the controller after this number of epochs")
DEFINE_boolean("controller_search_whole_channels", False, "")
DEFINE_boolean("controller_sync_replicas", False, "To sync or not to sync.")
DEFINE_boolean("controller_training", True, "")
DEFINE_boolean("controller_use_critic", False, "")

DEFINE_integer("log_every", 50, "How many steps to log")
DEFINE_integer("eval_every_epochs", 1, "How many epochs to eval")

axiniu avatar Apr 25 '18 02:04 axiniu

@sweetdream33 I see you got test_accuracy: 0.9605, similar to the paper. What fixed_arc did you use, or did you just run cifar10_macro_final.sh without modification?

yogurfrul avatar Sep 12 '18 08:09 yogurfrul

@zeus7777777 Did you end up fixing that problem? After I added pooling layers at those branches, I got an error regarding the output channels. I'm not sure whether we should *= 2 the output filters, like in the search code path.

matthewygf avatar Jul 02 '19 15:07 matthewygf