transfer-tune option tries only one round for each task
Hello everyone.
We are interested in optimizing vision DNN models for Jetson devices, so we tried to use the TenSet dataset for optimizing DNN models on Jetson Xavier NX. With some modifications to tune_network.py to use auto_scheduler.RPCRunner, we were able to tune networks on Jetson Xavier NX.
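The modification is roughly the following (a sketch only; the RPC tracker address, port, and the device key "xavier-nx" are placeholders for our setup):

```python
from tvm import auto_scheduler

log_file = "xavier-nx-tuning.json"

# Send compiled candidates to the Jetson Xavier NX registered at the RPC tracker
# instead of measuring on the local machine.
runner = auto_scheduler.RPCRunner(
    key="xavier-nx",   # device key used when registering the board with the tracker
    host="0.0.0.0",    # RPC tracker host
    port=9190,         # RPC tracker port
    repeat=3,
    min_repeat_ms=300,
    timeout=30,
)

tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=10000,
    runner=runner,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)
```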
We evaluated the models available in tune_network.py with --n-trials 10000, and we found that Ansor with the TenSet-pretrained model finds better programs than Ansor without the TenSet model within the first few thousand trials, and its final results are slightly better.
We expected that enabling the --transfer-tune option would make the results even better, because --transfer-tune appears to improve the cost model using measurement results from the real device. However, when we tried the --transfer-tune option, it produced slower programs for all models. The results are as follows:
ResNet 18
| compiler | execution time (ms) | # trials |
|---|---|---|
| w/o transfer tune | 7.21 | 10044 |
| w/ transfer tune | 10.24 | 1692 |
ResNet 50
| compiler | execution time (ms) | # trials |
|---|---|---|
| w/o transfer tune | 15.56 | 10044 |
| w/ transfer tune | 22.62 | 1692 |
MobileNet v2
| compiler | execution time (ms) | # trials |
|---|---|---|
| w/o transfer tune | 2.49 | 10048 |
| w/ transfer tune | 2.9 | 2048 |
MobileNet v3
| compiler | execution time (ms) | # trials |
|---|---|---|
| w/o transfer tune | 3.06 | 10048 |
| w/ transfer tune | 3.53 | 3328 |
Wide ResNet 50
| compiler | execution time (ms) | # trials |
|---|---|---|
| w/o transfer tune | 35.32 | 10044 |
| w/ transfer tune | 48.8 | 1692 |
DenseNet 121
| compiler | execution time (ms) | # trials |
|---|---|---|
| w/o transfer tune | 15.61 | 10044 |
| w/ transfer tune | 17.88 | 4604 |
Inception v3
| compiler | execution time (ms) | # trials |
|---|---|---|
| w/o transfer tune | 29.08 | 10015 |
| w/ transfer tune | 45.46 | 3487 |
We use the following commands for evaluation:
```
n_trials=10000
target="cuda -keys=cuda,gpu -arch=sm_72 -max_num_threads=1024 -max_threads_per_block=1024 -registers_per_block=65536 -shared_memory_per_block=49152 -thread_warp_size=32"
target_host="llvm -keys=arm_cpu -mtriple=aarch64-linux-gnu -mattr=+neon"

# w/o transfer tune
python3 tune_network.py --network ${model} --n-trials ${n_trials} --cost-model xgb-no-update --load-model xgb.pkl --target "$target" --target-host "$target_host"

# w/ transfer tune
python3 tune_network.py --network ${model} --n-trials ${n_trials} --cost-model xgb-no-update --transfer-tune --load-model xgb.pkl --target "$target" --target-host "$target_host"
```
To investigate why transfer tuning produced slower results, we read the code related to the --transfer-tune option and found some seemingly strange points in its implementation:
- It tunes each task for only one round, even when we request many more trials. For ResNet 50, normal Ansor with the TenSet model runs 10044 trials, but transfer tune runs only 1692.
- It uses the fine-tuned model only for the last half of the tasks; the first half is always tuned with the given pretrained model.

Could you please tell us the intention behind this implementation, or how to improve the results of transfer tuning?
@ruochen99
Thank you for bringing up this issue! Transfer learning is not a complete feature in our model yet. The purpose of the --transfer-tune option is mostly to test how useful transfer learning is for improving the cost model. We have only tested the effect of transfer learning with a small number of trials, so we ignored the later parts of this procedure. A quick fix for your problem could be changing this line
```
self.num_measures_per_round = min(tune_option.num_measures_per_round, num_measure_trials // len(self.tasks))
```
into
```
self.num_measures_per_round = num_measure_trials // len(self.tasks)
```
After this modification, the algorithm will collect measurement data on the first half of the tasks, train a local model, and use it on the second half of the tasks. However, I'm also uncertain how well transfer learning would perform with a large number of trials.
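To make that flow concrete, here is a rough sketch of the idea (this is not the actual TenSet implementation; the model file, log file, runner, and per-task trial budget are placeholders): measure the first half of the tasks with the pretrained model, fine-tune a local cost model on those measurements, and tune the remaining tasks with the fine-tuned model.

```python
# Sketch only: illustrates the transfer-tuning idea, not the TenSet code itself.
from tvm import auto_scheduler
from tvm.auto_scheduler.cost_model import XGBModel


def transfer_tune(tasks, trials_per_task, runner, pretrained_model_file, log_file):
    # Load the pretrained cost model.
    model = XGBModel()
    model.load(pretrained_model_file)

    half = len(tasks) // 2

    # Phase 1: tune the first half of the tasks with the pretrained model,
    # recording all measurements to the log file.
    for task in tasks[:half]:
        policy = auto_scheduler.SketchPolicy(task, program_cost_model=model)
        tune_option = auto_scheduler.TuningOptions(
            num_measure_trials=trials_per_task,
            runner=runner,
            measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
        )
        task.tune(tune_option, search_policy=policy)

    # Phase 2: fine-tune a local model on the measurements collected above ...
    local_model = XGBModel()
    local_model.update_from_file(log_file)

    # ... and use it to tune the remaining tasks.
    for task in tasks[half:]:
        policy = auto_scheduler.SketchPolicy(task, program_cost_model=local_model)
        tune_option = auto_scheduler.TuningOptions(
            num_measure_trials=trials_per_task,
            runner=runner,
            measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
        )
        task.tune(tune_option, search_policy=policy)
```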