transfer-tune option tries only one round for each task
Hello everyone.
We are interested in optimizing vision DNN models for Jetson devices, so we tried to use the TenSet dataset for optimizing DNN models on Jetson Xavier NX. With some modifications to tune_network.py to use auto_scheduler.RPCRunner, we were able to tune networks on Jetson Xavier NX.
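The modification is roughly the following (a sketch only; the RPC tracker address, port, and the device key "xavier-nx" are placeholders for our setup):

```python
from tvm import auto_scheduler

log_file = "xavier-nx-tuning.json"

# Send compiled candidates to the Jetson Xavier NX registered at the RPC tracker
# instead of measuring on the local machine.
runner = auto_scheduler.RPCRunner(
    key="xavier-nx",   # device key used when registering the board with the tracker
    host="0.0.0.0",    # RPC tracker host
    port=9190,         # RPC tracker port
    repeat=3,
    min_repeat_ms=300,
    timeout=30,
)

tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=10000,
    runner=runner,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)
```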
We evaluated the models available in tune_network.py with --n-trials 10000, and we found that Ansor with the TenSet-pretrained model finds better programs than Ansor without the TenSet model within the first few thousand trials, and its final results are slightly better.
We expected that enabling the --transfer-tune option would make the results even better, because --transfer-tune appears to improve the cost model using measurement results from the real device. However, when we tried the --transfer-tune option, it produced slower programs for all models. The results are as follows:
ResNet 18
| compiler | execution time (ms) | # trials |
|---|---|---|
| w/o transfer tune | 7.21 | 10044 |
| w/ transfer tune | 10.24 | 1692 |
ResNet 50
| compiler | execution time (ms) | # trials |
|---|---|---|
| w/o transfer tune | 15.56 | 10044 |
| w/ transfer tune | 22.62 | 1692 |
MobileNet v2
| compiler | execution time (ms) | # trials |
|---|---|---|
| w/o transfer tune | 2.49 | 10048 |
| w/ transfer tune | 2.9 | 2048 |
MobileNet v3
| compiler | execution time (ms) | # trials |
|---|---|---|
| w/o transfer tune | 3.06 | 10048 |
| w/ transfer tune | 3.53 | 3328 |
Wide ResNet 50
| compiler | execution time (ms) | # trials |
|---|---|---|
| w/o transfer tune | 35.32 | 10044 |
| w/ transfer tune | 48.8 | 1692 |
DenseNet 121
| compiler | execution time (ms) | # trials |
|---|---|---|
| w/o transfer tune | 15.61 | 10044 |
| w/ transfer tune | 17.88 | 4604 |
Inception v3
| compiler | execution time (ms) | # trials |
|---|---|---|
| w/o transfer tune | 29.08 | 10015 |
| w/ transfer tune | 45.46 | 3487 |
We use the following commands for evaluation:
```
n_trials=10000
target="cuda -keys=cuda,gpu -arch=sm_72 -max_num_threads=1024 -max_threads_per_block=1024 -registers_per_block=65536 -shared_memory_per_block=49152 -thread_warp_size=32"
target_host="llvm -keys=arm_cpu -mtriple=aarch64-linux-gnu -mattr=+neon"

# w/o transfer tune
python3 tune_network.py --network ${model} --n-trials ${n_trials} --cost-model xgb-no-update --load-model xgb.pkl --target "$target" --target-host "$target_host"

# w/ transfer tune
python3 tune_network.py --network ${model} --n-trials ${n_trials} --cost-model xgb-no-update --transfer-tune --load-model xgb.pkl --target "$target" --target-host "$target_host"
```
To investigate why transfer tuning produced slower results, we read the code related to the --transfer-tune option and found some seemingly strange points in its implementation:
- It tunes each task for only one round, even when we request many more trials. For ResNet 50, normal Ansor with the TenSet model runs 10044 trials, but transfer tune runs only 1692.
- It uses the fine-tuned model only for the last half of the tasks; the first half is always tuned with the given pretrained model.

Could you please tell us the intention behind this implementation, or how to improve the results of transfer tuning?
@ruochen99
Thank you for bringing up this issue! Transfer learning is not a complete feature in our model yet. The purpose of the --transfer-tune option is mostly to test how useful transfer learning is for improving the cost model. We have only tested the effect of transfer learning with a small number of trials, so we ignored the later parts of this procedure. A quick fix for your problem could be changing this line
```
self.num_measures_per_round = min(tune_option.num_measures_per_round, num_measure_trials // len(self.tasks))
```
into
```
self.num_measures_per_round = num_measure_trials // len(self.tasks)
```
After this modification, the algorithm will collect measurement data on the first half of the tasks, train a local model, and use it on the second half of the tasks. However, I'm also uncertain how well transfer learning would perform with a large number of trials.
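To make that flow concrete, here is a rough sketch of the idea (this is not the actual TenSet implementation; the model file, log file, runner, and per-task trial budget are placeholders): measure the first half of the tasks with the pretrained model, fine-tune a local cost model on those measurements, and tune the remaining tasks with the fine-tuned model.

```python
# Sketch only: illustrates the transfer-tuning idea, not the TenSet code itself.
from tvm import auto_scheduler
from tvm.auto_scheduler.cost_model import XGBModel


def transfer_tune(tasks, trials_per_task, runner, pretrained_model_file, log_file):
    # Load the pretrained cost model.
    model = XGBModel()
    model.load(pretrained_model_file)

    half = len(tasks) // 2

    # Phase 1: tune the first half of the tasks with the pretrained model,
    # recording all measurements to the log file.
    for task in tasks[:half]:
        policy = auto_scheduler.SketchPolicy(task, program_cost_model=model)
        tune_option = auto_scheduler.TuningOptions(
            num_measure_trials=trials_per_task,
            runner=runner,
            measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
        )
        task.tune(tune_option, search_policy=policy)

    # Phase 2: fine-tune a local model on the measurements collected above ...
    local_model = XGBModel()
    local_model.update_from_file(log_file)

    # ... and use it to tune the remaining tasks.
    for task in tasks[half:]:
        policy = auto_scheduler.SketchPolicy(task, program_cost_model=local_model)
        tune_option = auto_scheduler.TuningOptions(
            num_measure_trials=trials_per_task,
            runner=runner,
            measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
        )
        task.tune(tune_option, search_policy=policy)
```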