Yi-Chen (Howard) Lo

Results: 89 comments by Yi-Chen (Howard) Lo

### How to Tune Hyperparameters Gamma and C? (Response by Christopher P. Burgess) Gamma sets the strength of the penalty for deviating from the target KL, C. Here you want...
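To make the tuning target concrete, here is a minimal PyTorch sketch (not Burgess's code; `gamma`, `c_max`, and the annealing schedule are illustrative placeholders) of a capacity-constrained objective in which the KL term is penalized for deviating from a target capacity C:

```python
# Minimal sketch of a capacity-constrained beta-VAE objective:
# the KL term is pushed toward a target capacity C, and gamma scales
# the penalty for deviating from it. Values below are illustrative.
import torch

def capacity_vae_loss(recon_loss, kl, step, gamma=1000.0,
                      c_max=25.0, c_anneal_steps=100_000):
    """recon_loss, kl: scalar tensors averaged over the batch.
    C is linearly annealed from 0 to c_max over c_anneal_steps."""
    c = min(c_max, c_max * step / c_anneal_steps)  # current target KL (nats)
    return recon_loss + gamma * torch.abs(kl - c)

# A large gamma keeps the KL close to the current capacity target.
loss = capacity_vae_loss(torch.tensor(120.0), torch.tensor(3.2), step=10_000)
```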

### Summary This paper purposes to train neural machine translation (NMT) model **from scratch** on both bilingual and target-side monolingual data in a multi-task setting by modifying the decoder in...

### Summary Neural text generation models are typically auto-regressive, trained by maximizing likelihood (i.e., minimizing perplexity) and evaluated with validation perplexity. They claim such training and evaluation could result in poor...
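For reference, the standard recipe the summary refers to is teacher-forced maximum likelihood with perplexity reported as exp(mean token NLL); the sketch below is a generic illustration with made-up variable names, not code from the paper:

```python
# Teacher-forced cross-entropy (maximum likelihood) and
# validation perplexity = exp(mean negative log-likelihood per token).
import math
import torch
import torch.nn.functional as F

def token_nll(logits, targets, pad_id=0):
    """logits: (batch, seq_len, vocab); targets: (batch, seq_len)."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,  # do not count padding tokens
    )

def perplexity(mean_nll):
    return math.exp(mean_nll)

# e.g. a mean NLL of 3.0 nats/token corresponds to a perplexity of ~20.1
print(perplexity(3.0))
```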

### Summary This work can be viewed as a follow-up to Rei (2017) #3, which proposes a semi-supervised framework for multi-task learning, integrating language modeling as an additional objective,...
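As a rough sketch of the joint objective (my paraphrase of the setup, not the authors' code; `lam` and the tensor shapes are assumptions), the supervised tagging loss is combined with auxiliary language-modeling losses that predict the next and previous tokens from the forward and backward RNN states:

```python
# Supervised tagging loss plus forward/backward language-modeling losses,
# weighted by an auxiliary coefficient lam (illustrative value).
import torch
import torch.nn.functional as F

def joint_loss(tag_logits, tags, fwd_lm_logits, next_tokens,
               bwd_lm_logits, prev_tokens, lam=0.1):
    """All logits: (batch, seq_len, num_classes); all targets: (batch, seq_len)."""
    tagging = F.cross_entropy(tag_logits.transpose(1, 2), tags)
    lm_fwd = F.cross_entropy(fwd_lm_logits.transpose(1, 2), next_tokens)
    lm_bwd = F.cross_entropy(bwd_lm_logits.transpose(1, 2), prev_tokens)
    return tagging + lam * (lm_fwd + lm_bwd)
```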

### Summary This paper proposes a novel recurrent, attention- and memory-based neural architecture, the "Memory, Attention and Composition (MAC) cell", for VQA, featuring multi-step reasoning. Three operations in the MAC cell:...

### TL;DR This paper identifies several key obstacles in machine learning research and offers some inspiration for machine learning research that matters, aiming to address the gap between research and...

### Summary This paper is the first to successfully employ a fully convolutional Seq2Seq (Conv Seq2Seq) model (Gehring et al., 2017) for grammatical error correction (GEC), with some improvements:...

### Prior Approaches on Knowledge Distillation Basically, knowledge distillation aims to obtain a smaller student model from a (typically larger) teacher model by matching the information hidden in the models. The...
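As a reference point, the classic logit-matching form of this idea (Hinton et al., 2015) looks like the sketch below; this is a generic illustration of "matching information", not the specific method reviewed here, and `T`/`alpha` are illustrative hyperparameters:

```python
# Blend the hard-label cross-entropy with a KL term that matches the
# student's softened distribution to the teacher's (scaled by T^2).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1.0 - alpha) * soft
```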

### Metadata: Prime-Aware Adaptive Distillation - Authors: Youcai Zhang, Zhonghao Lan, +4, Yichen Wei - Organization: Megvii Inc. & University of Science and Technology of China & Tongji University -...

### Highlights ![](https://cdn-images-1.medium.com/max/1600/1*yd2GPCLG-Ithrj-glB5R-g.png) * This paper incorporates data uncertainty to re-weight samples for distillation. * They conjecture that previous hard-mining (i.e., focusing more on hard samples instead of easier ones)...
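A hedged sketch of the uncertainty-weighted feature-distillation idea described above, following the common aleatoric-uncertainty formulation (the function and variable names are mine, not the paper's released code):

```python
# Each sample gets a predicted variance; the feature-matching error is
# down-weighted for high-uncertainty samples, and a log-variance term
# prevents the trivial solution of predicting infinite uncertainty.
import torch

def uncertainty_weighted_distill(student_feat, teacher_feat, log_var):
    """student_feat, teacher_feat: (batch, dim); log_var: (batch, 1)
    per-sample log variance predicted by a small head on the student."""
    sq_err = (student_feat - teacher_feat).pow(2).sum(dim=1, keepdim=True)
    # Samples with low predicted uncertainty ("prime" samples) get larger weight.
    return (torch.exp(-log_var) * sq_err + log_var).mean()
```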