No glat_sd arch
Hi Chengyang, thanks for your great code! I'm trying to reproduce the GLAT+DSLP model. I checked the training scripts you provided, but I found no registered model matching "--arch glat_sd" in the code. Should it be "nat_sd_glat"? BTW, what do "ss" and "sd" mean? Does "sd" mean deeply supervised? And how about "ss"? Thanks for your answer!
Hello @bbo0924.
Yes, you are right. It should be nat_sd_glat. Sorry for the mistake, I will fix it. Thanks.
The names ss and sd were used during development, and I should have changed them after writing the paper.
So ss means scheduled sampling, where I mix the ground truth tokens with predicted tokens. The s denotes layer-wise prediction, though I don't really remember why I chose s, and d means deep supervision.
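
To make the token mixing concrete, here is a minimal sketch in plain Python. It is not the code from this repository: `mix_tokens` and `gold_ratio` are hypothetical names, and choosing each position independently with a fixed probability is an assumption made for illustration.

```python
import random


def mix_tokens(gold, predicted, gold_ratio, rng=None):
    """Mix two equal-length token sequences position by position.

    Each position keeps the ground truth token with probability
    `gold_ratio` and the predicted token otherwise.
    Hypothetical helper for illustration only.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not 0.0 <= gold_ratio <= 1.0:
        raise ValueError("gold_ratio must be in [0, 1]")
    rng = rng or random.Random()
    # rng.random() returns a float in [0.0, 1.0), so a ratio of 1.0
    # always keeps the gold token and a ratio of 0.0 never does.
    return [g if rng.random() < gold_ratio else p
            for g, p in zip(gold, predicted)]


gold = ["the", "cat", "sat", "down"]
predicted = ["the", "dog", "sat", "up"]
mixed = mix_tokens(gold, predicted, gold_ratio=0.5, rng=random.Random(0))
```

With `gold_ratio=1.0` the result is the ground truth sequence, and with `gold_ratio=0.0` it is the predicted sequence. Values in between give a mix of the two.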