pGit1
My network learned from data. I used the CIFAR 768 model, but: 1. it takes 5 minutes per epoch to train on a P600, and 2. my results were nowhere near...
That is weird. Not sure why that would be. In my example I used Cutout and some other augmentation techniques and was still that far off. If their model is that...
Having a hard time interpreting this cloud with the dots in the middle of it, and I can't find what it means in the paper. As a result I...
@Agent007 I am not sure I am following. So h_i is the concatenated output of h_{i-1}, and the dotted lines represent the concatenated output of h_{i-2}??
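In case it helps make the concatenation concrete, here is a minimal sketch of a node whose input is the concatenation of the two previous hidden states. Everything here is a toy (numpy stand-in, made-up shapes and names), not the paper's actual architecture:

```python
import numpy as np

# Hypothetical sketch: node i receives the CONCATENATION of the two
# previous hidden states h_{i-1} and h_{i-2} (shapes are made up).
hidden = 4
h_im2 = np.ones((1, hidden))        # h_{i-2}, e.g. carried by the dotted line
h_im1 = np.ones((1, hidden)) * 2.0  # h_{i-1}, the direct input

# Concatenate along the feature axis, then apply node i's transform.
x = np.concatenate([h_im1, h_im2], axis=-1)   # shape (1, 2*hidden)
W = np.zeros((2 * hidden, hidden))
np.fill_diagonal(W[:hidden], 1.0)             # toy weight matrix
h_i = np.tanh(x @ W)                          # h_i, shape (1, hidden)
print(x.shape, h_i.shape)                     # (1, 8) (1, 4)
```

So the feature dimension doubles before the node's own transform brings it back down.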
@oobabooga sorry for the bad question, but how do we get these updates? Not sure which library to `pip install`.
Sorry. I don't actually have this repo installed. From my research it looks like the latest iteration of PEFT needs to be pulled down. Thanks for your help! I am going...
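For anyone else landing here, "pulled down" most likely means installing PEFT straight from its GitHub repo via pip. This is my reading of the comment, not something confirmed in this thread:

```shell
# Hedged sketch: install the latest PEFT directly from source.
# Pin a release tag instead if you need a stable version.
pip install git+https://github.com/huggingface/peft.git
```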
@ItsLogic Can you show what your trainer args and hyperparameters are for the 13B training run? My models seem to take WAY longer than 10 hours to train. on...
@ItsLogic never mind. The longer training time definitely stemmed from the cutoff len going from 256 to 512.
@zhangfaen I think ALL of this "supervised" finetuning confusion stems from **annoying** use of terms on the part of the community, as popularized by the "SFT" portion of this paper: https://openreview.net/pdf?id=TG8KACxEON See...
@zhangfaen My above answer is mostly correct. I answered my own question. All these people are doing is next-word prediction in the standard "teacher forcing" setup. It's just all **obfuscated**...
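To make that concrete, here is a minimal sketch of next-token prediction with teacher forcing: the labels are just the input sequence shifted by one position, and the loss is ordinary cross-entropy. Random logits stand in for a real model, and all names here are mine:

```python
import numpy as np

# One tokenized training sequence (toy vocabulary of size 10).
tokens = np.array([2, 5, 7, 1, 4])

# Teacher forcing: at step t the model sees the TRUE tokens 0..t and
# must predict token t+1, so labels are the inputs shifted by one.
inputs = tokens[:-1]   # [2, 5, 7, 1]
labels = tokens[1:]    # [5, 7, 1, 4]

vocab_size = 10
rng = np.random.default_rng(0)
logits = rng.normal(size=(len(inputs), vocab_size))  # stand-in for model output

def cross_entropy(logits, labels):
    # Numerically stable log-softmax, then pick out the label log-probs.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

loss = cross_entropy(logits, labels)
print(loss > 0)  # True: plain next-token cross-entropy, nothing more
```

"SFT" just swaps in instruction/response pairs as the sequences; the objective is unchanged.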