Blake
Thanks for the insight. Also, very impressive work.
Pretty sure the answer is no due to how positional encoding is done.
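To illustrate the point about positional encoding: a model with a learned positional-embedding table simply has no embeddings for positions beyond its training length. This is only a toy sketch with made-up sizes (2048 and 16 are illustrative, and GPT-J's actual rotary scheme works differently), but it shows why a fixed table caps the sequence length:

```python
import numpy as np

max_seq_len = 2048  # illustrative context length
d_model = 16        # illustrative hidden size

# A learned positional encoding is a fixed-size lookup table:
pos_emb = np.random.randn(max_seq_len, d_model)

def embed_positions(seq_len):
    """Look up positional embeddings; fails past the table size."""
    if seq_len > max_seq_len:
        raise ValueError(
            f"sequence length {seq_len} exceeds table size {max_seq_len}"
        )
    return pos_emb[:seq_len]

embed_positions(2048)      # fine: within the table
# embed_positions(2049)    # would raise: no row exists for position 2048+
```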
@jon-tow So models like GPT-J can be fine-tuned to generate more than their sequence length? Whenever I try to generate longer sequences with GPT-J I run into issues. Maybe that is something...
@jon-tow Will using that prompt format help with the base model? Or perhaps you are talking about the tuned model?
Had the same thought. Have you figured it out? I didn't see anything in the paper either. If you want to add new tokens, you need to target the lm_head...
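On the "target the lm_head" point: when you add tokens, both the input embedding table and the output projection (lm_head) must grow along the vocab dimension, or the logits won't cover the new tokens. A toy numpy sketch of that bookkeeping (the sizes and init scale here are made up, not the paper's method):

```python
import numpy as np

vocab_size, d_model = 50257, 16  # illustrative sizes
rng = np.random.default_rng(0)

# Input embedding and output projection share the vocab dimension:
embed = rng.standard_normal((vocab_size, d_model))
lm_head = rng.standard_normal((vocab_size, d_model))

def add_tokens(embed, lm_head, n_new):
    """Append rows for n_new tokens to both the embedding and the lm_head."""
    new_rows = 0.02 * rng.standard_normal((n_new, embed.shape[1]))
    return (np.vstack([embed, new_rows]),
            np.vstack([lm_head, new_rows.copy()]))

embed2, head2 = add_tokens(embed, lm_head, 3)
```

In Hugging Face transformers, `model.resize_token_embeddings(len(tokenizer))` is, as far as I know, the usual way to have this handled for you.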
@artidoro @TimDettmers some insight on this would be greatly appreciated.
I am installing triton inside a Docker container with: `pip install triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python`. I am also using flash-attn==1.0.5. For generating 2048 tokens on my RTX 3090 it's actually seemingly...
@abhi-mosaic I changed my approach and I am no longer installing flash attention separately, but rather installing the needed code from source using the `pip install -e ".[gpu]"` method...
Using an input of 1500 tokens and generating the remaining 548, I got a generation time of 14.4 seconds for the torch implementation and a time of 16 seconds when using...
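For anyone wanting to reproduce these numbers: a wall-clock timer around the generate call is enough for a rough comparison. This is a minimal sketch; `time_generation` is a hypothetical helper, and the lambda stands in for the actual `model.generate(...)` call:

```python
import time

def time_generation(generate_fn, n_runs=3):
    """Time a generation callable; returns mean wall-clock seconds per run."""
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        generate_fn()  # e.g. lambda: model.generate(input_ids, max_new_tokens=548)
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

# Stand-in workload instead of a real model.generate call:
mean_s = time_generation(lambda: sum(range(10_000)))
```

Averaging over a few runs (and discarding a warm-up run) helps smooth out CUDA kernel-compilation overhead on the first call.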
Yes, thank you! Perhaps adding a link to the README would be a good idea for others?