Results 121 comments of Blake

Thanks for the insight. Also, very impressive work.

Pretty sure the answer is no due to how positional encoding is done.
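To illustrate the point about positional encoding: with learned absolute position embeddings (as in GPT-J-style models), the position table has a fixed number of rows, so positions past the trained maximum simply have no embedding to look up. A minimal sketch, with hypothetical names and sizes:

```python
import numpy as np

# Hypothetical learned positional-embedding table: one row per position,
# sized at training time. Positions >= max_len have no learned row.
max_len, d_model = 2048, 8
pos_table = np.random.randn(max_len, d_model)

def positional_embedding(position):
    """Look up the embedding for a position; fails past the trained range."""
    if position >= max_len:
        raise IndexError(f"position {position} exceeds table size {max_len}")
    return pos_table[position]

positional_embedding(2047)      # fine: last trained position
try:
    positional_embedding(2048)  # one past the end: no learned row exists
except IndexError as e:
    print(e)
```

This is why extrapolating beyond the trained context length generally does not work for absolute position embeddings, whereas relative schemes behave differently.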

@jon-tow So models like GPT-J can be fine-tuned to generate more than their sequence length? Whenever I try to generate longer sequences with GPT-J, I have issues. Maybe that is something...

@jon-tow Will using that prompt format help with the base model? Or perhaps you are talking about the tuned model?

Had the same thought. Have you figured it out? I didn't see anything in the paper either. If you want to add new tokens, you need to target the lm_head...
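A rough sketch of what "targeting the lm_head" involves when adding new tokens: both the input embedding matrix and the output projection (lm_head) are sized by the vocabulary, so both need extra rows for the new tokens. The names and sizes below are illustrative, not taken from any particular codebase:

```python
import numpy as np

# Hypothetical vocab/model sizes; small init for the new rows is a common choice.
vocab, d_model, new_tokens = 50257, 16, 3
embed = np.random.randn(vocab, d_model)    # token id -> input vector
lm_head = np.random.randn(vocab, d_model)  # hidden vector -> logit per token

extra_in = np.random.randn(new_tokens, d_model) * 0.02
extra_out = np.random.randn(new_tokens, d_model) * 0.02
embed = np.vstack([embed, extra_in])       # grow input embeddings
lm_head = np.vstack([lm_head, extra_out])  # grow output projection too

hidden = np.random.randn(d_model)
logits = lm_head @ hidden                  # new tokens now receive logits
assert logits.shape == (vocab + new_tokens,)
```

If only the input embeddings were resized, the model could read the new tokens but never predict them, since the lm_head would still produce logits only for the old vocabulary.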

@artidoro @TimDettmers some insight on this would be greatly appreciated.

I am installing triton with the following inside a docker container: ```pip install triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python``` I am also using flash-attn==1.0.5. For generating 2048 tokens on my RTX 3090, it's actually seemingly...

@abhi-mosaic I changed my approach and I am no longer installing flash attention separately, but rather installing the needed code from source using the ```pip install -e ".[gpu]"``` method....

Using an input of 1500 tokens and generating the remaining 548, I got a generation time of 14.4 seconds for the torch implementation and a time of 16 seconds when using...
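For reproducibility, a timing comparison like the one above can be done with a small harness; `generate_fn` below is a hypothetical stand-in for a model's generate call, not any specific library API:

```python
import time

def time_generation(generate_fn, prompt_tokens, max_new_tokens):
    """Run one generation call and return (output, elapsed seconds)."""
    start = time.perf_counter()
    out = generate_fn(prompt_tokens, max_new_tokens)
    return out, time.perf_counter() - start

# Placeholder "model": appends max_new_tokens dummy tokens to the prompt.
def dummy_generate(prompt, n):
    return prompt + list(range(n))

# 1500-token prompt, 548 new tokens -> 2048 total, mirroring the setup above.
out, elapsed = time_generation(dummy_generate, list(range(1500)), 548)
print(len(out), f"{elapsed:.4f}s")
```

Running each implementation through the same harness with identical prompts keeps the comparison fair; wall-clock differences of a second or two can otherwise come from warm-up or caching.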

Yes, thank you! Perhaps adding a link to the README would be a good idea for others?