Void Main
Thanks @ictzyqq, that indeed answers my question. In short, the paper counts FLOPs as MACs, so I should remove the 2 (multiply and addition) from `seq_len * 2 *...
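For illustration, here is a tiny sketch of the factor-of-2 difference between FLOPs and MACs for a single matmul (the helper names below are just for this example, not from the paper or the PR):

```python
def matmul_macs(m: int, k: int, n: int) -> int:
    # One multiply-accumulate per output element per reduction step.
    return m * k * n

def matmul_flops(m: int, k: int, n: int) -> int:
    # Each MAC is one multiply plus one add, hence the factor of 2.
    return 2 * matmul_macs(m, k, n)

# e.g. Q @ K^T for one attention head: (seq_len x head_dim) x (head_dim x seq_len)
seq_len, head_dim = 2048, 128
print(matmul_macs(seq_len, head_dim, seq_len))   # MAC count (the paper's convention)
print(matmul_flops(seq_len, head_dim, seq_len))  # FLOP count, exactly 2x the MACs
```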
Hey @michaelroyzen @cameronfr @Anychnn @jinluyang , I got a self-tested working version and opened a pull request with it. Could you guys please take a look? Any chance we could...
Hey community, here are some updates:
- supported bf16
- supported Triton decoupled mode
- verified that LLaMA 65B is working
> Hey, a tutorial on how to run LLaMA with the FasterTransformer backend would be really helpful! Would be happy to contribute.

Sure, will provide a step-by-step tutorial...
> This implementation for llama is very meaningful. Did you test the performance of this? How fast can it be compared with the vanilla transformers API?

I've been...
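For anyone who wants a rough baseline number, here is a minimal sketch of timing the vanilla transformers API (the checkpoint name, prompt, and token counts are placeholders, not the settings used in this PR):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "decapoda-research/llama-7b-hf"  # placeholder LLaMA checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
torch.cuda.synchronize()
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s with the HF transformers baseline")
```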
Hi @byshiue, now that we've made a lot of progress and verified the implementation on many models, would it be possible to get this PR reviewed / merged?
> I noticed that llama_example.cpp generates correct outputs in FP16, while triton does not. Does anyone know why?

@michaelroyzen looks like the root cause is what @yinghai pointed out. merge...
> @void-main Hi, I'm also in Beijing and I'm a developer in AI inference. Could I have your wechat?

Sure, try sending me an email. :-)
Hi @CN-COTER, thanks for the contribution! Really appreciate it! I've checked your code and started a review; could you please take a look? 🎉