pytorch-llama
LLaMA 2 implemented from scratch in PyTorch
First of all, thank you for the great resources and YouTube videos. I wanted to point out that in slide 25 of the Llama notes, regarding the computationally efficient realization...
What is the minimal hardware that can be used for inference only? I have Ubuntu with a 3060 GPU (8 GB). Can I use it?
Will it work on Windows with only a CPU?
https://github.com/hkproj/pytorch-llama/blob/067f8a37fe36ac8b52dca9cc6f2a2e8d6aa372d6/model.py#L230-L235 Is there any need to call the forward method explicitly? I mean, we could call the nn.Module directly:

```python
h = x + self.attention(self.attention_norm(x), start_pos, freqs_complex)
out = h + self.feed_forward(self.ffn_norm(h))
```
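A minimal standalone sketch (using a hypothetical `nn.Linear` layer, not the repo's attention block) showing that the direct call and the explicit `.forward()` call compute the same thing; the direct call goes through `nn.Module.__call__`, which also runs any registered hooks, so it is the idiomatic choice:

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 4)
x = torch.randn(2, 4)

out_direct = layer(x)            # idiomatic: __call__ dispatches to forward() and runs hooks
out_forward = layer.forward(x)   # same math, but bypasses hooks; generally discouraged

assert torch.allclose(out_direct, out_forward)
```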
Hello, could you please advise me on how to disable the KV cache? I would also appreciate any guidance on how to implement this change in code. Thank you for...
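A minimal sketch of one way to run without the cache, assuming a hypothetical `model(tokens, start_pos)` interface like this repo's, with the one-token-at-a-time (`seq_len == 1`) assertion lifted and the `cache_k` / `cache_v` writes in `SelfAttention` bypassed: re-run the model on the full prefix at every step so nothing is stored or reused:

```python
import torch

# Sketch only: `model` is hypothetical, not the repo's exact API.
@torch.inference_mode()
def generate_no_cache(model, tokens: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    for _ in range(max_new_tokens):
        # Feed the whole sequence with start_pos=0 each step, so keys/values
        # are recomputed from scratch instead of read from a cache.
        logits = model(tokens, start_pos=0)                      # (batch, seq_len, vocab)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```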
The mask in https://github.com/hkproj/pytorch-llama/blob/067f8a37fe36ac8b52dca9cc6f2a2e8d6aa372d6/inference.py#L121 should be `~mask`, since we want to select all those indices where the value is less than p.
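For reference, a minimal sketch of top-p (nucleus) sampling following the upstream Llama reference implementation: here `mask` marks entries to discard (tokens whose cumulative probability mass before them already exceeds p), so inversion is only needed if the mask is instead used to select entries to keep:

```python
import torch

def sample_top_p(probs: torch.Tensor, p: float) -> torch.Tensor:
    # Sort probabilities in descending order and take the cumulative sum.
    probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
    probs_sum = torch.cumsum(probs_sort, dim=-1)
    # True -> token lies outside the nucleus (mass before it already > p).
    mask = probs_sum - probs_sort > p
    probs_sort[mask] = 0.0                                   # discard, don't keep
    probs_sort.div_(probs_sort.sum(dim=-1, keepdim=True))    # renormalize
    next_token = torch.multinomial(probs_sort, num_samples=1)
    return torch.gather(probs_idx, -1, next_token)           # map back to vocab ids
```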
Why is a causal attention mask not used?
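A possible reason, assuming the repo's one-token-at-a-time decoding: with a KV cache the single query can only attend to already-cached (past) positions, so a causal mask is unnecessary. A mask only matters when `seq_len > 1`, as in this small sketch:

```python
import torch

seq_len = 5
# Standard additive causal mask: -inf above the diagonal, 0 elsewhere.
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
# mask[i, j] == -inf for j > i: position i cannot attend to future positions.
```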
In your source code, the first time through forward you use tokens from 0:1 in each batch; this is not true for LLaMA 2 (decoder-only). LLaMA 2 inference can be divided into prefill...
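A minimal sketch of that prefill/decode split, assuming a hypothetical `model(tokens, start_pos)` interface like this repo's (and that the one-token-at-a-time restriction is lifted for the prefill pass): the whole prompt is processed once to populate the KV cache, then tokens are generated one at a time:

```python
import torch

# Sketch only: `model` is hypothetical, not the repo's exact API.
@torch.inference_mode()
def generate(model, prompt: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    prompt_len = prompt.shape[1]
    # Prefill: one forward pass over the full prompt fills the KV cache.
    logits = model(prompt, start_pos=0)                      # (batch, prompt_len, vocab)
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    out = [next_token]
    # Decode: feed one token at a time; past keys/values come from the cache.
    for pos in range(prompt_len, prompt_len + max_new_tokens - 1):
        logits = model(next_token, start_pos=pos)            # (batch, 1, vocab)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        out.append(next_token)
    return torch.cat([prompt] + out, dim=1)
```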