pytorch-llama
LLaMA 2 implemented from scratch in PyTorch
First of all, thank you for the great resources and YouTube videos. I wanted to point out that in slide 25 of the Llama notes, regarding the computationally efficient realization...
What is the minimal hardware that can be used for inference only? I have Ubuntu with a 3060 GPU (8 GB). Can I use it?
Will it work on Windows with only a CPU?
https://github.com/hkproj/pytorch-llama/blob/067f8a37fe36ac8b52dca9cc6f2a2e8d6aa372d6/model.py#L230-L235 Is there any need to call the forward method explicitly? I mean, we could call the nn.Module directly:

```python
h = x + self.attention(self.attention_norm(x), start_pos, freqs_complex)
out = h + self.feed_forward(self.ffn_norm(h))
```
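A minimal standalone sketch (using a hypothetical `nn.Linear` layer, not the repo's attention block) showing that the direct call and the explicit `.forward()` call compute the same thing; the direct call goes through `nn.Module.__call__`, which also runs any registered hooks, so it is the idiomatic choice:

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 4)
x = torch.randn(2, 4)

out_direct = layer(x)            # idiomatic: __call__ dispatches to forward() and runs hooks
out_forward = layer.forward(x)   # same math, but bypasses hooks; generally discouraged

assert torch.allclose(out_direct, out_forward)
```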
Hello, could you please advise me on how to disable the KV cache? I would also appreciate any guidance on how to implement this change in code. Thank you for...
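A minimal sketch of one way to run without the cache, assuming a hypothetical `model(tokens, start_pos)` interface like this repo's, with the one-token-at-a-time (`seq_len == 1`) assertion lifted and the `cache_k` / `cache_v` writes in `SelfAttention` bypassed: re-run the model on the full prefix at every step so nothing is stored or reused:

```python
import torch

# Sketch only: `model` is hypothetical, not the repo's exact API.
@torch.inference_mode()
def generate_no_cache(model, tokens: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    for _ in range(max_new_tokens):
        # Feed the whole sequence with start_pos=0 each step, so keys/values
        # are recomputed from scratch instead of read from a cache.
        logits = model(tokens, start_pos=0)                      # (batch, seq_len, vocab)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```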
The mask in https://github.com/hkproj/pytorch-llama/blob/067f8a37fe36ac8b52dca9cc6f2a2e8d6aa372d6/inference.py#L121 should be `~mask`, since we want to select all those indices where the value is less than p.
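For reference, a minimal sketch of top-p (nucleus) sampling following the upstream Llama reference implementation: here `mask` marks entries to discard (tokens whose cumulative probability mass before them already exceeds p), so inversion is only needed if the mask is instead used to select entries to keep:

```python
import torch

def sample_top_p(probs: torch.Tensor, p: float) -> torch.Tensor:
    # Sort probabilities in descending order and take the cumulative sum.
    probs_sort, probs_idx = torch.sort(probs, dim=-1, descending=True)
    probs_sum = torch.cumsum(probs_sort, dim=-1)
    # True -> token lies outside the nucleus (mass before it already > p).
    mask = probs_sum - probs_sort > p
    probs_sort[mask] = 0.0                                   # discard, don't keep
    probs_sort.div_(probs_sort.sum(dim=-1, keepdim=True))    # renormalize
    next_token = torch.multinomial(probs_sort, num_samples=1)
    return torch.gather(probs_idx, -1, next_token)           # map back to vocab ids
```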
Why is a causal attention mask not used?
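A possible reason, assuming the repo's one-token-at-a-time decoding: with a KV cache the single query can only attend to already-cached (past) positions, so a causal mask is unnecessary. A mask only matters when `seq_len > 1`, as in this small sketch:

```python
import torch

seq_len = 5
# Standard additive causal mask: -inf above the diagonal, 0 elsewhere.
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
# mask[i, j] == -inf for j > i: position i cannot attend to future positions.
```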
In your source code, the first time through forward you use tokens from 0:1 in each batch; this is not true for LLaMA 2 (decoder-only). LLaMA 2 inference can be divided into prefill...
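A minimal sketch of that prefill/decode split, assuming a hypothetical `model(tokens, start_pos)` interface like this repo's (and that the one-token-at-a-time restriction is lifted for the prefill pass): the whole prompt is processed once to populate the KV cache, then tokens are generated one at a time:

```python
import torch

# Sketch only: `model` is hypothetical, not the repo's exact API.
@torch.inference_mode()
def generate(model, prompt: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    prompt_len = prompt.shape[1]
    # Prefill: one forward pass over the full prompt fills the KV cache.
    logits = model(prompt, start_pos=0)                      # (batch, prompt_len, vocab)
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    out = [next_token]
    # Decode: feed one token at a time; past keys/values come from the cache.
    for pos in range(prompt_len, prompt_len + max_new_tokens - 1):
        logits = model(next_token, start_pos=pos)            # (batch, 1, vocab)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        out.append(next_token)
    return torch.cat([prompt] + out, dim=1)
```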