llama2.c
Inference Llama 2 in one file of pure C
Dear Andrej, could you enable Discussions for this repo? That would help folks ask questions in Discussions instead of Issues. For example, I have a few questions regarding running...
- new method to initialize the tokenizer with a given vocab_size
- removed vocab_size from the arguments of build_tokenizer
- applied the changes in run.c, runq.c, test.c
- pass the tokenizer...
Some people (like me) might misread the code and not notice that they need to `pip install -r requirements.txt`; this change will remind the user.
It employs an innovative MoE architecture built on two principal strategies: fine-grained expert segmentation and shared expert isolation. https://github.com/deepseek-ai/DeepSeek-MoE/tree/main https://huggingface.co/deepseek-ai/deepseek-moe-16b-chat
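The two strategies named above can be sketched in a few lines of NumPy. This is an illustration only: the sizes, the single-matrix "experts", and the router are invented for the sketch; DeepSeek's real experts are small FFNs and its routing differs in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_routed, n_shared, top_k = 8, 4, 1, 2

# One weight matrix per expert (a real MoE would use small FFNs here).
routed = [rng.standard_normal((dim, dim)) for _ in range(n_routed)]
shared = [rng.standard_normal((dim, dim)) for _ in range(n_shared)]
gate = rng.standard_normal((dim, n_routed))

def moe_forward(x):
    # Shared expert isolation: these experts run on every token, unrouted.
    out = sum(x @ w for w in shared)
    # Fine-grained segmentation: many small routed experts, top-k per token.
    logits = x @ gate
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    topk = np.argsort(probs, axis=-1)[:, -top_k:]
    for t in range(x.shape[0]):
        for k in topk[t]:
            out[t] += probs[t, k] * (x[t] @ routed[k])
    return out

x = rng.standard_normal((3, dim))
y = moe_forward(x)  # same shape as the input: (3, dim)
```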
Modify the original attention:

```python
class Attention(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        self.n_kv_heads = args.n_heads if args.n_kv_heads is None else args.n_kv_heads
        assert args.n_heads % self.n_kv_heads == 0
        model_parallel_size = 1
        ...
```
allocate only one scaling factor per group
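For context, "one scaling factor per group" refers to group-wise symmetric quantization in the spirit of runq.c: each fixed-size group of weights shares a single float scale instead of one scale per value. A NumPy sketch of the idea, where the group size and memory layout are assumptions for illustration rather than runq.c's exact format:

```python
import numpy as np

GROUP_SIZE = 32  # assumed group size; runq.c's default may differ

def quantize_q8(w, group_size=GROUP_SIZE):
    """Symmetric int8 quantization: one scale per group of weights."""
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one scale per group
    scale[scale == 0] = 1.0                               # avoid divide-by-zero
    q = np.round(w / scale).astype(np.int8)
    return q, scale.squeeze(1)

def dequantize_q8(q, scale):
    return (q.astype(np.float32) * scale[:, None]).reshape(-1)

w = np.linspace(-1.0, 1.0, 64, dtype=np.float32)
q, s = quantize_q8(w)
w2 = dequantize_q8(q, s)
err = np.max(np.abs(w - w2))  # bounded by half a quantization step per group
```

The savings are concrete: with float32 scales and 32-element groups, the per-weight overhead of the scales is 1 byte / 8, versus 4 bytes per weight if every value carried its own scale.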
@karpathy - thank you for the great software. I wrote up a visual walk-through of how it all works in detail. I think I got it all right and am...
Here is the port of llama2.c to pure JavaScript for React Native (mobile): [https://github.com/hootan09/llamajs_rn](https://github.com/hootan09/llamajs_rn)
Hi everyone, I am trying to understand the usage of the "multiple_of" parameter. I understand the purpose of the parameter; however, the code does not seem to be doing what it is supposed...
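For readers landing on this question: in Llama-style model code, `multiple_of` rounds the SwiGLU FFN hidden dimension up to a hardware-friendly multiple. A sketch of the computation as it appears in the reference implementation (the constants 4 and 2/3 are assumed from that code; check model.py for the exact version):

```python
def ffn_hidden_dim(dim: int, multiple_of: int) -> int:
    # Start from 4*dim, shrink by 2/3 (SwiGLU uses three weight matrices
    # instead of two), then round *up* to the nearest multiple of multiple_of.
    hidden_dim = 4 * dim
    hidden_dim = int(2 * hidden_dim / 3)
    hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)
    return hidden_dim

# e.g. dim=288: 4*288=1152, two-thirds -> 768, already a multiple of 32
print(ffn_hidden_dim(288, 32))  # -> 768
```

Note the rounding is an integer ceiling, so the result only changes when `2*4*dim/3` is not already a multiple of `multiple_of`, which may be why the parameter appears to have no effect for some configs.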