llm.c
Add `decode_gpt2.c` for decoding in C
Summary
- add `decode_gpt2.c`, which is generated by `dev/decoder/decoder_gen.py`:
```c
char* decode(int key) {
    static char s[][7558] = {
        { '!', '\0' },
        { '"', '\0' },
        { '#', '\0' },
        { '$', '\0' },
        { '%', '\0' },
        // ... remaining vocab entries elided ...
    };
    int numKeys = sizeof(s) / sizeof(s[0]);
    if (key < 0 || key >= numKeys) {
        return "";
    }
    return s[key];
}
```
- use `decode` in `train_gpt2.c` and `train_gpt2.cu` (see the sketch after this list)
- add `check_decoder` in `test_gpt2.c` and `test_gpt2.cu`
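A rough sketch of how the generated `decode` function might be called when printing sampled tokens in `train_gpt2.c`; the surrounding names (`gen_tokens`, `num_tokens`, `print_generated`) are illustrative assumptions, not the PR's exact diff:

```c
// Illustrative only: printing a sampled sequence using the generated decode().
// Variable and function names here are assumptions, not the PR's exact code.
#include <stdio.h>

char* decode(int key); // provided by decode_gpt2.c

void print_generated(const int* gen_tokens, int num_tokens) {
    for (int t = 0; t < num_tokens; t++) {
        // map each token id back to its UTF-8 string and print it
        printf("%s", decode(gen_tokens[t]));
    }
    printf("\n");
}
```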
Testing
- Tested the C version on a 2023 MacBook Pro (Apple M2 Max)
- Tested the CUDA version on Google Colab with a T4 GPU
@karpathy
Appreciate the effort you've put into creating this repository.
I've noticed that our current setup requires decoding to be done through Python, which felt a bit cumbersome. To address this, I created this PR, which introduces a Python script capable of generating a `decode_gpt2.c` file directly from `tiktoken`.
Could you please review the changes when you have a moment? Thank you!
Oh wow, ok I see.
So you're autogenerating the code directly (1).
Whereas with (2), I was thinking we would save the Tokenizer in some serialized format and then load it from C, malloc all the memory there, etc. The tokenizer could be similar to the one in llama2.c.
Any thoughts on the tradeoffs between these two approaches? It feels better to go down the path of (2)?
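For reference, a minimal sketch of what approach (2) could look like in C, assuming a hypothetical binary layout (a `uint32` vocab size followed by length-prefixed token bytes); this is only an illustration of the llama2.c-style idea, not the format llm.c ultimately adopted:

```c
// Sketch of approach (2): read a serialized vocab from disk and malloc the
// token strings in C. The file layout is an assumption for illustration;
// error handling of fread is omitted for brevity.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    uint32_t vocab_size;
    char **token_table; // token_table[i] holds the bytes for token id i
} Tokenizer;

void tokenizer_init(Tokenizer *t, const char *path) {
    FILE *f = fopen(path, "rb");
    if (f == NULL) { fprintf(stderr, "could not open %s\n", path); exit(1); }
    fread(&t->vocab_size, sizeof(uint32_t), 1, f);
    t->token_table = (char **)malloc(t->vocab_size * sizeof(char *));
    for (uint32_t i = 0; i < t->vocab_size; i++) {
        unsigned char length;
        fread(&length, sizeof(unsigned char), 1, f);
        t->token_table[i] = (char *)malloc(length + 1);
        fread(t->token_table[i], 1, length, f);
        t->token_table[i][length] = '\0';
    }
    fclose(f);
}

const char *tokenizer_decode(const Tokenizer *t, uint32_t token_id) {
    return (token_id < t->vocab_size) ? t->token_table[token_id] : "";
}
```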
@karpathy
Great point! I actually thought about this, but I opted for code generation (1) with a primary focus on ensuring transparency and educational value.
Integrating the tokenizer logic directly into the C code serves to demystify the decoding process, making it readily accessible and transparent to anyone interested in seeing the details.
Exploring the tokenizer in this manner reveals intriguing details, such as the presence of various unique tokens (e.g. —, ——, //, ////), which are quite interesting to observe.
The main trade-off might be maintainability. However, given that the decoding logic is stable and changes are rare, it should not be a concern.
What do you think?
Another thing to keep in mind is this will bloat the repo size because we codegen the vocab into a root .c file. Still leaning to (2) but will sleep on it :)
@karpathy, I've reconsidered the approach and now believe that using a readable serialization would be better. Consequently, I've updated the code to produce a `decode_gpt2.txt` file that is both readable and loadable.
I've tested it again on my M2 MacBook and a Colab instance with T4 GPU.
Looking forward to your feedback!
I'll take a look. I don't want to bloat the repo with intermediate files though, so I'd suggest we don't push the vocab to the repo, only the script that generates it.
Tbh this looks complex. Just naively, shouldn't we just dump the UTF-8 encoded vocab into one single long byte sequence, delimited by `\0`, and just read that in in C?
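To make the suggestion concrete, here is a rough sketch of reading such a `\0`-delimited blob in C, assuming a hypothetical file that simply concatenates the 50,257 token strings; as the reply below points out, this treats tokens as NUL-terminated strings and therefore breaks for any token that itself contains a zero byte:

```c
// Sketch of the "one long \0-delimited byte sequence" idea. The file layout is
// an assumption; tokens containing a literal '\0' would break this scheme.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define VOCAB_SIZE 50257 // GPT-2 vocab size

char **load_vocab(const char *path) {
    FILE *f = fopen(path, "rb");
    if (f == NULL) { fprintf(stderr, "could not open %s\n", path); exit(1); }
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    char *blob = (char *)malloc(size);
    fread(blob, 1, size, f);
    fclose(f);
    // split the blob on '\0': entry i starts right after the (i-1)-th terminator
    char **vocab = (char **)malloc(VOCAB_SIZE * sizeof(char *));
    char *p = blob;
    for (int i = 0; i < VOCAB_SIZE; i++) {
        vocab[i] = p;
        p += strlen(p) + 1; // advance past this token's terminating '\0'
    }
    return vocab;
}
```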
> I don't want to bloat the repo with intermediate files though, so I'd suggest we don't push the vocab to the repo, only the script that generates it.
Makes sense, I can rename the script to `prepro_decoder.py` and generate the vocab in the `data` folder.
> shouldn't we just dump the UTF-8 encoded vocab into one single long byte sequence, delimited by `\0`, and just read that in in C?
I was trying to ensure readability by maintaining a straightforward `<line number - 1> : <token>` mapping. The complexity comes from escaping non-printable characters.
Your suggestion would be slightly simpler, but it can't eliminate that complexity, because `\0` is also one of the tokens (token id `188`, which seems a little odd), so escaping is still needed.
Given these considerations, I believe the current implementation is a reasonable balance. What are your thoughts?
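For illustration, a small sketch of how the `<line number - 1> : <token>` text format could be loaded in C. The escape scheme assumed here (`\n`, `\\`, `\xHH`) is a guess at the PR's convention, not a description of it, and a token whose unescaped bytes include `'\0'` (such as the token id `188` case above) would additionally need an explicit stored length rather than a NUL-terminated string:

```c
// Sketch of loading decode_gpt2.txt, where line N holds "<N-1> : <escaped token>".
// The escape scheme (\n, \\, \xHH) is an assumption for illustration only.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// unescape a token string in place (ignores the embedded-'\0' problem noted above)
static void unescape(char *s) {
    char *r = s, *w = s;
    while (*r) {
        if (r[0] == '\\' && r[1] == 'x' && r[2] && r[3]) {
            char hex[3] = { r[2], r[3], '\0' };
            *w++ = (char)strtol(hex, NULL, 16);
            r += 4;
        } else if (r[0] == '\\' && r[1] == 'n') {
            *w++ = '\n'; r += 2;
        } else if (r[0] == '\\' && r[1] == '\\') {
            *w++ = '\\'; r += 2;
        } else {
            *w++ = *r++;
        }
    }
    *w = '\0';
}

char **load_decode_txt(const char *path, int vocab_size) {
    FILE *f = fopen(path, "r");
    if (f == NULL) { fprintf(stderr, "could not open %s\n", path); exit(1); }
    char **vocab = (char **)malloc(vocab_size * sizeof(char *));
    char line[1024];
    for (int i = 0; i < vocab_size && fgets(line, sizeof(line), f); i++) {
        line[strcspn(line, "\n")] = '\0'; // strip the trailing newline
        char *sep = strstr(line, " : ");  // "<line number - 1> : <token>"
        char *tok = sep ? sep + 3 : line;
        unescape(tok);
        vocab[i] = strdup(tok);           // copy out the decoded token
    }
    fclose(f);
    return vocab;
}
```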
I pushed the tokenizer I had in mind here: c165855. But I appreciate the work on this! Closing.