
Add `decode_gpt2.c` for decoding in C

martin-liu opened this issue 10 months ago

Summary

  • add decode_gpt2.c, which is generated by dev/decoder/decoder_gen.py:
char* decode(int key) {
    // lookup table generated from the GPT-2 vocab: one row of
    // NUL-terminated token bytes per token id
    static char s[][8928] = {
         { '!', '\0' },
         { '"', '\0' },
         { '#', '\0' },
         { '$', '\0' },
         { '%', '\0' },
         ...
    };
    int numKeys = sizeof(s) / sizeof(s[0]);
    // out-of-range ids decode to the empty string
    if (key < 0 || key >= numKeys) {
        return "";
    }
    return s[key];
}
  • use decode in train_gpt2.c and train_gpt2.cu
  • add check_decoder in test_gpt2.c and test_gpt2.cu (a standalone sketch of such a check follows below)
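
(For illustration: a minimal, standalone sketch of what such a check could look like when linked against the generated decode_gpt2.c. The expected strings come straight from the excerpt above; check_decoder here is only an illustrative shape, not the actual code added to test_gpt2.c.)

#include <stdio.h>
#include <string.h>

// provided by the generated decode_gpt2.c
extern char* decode(int key);

// tiny sanity check: per the excerpt above, the first token ids map to
// "!", "\"", "#", "$", "%" in order, so spot-check those
int check_decoder(void) {
    const char* expected[] = { "!", "\"", "#", "$", "%" };
    for (int i = 0; i < 5; i++) {
        if (strcmp(decode(i), expected[i]) != 0) {
            printf("decode(%d) mismatch\n", i);
            return 1;
        }
    }
    printf("decoder ok\n");
    return 0;
}

int main(void) {
    return check_decoder();
}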

Testing

  • Tested the C versions on a 2023 MacBook Pro (Apple M2 Max)
  • Tested the CUDA versions on Google Colab with a T4 GPU

martin-liu · Apr 12 '24 05:04

@karpathy

Appreciate the effort you've put into creating this repository.

I've noticed that the current setup requires decoding to be done through Python, which feels a bit cumbersome. To address this, I've created this PR, which introduces a Python script capable of generating a decode_gpt2.c file directly from tiktoken.

Could you please review the changes when you have a moment? Thank you!

martin-liu · Apr 12 '24 05:04

Oh wow, ok I see.

So you're autogenerating the code directly (1).

As (2), I was thinking we would save the Tokenizer in some serialized format and then load it from C, malloc all the memory there, etc. The tokenizer could be similar to the one in llama2.c.

Any thoughts on the tradeoffs between these two approaches? It feels better to go down the path of (2)?
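
(For illustration: a minimal sketch of what (2) could look like, assuming a hypothetical binary layout of a uint32 token count followed by length-prefixed UTF-8 byte strings; the real format would be whatever the serialization script ends up writing.)

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct {
    uint32_t vocab_size;
    char **token_table;   // token_table[i] is a NUL-terminated copy of token i's bytes
} Tokenizer;

// Assumed file layout (illustrative only): uint32 vocab_size, then for each
// token a uint8 length followed by that many raw UTF-8 bytes.
void tokenizer_init(Tokenizer *t, const char *path) {
    FILE *f = fopen(path, "rb");
    if (f == NULL) { fprintf(stderr, "could not open %s\n", path); exit(1); }
    fread(&t->vocab_size, sizeof(uint32_t), 1, f);
    t->token_table = (char **)malloc(t->vocab_size * sizeof(char *));
    for (uint32_t i = 0; i < t->vocab_size; i++) {
        uint8_t len;
        fread(&len, sizeof(uint8_t), 1, f);
        t->token_table[i] = (char *)malloc(len + 1);
        fread(t->token_table[i], 1, len, f);
        t->token_table[i][len] = '\0';
    }
    fclose(f);
}

const char *tokenizer_decode(const Tokenizer *t, int id) {
    if (id < 0 || (uint32_t)id >= t->vocab_size) return "";
    return t->token_table[id];
}

(Length-prefixing sidesteps the question of delimiter bytes entirely, since a token's bytes are never scanned for a terminator.)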

karpathy · Apr 12 '24 18:04

@karpathy

Great point! I actually thought about this, but I opted for code generation (1) with a primary focus on ensuring transparency and educational value.

Integrating the tokenizer logic directly into the C code serves to demystify the decoding process, making it readily accessible and transparent to anyone interested in seeing the details.

Exploring the tokenizer in this manner reveals intriguing details, such as the presence of various unique tokens (e.g. —, ——, //, ////), which are quite interesting to observe.

The main trade-off might be maintainability. However, given that the decoding logic is stable and changes are rare, it should not be a concern.

What do you think?

martin-liu · Apr 12 '24 21:04

Another thing to keep in mind is that this will bloat the repo size, because we codegen the vocab into a root .c file. Still leaning toward (2) but will sleep on it :)

karpathy · Apr 13 '24 00:04

@karpathy, I've reconsidered the approach and now believe that using a readable serialization would be better.

Consequently, I've updated the code to produce a decode_gpt2.txt file that is both readable and loadable.

I've tested it again on my M2 MacBook and on a Colab instance with a T4 GPU.

Looking forward to your feedback!

martin-liu · Apr 13 '24 04:04

I'll take a look. I don't want to bloat the repo with intermediate files though, so I'd suggest we don't push the vocab to the repo, only the script that generates it.

karpathy · Apr 13 '24 16:04

Tbh this looks complex. Just naively, shouldn't we just dump the UTF-8 encoded vocab into one single long byte sequence, delimited by \0, and just read that in in C?
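
(For illustration: reading such a dump back could look roughly like the sketch below. load_vocab and the file layout are assumptions, and the approach relies on '\0' never appearing inside a token, which is exactly the caveat raised in the reply that follows.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Sketch only: read a file of vocab_size token strings, each terminated by
// '\0', and build an array of pointers into one big buffer.
char **load_vocab(const char *path, int vocab_size) {
    FILE *f = fopen(path, "rb");
    if (f == NULL) { fprintf(stderr, "could not open %s\n", path); exit(1); }
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    char *buf = (char *)malloc(size);
    fread(buf, 1, size, f);
    fclose(f);

    char **vocab = (char **)malloc(vocab_size * sizeof(char *));
    char *p = buf;
    for (int i = 0; i < vocab_size; i++) {
        vocab[i] = p;            // entry i runs up to the next '\0'
        p += strlen(p) + 1;
    }
    return vocab;                // buf stays alive for the lifetime of the program
}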

karpathy · Apr 13 '24 16:04

> I don't want to bloat the repo with intermediate files though, so I'd suggest we don't push the vocab to the repo, only the script that generates it.

Makes sense. I can rename the script to prepro_decoder.py and generate the vocab in the data folder.

> shouldn't we just dump the UTF-8 encoded vocab into one single long byte sequence, delimited by \0, and just read that in in C?

I was trying to ensure readability by maintaining a straightforward `<line number - 1> : <token>` mapping. The complexity comes from escaping non-printable characters. Your suggestion would be slightly simpler, but it would not eliminate the complexity, because \0 is itself one of the tokens (token id 188, which seems a little odd), so escaping is still needed. Given these considerations, I believe the current implementation is a reasonable balance. What are your thoughts?
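
(For illustration: one workable, purely hypothetical escaping scheme would be to write non-printable bytes, including the \0 of token 188, as \xNN in decode_gpt2.txt and unescape them on the C side as sketched below. This is not necessarily what the PR implements.)

// Hypothetical unescaping of one line's <token> field, where non-printable
// bytes were written as \xNN and literal backslashes as \\ .
static int hex_val(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Writes the raw token bytes into out, returns how many bytes were written.
int unescape_token(const char *in, unsigned char *out) {
    int n = 0;
    for (const char *p = in; *p != '\0'; ) {
        if (p[0] == '\\' && p[1] == 'x' && hex_val(p[2]) >= 0 && hex_val(p[3]) >= 0) {
            out[n++] = (unsigned char)(hex_val(p[2]) * 16 + hex_val(p[3]));
            p += 4;
        } else if (p[0] == '\\' && p[1] == '\\') {
            out[n++] = '\\';
            p += 2;
        } else {
            out[n++] = (unsigned char)*p++;
        }
    }
    return n;
}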

martin-liu · Apr 13 '24 21:04

I pushed the tokenizer I had in mind here: c165855. But I appreciate the work on this! Closing.

karpathy · Apr 15 '24 01:04