Andrej
Sure, if you can help with any of the above I'm happy to link to it. Ty!
Hey! Neat, I'll take a look :) Btw I noticed that in this line: https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py#L97 the way you're merging during inference is by checking the rank of the merged...
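(For context, the merging being discussed is the standard rank-based BPE inference loop: repeatedly merge the adjacent pair with the lowest rank until no known pair remains. A minimal sketch, in the minbpe style where `merges` maps a pair to its new token id and a lower id means an earlier merge; this is not the tiktoken implementation, and the names are illustrative:)

```python
def encode_merge(ids: list[int], merges: dict[tuple[int, int], int]) -> list[int]:
    """Apply BPE merges to a list of token ids, lowest-rank pair first."""
    while len(ids) >= 2:
        # find the adjacent pair with the lowest rank (earliest-learned merge)
        pairs = {(ids[i], ids[i + 1]) for i in range(len(ids) - 1)}
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no known merges left to apply
        new_id = merges[pair]
        # replace every occurrence of `pair` with its merged token id
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids
```

For example, with `merges = {(1, 2): 256, (256, 3): 257}`, encoding `[1, 2, 3]` first merges `(1, 2)` into `256`, then `(256, 3)` into `257`.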
Thank you for the pointers! So I think we should retain the nice and clean and minimal and inefficient version as an algorithmic (and unit test) reference, but in addition...
One additional model I'm thinking about is what I did with [llama2.c](https://github.com/karpathy/llama2.c), where instead of reviewing tons of PRs for the root repo, it is treated more as a reference...
Yeah definitely, an optimized version of the code (that does not yet exist) would absolutely have to worry about this.
Are we sure these are equivalent?
Is this part of some official tokenizers API somewhere that you're trying to match? Otherwise, if it's just a 2-line wrapper, it's best done outside, manually in code?
kek is not a typo kek
Ok I'll step through this soon to take a look. Not sure that I love duplicating everything and creating torch versions of it. Would we be able to potentially isolate...
Did you write this out yourself? :)