minbpe
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
@karpathy, thank you so much for the implementation. I have added a `setup.py` to make installation easier (preferably with pip). Can we add this, please?
Automate testing by creating a workflow that runs the tests on Windows, macOS, and Ubuntu with Python 3.9, 3.10, 3.11, and 3.12.
@karpathy, thank you for another interesting educational project! This MR introduces a `pyproject.toml` file to handle project metadata and dependencies in accordance with [PEP-621](https://peps.python.org/pep-0621/)[^1]. Using it with [`pdm`](https://pdm-project.org/)[^2] and its...
Since there are significant concerns about handling ``, there should be an example program that shows how to properly prepare input text for that case and pass to the train...
(will add to `minbpe-doc`) ```mermaid graph TD; classDef success fill:#5CB85C,stroke:#fff,color:#fff; classDef progress fill:#428BCA,stroke:#fff,color:#fff; classDef pending fill:#F0AD4E,stroke:#fff,color:#fff; VideoScript[Video Scripts] --> OutlineGeneration[Generate Outline]; subgraph IntegrationAndPreparation_Claude VideoScript --> PromptConstruction[Construct Prompt]; OutlineGeneration --> PromptConstruction;...
Hi Mentor Karpathy, I was wondering whether minbpe can be scaled to support tokenizing video frames into embedded patches, say as proposed in Sora's technical report and the ViT paper -...
Maybe I am completely wrong, but to me using something like BPE to build an encoding for text feels stupid. Sure, it is a fairly easy way and it will...
I just noticed that counting pairs might be slightly inaccurate when a token repeats many times in a row. For example, in the sequence 1, 1, 1, 1 the pair (1, 1) gets...
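A minimal sketch of the effect described above, assuming pairs are counted by sliding over adjacent positions and merged left to right without overlap. The function names `count_adjacent_pairs`, `merge_pair`, and `count_non_overlapping_pairs` are hypothetical and written for illustration; they are not taken from the minbpe source.

```python
from collections import Counter

def count_adjacent_pairs(ids):
    """Count every adjacent pair, including overlapping occurrences."""
    return Counter(zip(ids, ids[1:]))

def merge_pair(ids, pair, new_id):
    """Replace non-overlapping occurrences of `pair`, scanning left to right."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2  # consume both tokens of the merged pair
        else:
            out.append(ids[i])
            i += 1
    return out

def count_non_overlapping_pairs(ids):
    """Like count_adjacent_pairs, but skip a same-token pair (x, x)
    that overlaps the occurrence counted immediately before it."""
    counts = Counter()
    skip = False
    for a, b in zip(ids, ids[1:]):
        if a == b and skip:
            skip = False
            continue
        counts[(a, b)] += 1
        skip = (a == b)
    return counts

ids = [1, 1, 1, 1]
naive = count_adjacent_pairs(ids)[(1, 1)]            # 3 overlapping occurrences
merged = merge_pair(ids, (1, 1), 256)                 # [256, 256]: only 2 merges happen
adjusted = count_non_overlapping_pairs(ids)[(1, 1)]   # 2, matching the merges
print(naive, merged, adjusted)
```

Under these assumptions the sliding count reports 3 for the pair (1, 1), while a left-to-right merge can apply it only twice. Whether the difference matters in practice depends on how often long runs of one token appear in the training text.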
Accidentally encountered a `ValueError: max() arg is an empty sequence` when attempting to test on a small piece of text with a (maybe) large `vocab_size`.
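One plausible cause, assuming the training loop picks the most frequent pair with `max()` on every iteration: once a short text has been merged down to a single token, no pairs remain and `max()` receives an empty collection. The sketch below uses hypothetical names (`train_bpe`, `merge_pair`), is not the minbpe implementation, and stops early instead of raising.

```python
from collections import Counter

def merge_pair(ids, pair, new_id):
    """Replace non-overlapping occurrences of `pair`, scanning left to right."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_size):
    """Byte-level BPE training sketch that tolerates an oversized vocab_size."""
    assert vocab_size >= 256
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, vocab_size):
        stats = Counter(zip(ids, ids[1:]))
        if not stats:
            # Fewer than two tokens left: nothing to merge, so stop early
            # rather than calling max() on an empty collection.
            break
        pair = max(stats, key=stats.get)
        ids = merge_pair(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges

# A 4-byte text cannot support 44 merges; training ends after 3.
final_ids, learned_merges = train_bpe("aaab", vocab_size=300)
print(final_ids, len(learned_merges))
```

Stopping early leaves the vocabulary smaller than requested. An alternative design would be to raise a descriptive error so the caller knows the requested `vocab_size` was not reached.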
@karpathy, first of all, thank you so much for sharing your knowledge. I updated the initialization of `self.vocab` because I don't think we need to call `self._build_vocab()`. I also cleaned...