minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.

Results 50 minbpe issues

@karpathy, thank you so much for the implementation. I have added a `setup.py` to facilitate installation (preferably with pip). Can we add this, please?

Automate testing by creating a workflow that runs the tests on Windows, macOS, and Ubuntu with Python 3.9, 3.10, 3.11, and 3.12.

@karpathy, thank you for another interesting educational project! This MR introduces a `pyproject.toml` file to handle project metadata and dependencies in accordance with [PEP-621](https://peps.python.org/pep-0621/)[^1]. Using it with [`pdm`](https://pdm-project.org/)[^2] and its...

Since there are significant concerns about handling ``, there should be an example program that shows how to properly prepare input text for that case and pass to the train...

(will add to `minbpe-doc`)

```mermaid
graph TD;
    classDef success fill:#5CB85C,stroke:#fff,color:#fff;
    classDef progress fill:#428BCA,stroke:#fff,color:#fff;
    classDef pending fill:#F0AD4E,stroke:#fff,color:#fff;
    VideoScript[Video Scripts] --> OutlineGeneration[Generate Outline];
    subgraph IntegrationAndPreparation_Claude
        VideoScript --> PromptConstruction[Construct Prompt];
        OutlineGeneration --> PromptConstruction;
        ...
```

Hi Mentor Karpathy, I was wondering if minbpe can be scaled to support tokenizing video frames into embedded patches, say, as proposed in Sora's technical report and the ViT paper -...

Maybe I am completely wrong, but to me, using something like BPE to build an encoding for text feels stupid. Sure, it is a fairly easy way and it will...

I just noticed that counting pairs might be slightly inaccurate for a lot of repeating tokens. For example in the sequence 1, 1, 1, 1 the pair (1, 1) gets...
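To make the concern concrete, here is a minimal sketch (not the repository's code; both function names are hypothetical) that compares a plain adjacent-pair count with a count that skips occurrences overlapping the one counted just before. Only pairs of two identical tokens can overlap themselves, so the two counts differ only on runs of a repeated token.

```python
from collections import Counter


def count_overlapping_pairs(ids):
    """Count every adjacent pair, including overlapping occurrences."""
    return Counter(zip(ids, ids[1:]))


def count_mergeable_pairs(ids):
    """Count adjacent pairs, skipping an occurrence that overlaps the
    same pair counted one position earlier (hypothetical helper)."""
    counts = Counter()
    last_counted = {}  # pair -> start index of its last counted occurrence
    for i, pair in enumerate(zip(ids, ids[1:])):
        if last_counted.get(pair) == i - 1:
            continue  # shares a token with the previous counted occurrence
        counts[pair] += 1
        last_counted[pair] = i
    return counts


ids = [1, 1, 1, 1]
naive = count_overlapping_pairs(ids)[(1, 1)]
mergeable = count_mergeable_pairs(ids)[(1, 1)]
print(naive, mergeable)
```

For the sequence 1, 1, 1, 1 the plain count reports the pair (1, 1) three times, while a left-to-right merge can only replace it twice. Whether this difference matters for the final vocabulary depends on the data; the sketch only shows where the two counts diverge.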

I accidentally hit a `ValueError: max() arg is an empty sequence` when testing on a small piece of text with a (possibly too large) `vocab_size`.
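One plausible cause of this kind of error is that a short text collapses to a single token before the requested number of merges is reached, so there are no pairs left to pass to `max()`. The sketch below is a simplified byte-level training loop written for illustration; `train_merges` and `merge` are hypothetical names, not the repository's API, and starting new token ids at 256 is an assumption. It stops early instead of raising.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_merges(ids, num_merges, first_new_id=256):
    """Greedy BPE-style merging that stops when no pairs remain."""
    ids = list(ids)
    merges = {}
    next_id = first_new_id
    for _ in range(num_merges):
        stats = Counter(zip(ids, ids[1:]))
        if not stats:
            break  # fewer than two tokens left; max() on empty stats would raise
        pair = max(stats, key=stats.get)
        ids = merge(ids, pair, next_id)
        merges[pair] = next_id
        next_id += 1
    return ids, merges


# A tiny input with far more merges requested than it can support.
final_ids, learned = train_merges("aaaa".encode("utf-8"), num_merges=10)
print(final_ids, learned)
```

With the guard in place, the call returns after two merges rather than failing on the third iteration. An alternative design would be to raise a clearer error that tells the user the requested vocabulary size is too large for the training text.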

@karpathy First of all, thank you so much for sharing your knowledge. I updated the initialization of `self.vocab` because I don't feel we need to call `self._build_vocab()`. I also cleaned...