Robert Knight
This is now implemented.
`ignore_merges` was added in https://github.com/huggingface/tokenizers/commit/914576f7edcba7116b3d22fd0f2b8f727d652c3a. See also https://github.com/huggingface/tokenizers/pull/1493/files. The [documentation](https://huggingface.co/docs/tokenizers/en/api/models#tokenizers.models.BPE) says:

> ignore_merges (bool, optional) — Whether or not to match tokens with the vocab before using merges.
It turns out that after fixing the data structures used by the BPE encoder in https://github.com/robertknight/rten/pull/1041, `ignore_merges` support is not critical to getting a basic Llama 3 example working, though...
The llama3 tokenizer's vocabulary has 588 tokens out of 128000, or
Taking "Ġsoubor" as an example, the decoded string is " soubor". Taking the token IDs for the individual bytes and applying merge rules, the encoded IDs are [274, 5599, 269]...
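To illustrate why a vocabulary entry can be unreachable through merges alone, here is a minimal sketch of a BPE encoder with an `ignore_merges` flag. It is a toy, not the tokenizers or rten implementation: `bpe_encode`, `VOCAB` and `MERGES` are hypothetical names, the vocabulary is made up, and it works on characters rather than bytes, merging one pair at a time.

```python
# Toy vocabulary: "abc" exists as a token, but no merge rule produces it.
VOCAB = {"a": 0, "b": 1, "c": 2, "ab": 3, "abc": 4}
# Merge rules in priority order (earlier = higher priority).
MERGES = [("a", "b")]


def bpe_encode(word, vocab, merges, ignore_merges=False):
    """Encode `word` into token IDs using a simplified BPE."""
    # With ignore_merges, a word that is already a vocabulary entry is
    # returned directly, before any merge rule is consulted.
    if ignore_merges and word in vocab:
        return [vocab[word]]

    ranks = {pair: rank for rank, pair in enumerate(merges)}
    pieces = list(word)
    while len(pieces) > 1:
        # Collect every adjacent pair that has a merge rule.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(pieces, pieces[1:]))
            if pair in ranks
        ]
        if not candidates:
            break
        # Apply the highest-priority (lowest-rank) merge, leftmost first.
        _, i = min(candidates)
        pieces[i:i + 2] = [pieces[i] + pieces[i + 1]]
    return [vocab[piece] for piece in pieces]


print(bpe_encode("abc", VOCAB, MERGES))                      # [3, 2]
print(bpe_encode("abc", VOCAB, MERGES, ignore_merges=True))  # [4]
```

Without the flag, the encoder stops at `["ab", "c"]` because no rule merges that pair, so the single-token ID is never produced. With the flag, the whole-word lookup happens first.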
Related comment: https://github.com/hypothesis/client/issues/6859#issuecomment-2700638472
I noticed a slight rendering glitch where there are white pixels behind the top-left and top-right rounded corners. I see this in Chrome, Safari and Firefox.
> I noticed a slight rendering glitch where there are white pixels behind the top-left and top-right rounded corners:

Setting `border-radius: 0.25rem` on the outer `popover` element, to match the...
I found an issue where the Popover styling is broken when the browser doesn't support native popovers, tested via `asNativePopover={false}`. Unfortunately [native popovers](https://caniuse.com/?search=popover) are still a little too new for...
The model data this library loads is the same as that used by the C++ Tesseract, so you can load files from https://github.com/tesseract-ocr/tessdata_best for your language.

> How can we...