sapiens
chore(deps): bump tokenizers from 0.19.1 to 0.20.1
Bumps tokenizers from 0.19.1 to 0.20.1.
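Once the bump is merged, a quick sanity check that the environment actually picked up the new release (a minimal sketch; it only assumes the package is importable in the target environment):

```python
# Verify the installed tokenizers version after the upgrade (expected: 0.20.1).
import tokenizers

print(tokenizers.__version__)
assert tokenizers.__version__ == "0.20.1"
```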
Release notes
Sourced from tokenizers's releases.
Release v0.20.1
What's Changed
The most awaited offset issue with Llama is fixed 🥳
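A quick way to sanity-check the fix is to compare each token's reported offsets against the original string; a minimal sketch, assuming a Llama-style tokenizer can be loaded with Tokenizer.from_pretrained (the model id below is only an example and may be gated):

```python
# Minimal offset sanity check: each offset pair should slice the original
# string to exactly the text the token covers.
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # example model id
text = "Hello world, offsets should line up."
enc = tok.encode(text)
for token, (start, end) in zip(enc.tokens, enc.offsets):
    print(token, repr(text[start:end]))
```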
- Update README.md by @ArthurZucker in huggingface/tokenizers#1608
- fix benchmark file link by @152334H in huggingface/tokenizers#1610
- Bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows by @dependabot in huggingface/tokenizers#1626
- [ignore_merges] Fix offsets by @ArthurZucker in huggingface/tokenizers#1640
- Bump body-parser and express in /tokenizers/examples/unstable_wasm/www by @dependabot in huggingface/tokenizers#1629
- Bump serve-static and express in /tokenizers/examples/unstable_wasm/www by @dependabot in huggingface/tokenizers#1630
- Bump send and express in /tokenizers/examples/unstable_wasm/www by @dependabot in huggingface/tokenizers#1631
- Bump webpack from 5.76.0 to 5.95.0 in /tokenizers/examples/unstable_wasm/www by @dependabot in huggingface/tokenizers#1641
- Fix documentation build by @ArthurZucker in huggingface/tokenizers#1642
- style: simplify string formatting for readability by @hamirmahal in huggingface/tokenizers#1632

New Contributors
- @152334H made their first contribution in huggingface/tokenizers#1610
- @hamirmahal made their first contribution in huggingface/tokenizers#1632

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.20.0...v0.20.1
Release v0.20.0: faster encode, better python support
This release is focused on performance and user experience.
Performance:
First off, we did a bit of benchmarking and found some room for improvement! With a few minor changes (mostly #1587), here is what we get on Llama3 running on g6 instances on AWS (benchmark script: https://github.com/huggingface/tokenizers/blob/main/bindings/python/benches/test_tiktoken.py):
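The linked test_tiktoken.py script is the upstream benchmark; as a rough local stand-in, the sketch below simply times Tokenizer.encode_batch (the model id and corpus are illustrative assumptions, not what the upstream benchmark uses):

```python
# Rough throughput sketch (not the upstream test_tiktoken.py benchmark):
# times Tokenizer.encode_batch and reports millions of tokens per second.
import time
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # example (possibly gated) model id
docs = ["The quick brown fox jumps over the lazy dog. " * 200] * 500

start = time.perf_counter()
encodings = tokenizer.encode_batch(docs)
elapsed = time.perf_counter() - start

total_tokens = sum(len(e.ids) for e in encodings)
print(f"{total_tokens / elapsed / 1e6:.2f} M tokens/s over {total_tokens} tokens")
```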
Python API
We shipped better deserialization errors in general, and support for __str__ and __repr__ for all the objects. This allows for much easier debugging; see this:

```python
>>> from tokenizers import Tokenizer
>>> tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
>>> print(tokenizer)
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, ...}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, "[unused2]":3, "[unused3]":4, ...}))
>>> tokenizer
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, ...}))
```
The pre_tokenizer.Sequence and normalizer.Sequence are also more accessible now:

```python
from tokenizers import normalizers
norm = normalizers.Sequence([normalizers.Strip(), normalizers.BertNormalizer()])
norm[0]
norm[1].lowercase = False
```
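A brief companion sketch for the pre-tokenizer side, assuming pre_tokenizers.Sequence supports the same indexing shown above for normalizers (the specific components are chosen only for illustration):

```python
# Indexing into a pre_tokenizers.Sequence, mirroring the normalizers example above.
from tokenizers import pre_tokenizers

pre = pre_tokenizers.Sequence([pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Digits(individual_digits=True)])
print(pre[0])  # repr of the WhitespaceSplit component
print(pre[1])  # repr of the Digits component
```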
... (truncated)
Commits
- d98298a 0.20.1
- de305f2 update to ubuntu-22.04
- 1053470 use --interpreter ${{ matrix.interpreter || '3.7 3.8 3.9 3.10 3.11 3.12 pypy3...
- f7c33eb add Cargo
- eca17be v 0.20.1-rc1
- 557fde7 style: simplify string formatting for readability (#1632)
- 3d51a16 Fix documentation build (#1642)
- 294ab86 Bump webpack in /tokenizers/examples/unstable_wasm/www (#1641)
- 2b97a5e Bump send and express in /tokenizers/examples/unstable_wasm/www (#1631)
- 077678d Bump serve-static and express in /tokenizers/examples/unstable_wasm/www (#1630)
- Additional commits viewable in compare view
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:
- @dependabot rebase will rebase this PR
- @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
- @dependabot merge will merge this PR after your CI passes on it
- @dependabot squash and merge will squash and merge this PR after your CI passes on it
- @dependabot cancel merge will cancel a previously requested merge and block automerging
- @dependabot reopen will reopen this PR if it is closed
- @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
- @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)