promptulate
:arrow_up: Bump tokenizers from 0.19.1 to 0.20.0
Bumps tokenizers from 0.19.1 to 0.20.0.
Release notes
Sourced from tokenizers's releases.
Release v0.20.0: faster encode, better python support
Release v0.20.0
This release is focused on performance and user experience.
Performance:
First off, we did a bit of benchmarking and found some room for improvement on our side! With a few minor changes (mostly #1587), here is what we get on Llama 3, running on a g6 instance on AWS (https://github.com/huggingface/tokenizers/blob/main/bindings/python/benches/test_tiktoken.py).
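For a rough sense of how such numbers can be reproduced, here is a minimal, unofficial throughput sketch using the Python bindings. It is not the linked benchmark script, and bert-base-uncased is only a stand-in for the Llama 3 tokenizer used in the release notes:
import time
from tokenizers import Tokenizer

# Stand-in model for illustration; the release benchmarked Llama 3 on a g6 instance.
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
batch = ["The quick brown fox jumps over the lazy dog."] * 10_000

start = time.perf_counter()
encodings = tokenizer.encode_batch(batch)
elapsed = time.perf_counter() - start

total_tokens = sum(len(e.ids) for e in encodings)
print(f"{total_tokens / elapsed:,.0f} tokens/s")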
Python API
We shipped better deserialization errors in general, and support for __str__ and __repr__ for all the objects. This allows for a lot easier debugging; see this:
>>> from tokenizers import Tokenizer;
>>> tokenizer = Tokenizer.from_pretrained("bert-base-uncased");
>>> print(tokenizer)
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, ...}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, "[unused2]":3, "[unused3]":4, ...}))
>>> tokenizer
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, ...}))
The pre_tokenizer.Sequence and normalizer.Sequence are also more accessible now:
from tokenizers import normalizers
norm = normalizers.Sequence([normalizers.Strip(), normalizers.BertNormalizer()])
norm[0]
norm[1].lowercase = False
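The release notes only show the normalizer side; a minimal sketch of the same indexing on the pre-tokenizer side might look like this (assuming tokenizers >= 0.20.0; the specific components are illustrative):
from tokenizers import pre_tokenizers

# Illustrative components; Sequence members can now be accessed by index.
pre = pre_tokenizers.Sequence([pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Punctuation()])
print(pre[0])  # WhitespaceSplit
print(pre[1])  # Punctuation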
What's Changed
- remove enforcement of non special when adding tokens by @ArthurZucker in huggingface/tokenizers#1521
- [BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder by @Narsil in huggingface/tokenizers#1513
- Make USED_PARALLELISM atomic by @nathaniel-daniel in huggingface/tokenizers#1532
- Fixing for clippy 1.78 by @Narsil in huggingface/tokenizers#1548
- feat(ci): add trufflehog secrets detection by @McPatate in huggingface/tokenizers#1551
- Switch from cached_download to hf_hub_download in tests by @Wauplin in huggingface/tokenizers#1547
- Fix "dictionnary" typo by @nprisbrey in huggingface/tokenizers#1511
- make sure we don't warn on empty tokens by @ArthurZucker in huggingface/tokenizers#1554
- Enable dropout = 0.0 as an equivalent to none in BPE by @mcognetta in huggingface/tokenizers#1550 (see the sketch after this list)
- Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) … by @ArthurZucker in huggingface/tokenizers#1569
- Add bytelevel normalizer to fix decode when adding tokens to BPE by @ArthurZucker in huggingface/tokenizers#1555
- Fix clippy + feature test management. by @Narsil in huggingface/tokenizers#1580
- Bump spm_precompiled to 0.1.3 by @MikeIvanichev in huggingface/tokenizers#1571
- Add benchmark vs tiktoken by @Narsil in huggingface/tokenizers#1582
- Fixing the benchmark. by @Narsil in huggingface/tokenizers#1583
- Tiny improvement by @Narsil in huggingface/tokenizers#1585
- Enable fancy regex by @Narsil in huggingface/tokenizers#1586
- Fixing release CI strict (taken from safetensors). by @Narsil in huggingface/tokenizers#1593
- Adding some serialization testing around the wrapper. by @Narsil in huggingface/tokenizers#1594
... (truncated)
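As one concrete example from the changelog above, the dropout = 0.0 change (huggingface/tokenizers#1550) means a zero value is now treated the same as no dropout at all; a small sketch, assuming tokenizers 0.20.0:
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Per huggingface/tokenizers#1550, dropout=0.0 is accepted and equivalent to
# dropout=None, i.e. BPE dropout disabled. Both constructions now work.
tok_zero = Tokenizer(BPE(dropout=0.0))
tok_none = Tokenizer(BPE(dropout=None))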
Commits
- a5adaac version 0.20.0
- a8def07 Merge branch 'fix_release' of github.com:huggingface/tokenizers into branch_v...
- fe50673 Fix CI
- b253835 push cargo
- fc3bb76 update dependencies
- bfd9cde Perf improvement 16% by removing offsets. (#1587)
- bd27fa5 add deserialize for pre tokenizers (#1603)
- 56c9c70 Tests + Deserialization improvement for normalizers. (#1604)
- 49dafd7 Fix strip python type (#1602)
- bded212 Support None to reset pre_tokenizers and normalizers, and index sequences (... (see the sketch below)
- Additional commits viewable in compare view
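The last listed commit (support for None to reset pre_tokenizers and normalizers) suggests usage along these lines; a hedged sketch, assuming tokenizers 0.20.0 and that the components are cleared by plain attribute assignment:
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.normalizer)   # BertNormalizer(...)

# Per the commit above, assigning None now clears these components entirely.
tokenizer.normalizer = None
tokenizer.pre_tokenizer = None
print(tokenizer.normalizer)   # None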
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:
- @dependabot rebase will rebase this PR
- @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
- @dependabot merge will merge this PR after your CI passes on it
- @dependabot squash and merge will squash and merge this PR after your CI passes on it
- @dependabot cancel merge will cancel a previously requested merge and block automerging
- @dependabot reopen will reopen this PR if it is closed
- @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
- @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
[!IMPORTANT]
Review skipped
Review was skipped due to path filters
Files ignored due to path filters (1)
poetry.lock is excluded by !**/*.lock
You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?
Tips
Chat
There are 3 ways to chat with CodeRabbit:
- Review comments: Directly reply to a review comment made by CodeRabbit. Examples:
  - I pushed a fix in commit <commit_id>.
  - Generate unit testing code for this file.
  - Open a follow-up GitHub issue for this discussion.
- Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
  - @coderabbitai generate unit testing code for this file.
  - @coderabbitai modularize this function.
- PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
  - @coderabbitai generate interesting stats about this repository and render them as a table.
  - @coderabbitai show all the console.log statements in this repository.
  - @coderabbitai read src/utils.ts and generate unit testing code.
  - @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
  - @coderabbitai help me debug CodeRabbit configuration file.
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.
CodeRabbit Commands (invoked as PR comments)
- @coderabbitai pause to pause the reviews on a PR.
- @coderabbitai resume to resume the paused reviews.
- @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
- @coderabbitai full review to do a full review from scratch and review all the files again.
- @coderabbitai summary to regenerate the summary of the PR.
- @coderabbitai resolve to resolve all the CodeRabbit review comments.
- @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
- @coderabbitai help to get help.
Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
CodeRabbit Configuration File (.coderabbit.yaml)
- You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
- Please see the configuration documentation for more information.
- If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation:
# yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
Documentation and Community
- Visit our Documentation for detailed information on how to use CodeRabbit.
- Join our Discord Community to get help, request features, and share feedback.
- Follow us on X/Twitter for updates and announcements.
Looks like tokenizers is up-to-date now, so this is no longer needed.