[Tokenizer] Support reading Tiktoken tokenizer.model.
### PR types

New features

### PR changes

APIs

### Description
- Support reading Tiktoken `tokenizer.model` (a sketch of the file format follows this list).
- Split `PretrainedTokenizerBase.from_pretrained` into two separate methods, `from_pretrained` and `_from_pretrained` (see the rough sketch below).
- Prefer not to use FastTokenizer even if it is available. (When you want to load a TokenizerFast through `AutoTokenizer`, you should explicitly set `use_fast=True`; a usage example follows below.)
- Use `LazyMapping` to load keys and values only when they are accessed (a conceptual sketch follows below).
- Modify `tests/transformers/test_modeling_common.py` to support `LlamaTokenizerFast`.
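For reference, a Tiktoken-format `tokenizer.model` is a plain-text file whose lines are `<base64-encoded token> <rank>` pairs. The snippet below is a minimal, illustrative sketch of reading such a file into a vocab dict; it is not the loading code added in this PR.

```python
# Minimal sketch of parsing a Tiktoken-format tokenizer.model.
# Illustrative only; the real loading path lives inside the tokenizer classes.
import base64


def load_tiktoken_vocab(path):
    vocab = {}
    with open(path, "rb") as f:
        for line in f:
            if not line.strip():
                continue
            token_b64, rank = line.split()
            # Each entry maps the raw token bytes to its BPE rank / id.
            vocab[base64.b64decode(token_b64)] = int(rank)
    return vocab
```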
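A rough, hypothetical sketch of the `from_pretrained` / `_from_pretrained` split; the method bodies and the `_resolve_files` helper are simplified placeholders, not the actual PaddleNLP implementation.

```python
class PretrainedTokenizerBase:
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
        # Resolve (download or locate) the tokenizer files first ...
        resolved_files = cls._resolve_files(pretrained_model_name_or_path)  # hypothetical helper
        # ... then delegate construction of the tokenizer instance.
        return cls._from_pretrained(resolved_files, pretrained_model_name_or_path, *args, **kwargs)

    @classmethod
    def _from_pretrained(cls, resolved_files, pretrained_model_name_or_path, *args, **kwargs):
        # Build the tokenizer object from the already-resolved local files.
        ...
```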
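Example of the resulting behavior when loading through `AutoTokenizer` (the model name is illustrative):

```python
from paddlenlp.transformers import AutoTokenizer

# Default: the slow (Python) tokenizer is returned even if a fast one exists.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b")

# Explicitly request the TokenizerFast variant.
fast_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b", use_fast=True)
```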
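Finally, a minimal conceptual sketch of the lazy-mapping idea, assuming a key-to-factory mapping; the actual `LazyMapping` class in this PR may differ.

```python
from collections.abc import Mapping


class LazyMapping(Mapping):
    """Conceptual sketch: builds (and caches) each value only on first access."""

    def __init__(self, factories):
        self._factories = factories  # key -> zero-argument callable
        self._cache = {}

    def __getitem__(self, key):
        if key not in self._cache:
            # Construct the value lazily, only when the key is first accessed.
            self._cache[key] = self._factories[key]()
        return self._cache[key]

    def __iter__(self):
        return iter(self._factories)

    def __len__(self):
        return len(self._factories)
```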
`TOKENIZER_MAPPING_NAMES`, `MODEL_NAMES_MAPPING`, and `CONFIG_MAPPING_NAMES` should be reviewed carefully.
Thanks for your contribution!
Codecov Report
Attention: Patch coverage is 66.66667% with 156 lines in your changes missing coverage. Please review.
Project coverage is 52.80%. Comparing base (78f911a) to head (5579695). Report is 246 commits behind head on develop.
Additional details and impacted files
@@ Coverage Diff @@
## develop #9215 +/- ##
===========================================
- Coverage 53.19% 52.80% -0.40%
===========================================
Files 673 673
Lines 108855 107657 -1198
===========================================
- Hits 57909 56849 -1060
+ Misses 50946 50808 -138
If this needs to be merged, you can @ me.
For reasons currently unknown, PaddleNLP-CI gets stuck at "running P0case 2/4: albert".
OK, let's wait a bit for CI. There is also a conflict that should be resolved.
Resolved.