PaddleNLP [Tokenizer] Support reading Tiktoken tokenizer.model.

PR types

New features

PR changes

APIs

Description

Support reading Tiktoken tokenizer.model.
Split PretrainedTokenizerBase.from_pretrained into two separate methods: from_pretrained and _from_pretrained.
Prefer not to use FastTokenizer even if it is available. (To load a TokenizerFast through AutoTokenizer, explicitly set use_fast=True.)
Use LazyMapping to load keys and values only when they are accessed.
Modify tests/transformers/test_modeling_common.py to support LlamaTokenizerFast.
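
For reviewers unfamiliar with the lazy-loading idea behind LazyMapping, here is a minimal sketch of the general pattern. It is not the class added in this PR: the name LazyImportMapping, its constructor argument, and the example keys are all hypothetical, and standard-library modules stand in for tokenizer modules. It only illustrates deferring an import until a key is first read.

```python
import importlib
from collections.abc import Mapping


class LazyImportMapping(Mapping):
    """Hypothetical sketch: resolve a key to an imported object on first access."""

    def __init__(self, specs):
        # specs maps key -> (module name, attribute name); nothing is imported yet.
        self._specs = dict(specs)
        self._cache = {}

    def __getitem__(self, key):
        if key not in self._cache:
            module_name, attr = self._specs[key]  # raises KeyError for unknown keys
            module = importlib.import_module(module_name)
            self._cache[key] = getattr(module, attr)
        return self._cache[key]

    def __iter__(self):
        # Iterating over keys does not trigger any import.
        return iter(self._specs)

    def __len__(self):
        return len(self._specs)


# Standard-library stand-ins for the real model/tokenizer modules (assumption).
mapping = LazyImportMapping({
    "ordered": ("collections", "OrderedDict"),
    "fraction": ("fractions", "Fraction"),
})
```

Reading mapping["ordered"] imports and caches only that one entry; listing the keys or taking len(mapping) resolves nothing. Note that a membership test such as "fraction" in mapping goes through __getitem__ in this sketch and therefore does trigger the import.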

TOKENIZER_MAPPING_NAMES, MODEL_NAMES_MAPPING, and CONFIG_MAPPING_NAMES should be reviewed carefully.

Sep 28 '24 12:09 lvdongyi

Thanks for your contribution!

Sep 28 '24 12:09 paddle-bot[bot]

Codecov Report

Attention: Patch coverage is 66.66667% with 156 lines in your changes missing coverage. Please review.

Project coverage is 52.80%. Comparing base (78f911a) to head (5579695). Report is 246 commits behind head on develop.

Files with missing lines	Patch %	Lines
paddlenlp/transformers/auto/factory.py	46.06%	48 Missing :warning:
paddlenlp/utils/import_utils.py	53.52%	33 Missing :warning:
paddlenlp/transformers/auto/tokenizer.py	76.47%	24 Missing :warning:
paddlenlp/transformers/auto/configuration.py	72.30%	18 Missing :warning:
paddlenlp/transformers/convert_slow_tokenizer.py	76.11%	16 Missing :warning:
paddlenlp/transformers/llama/tokenizer.py	41.17%	10 Missing :warning:
paddlenlp/transformers/tokenizer_utils_base.py	84.84%	5 Missing :warning:
paddlenlp/transformers/configuration_utils.py	50.00%	2 Missing :warning:

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #9215      +/-   ##
===========================================
- Coverage    53.19%   52.80%   -0.40%     
===========================================
  Files          673      673              
  Lines       108855   107657    -1198     
===========================================
- Hits         57909    56849    -1060     
+ Misses       50946    50808     -138


Sep 28 '24 12:09 codecov[bot]

If this needs to be merged, you can @ me.

Oct 23 '24 08:10 ZHUI

> If this needs to be merged, you can @ me.

For some reason I have not identified yet, PaddleNLP-CI currently gets stuck at "running P0case 2/4: albert".

Oct 23 '24 08:10 lvdongyi

OK, let's wait a while for CI. There is also a merge conflict that you could resolve.

Oct 23 '24 09:10 ZHUI

> OK, let's wait a while for CI. There is also a merge conflict that you could resolve.

Resolved.

Oct 23 '24 09:10 lvdongyi