[Tokenizer] Support reading Tiktoken tokenizer.model.

Open lvdongyi opened this issue 1 year ago • 2 comments

PR types

New features

PR changes

APIs

Description

  1. Support reading Tiktoken tokenizer.model.

  2. Split PretrainedTokenizerBase.from_pretrained into two separate methods: from_pretrained and _from_pretrained.

  3. Prefer not to use FastTokenizer even if it is available. (To load TokenizerFast through AutoTokenizer, explicitly set use_fast=True.)

  4. Use LazyMapping to load keys and values only when they are accessed.

  5. Modify tests/transformers/test_modeling_common.py to support LlamaTokenizerFast
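
For reference on item 1: a Tiktoken-style vocabulary file is plain text with one `<base64-encoded token> <rank>` pair per line. Below is a minimal sketch of reading such a file under that assumption. `load_tiktoken_ranks` is a hypothetical helper name used only for illustration, not the function added in this PR.

```python
import base64
import os
import tempfile


def load_tiktoken_ranks(path):
    """Parse a tiktoken-style vocab file into a {token_bytes: rank} dict.

    Hypothetical helper for illustration. Assumes every non-empty line is
    '<base64-encoded token> <integer rank>'.
    """
    ranks = {}
    with open(path, "rb") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            token_b64, rank = line.split()
            ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


# Build a tiny example file in a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "tokenizer.model")
    with open(model_path, "wb") as f:
        for rank, token in enumerate([b"a", b"b", b"ab"]):
            f.write(base64.b64encode(token) + b" " + str(rank).encode() + b"\n")
    ranks = load_tiktoken_ranks(model_path)
```

The keys are raw bytes rather than strings because BPE tokens are not guaranteed to be valid UTF-8 on their own.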

TOKENIZER_MAPPING_NAMES, MODEL_NAMES_MAPPING, CONFIG_MAPPING_NAMES should be reviewed carefully
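
To illustrate the lazy-loading idea behind item 4, here is a minimal, self-contained sketch of a mapping that resolves a value only on first access and then caches it. `LazyValueMapping` and `load_llama` are hypothetical names; this is not the `LazyMapping` class from this PR, and the real one may differ in how it resolves keys.

```python
from collections.abc import Mapping


class LazyValueMapping(Mapping):
    """Read-only mapping whose values are produced on first access.

    Hypothetical illustration: `loaders` maps each key to a zero-argument
    callable that returns the real value (e.g. by importing a module).
    """

    def __init__(self, loaders):
        self._loaders = dict(loaders)
        self._cache = {}

    def __getitem__(self, key):
        if key not in self._cache:
            # Raises KeyError for unknown keys, as a normal dict would.
            self._cache[key] = self._loaders[key]()
        return self._cache[key]

    def __iter__(self):
        return iter(self._loaders)

    def __len__(self):
        return len(self._loaders)


calls = []


def load_llama():
    # Stand-in for an expensive import; records that it ran.
    calls.append("llama")
    return "LlamaTokenizer"


mapping = LazyValueMapping({"llama": load_llama})
```

Constructing `mapping` does not call `load_llama`; the loader runs once on the first `mapping["llama"]` lookup, and later lookups return the cached value.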

lvdongyi avatar Sep 28 '24 12:09 lvdongyi

Thanks for your contribution!

paddle-bot[bot] avatar Sep 28 '24 12:09 paddle-bot[bot]

Codecov Report

Attention: Patch coverage is 66.66667% with 156 lines in your changes missing coverage. Please review.

Project coverage is 52.80%. Comparing base (78f911a) to head (5579695). Report is 246 commits behind head on develop.

Files with missing lines                            Patch %   Lines
paddlenlp/transformers/auto/factory.py              46.06%    48 Missing :warning:
paddlenlp/utils/import_utils.py                     53.52%    33 Missing :warning:
paddlenlp/transformers/auto/tokenizer.py            76.47%    24 Missing :warning:
paddlenlp/transformers/auto/configuration.py        72.30%    18 Missing :warning:
paddlenlp/transformers/convert_slow_tokenizer.py    76.11%    16 Missing :warning:
paddlenlp/transformers/llama/tokenizer.py           41.17%    10 Missing :warning:
paddlenlp/transformers/tokenizer_utils_base.py      84.84%    5 Missing :warning:
paddlenlp/transformers/configuration_utils.py       50.00%    2 Missing :warning:
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9215      +/-   ##
===========================================
- Coverage    53.19%   52.80%   -0.40%     
===========================================
  Files          673      673              
  Lines       108855   107657    -1198     
===========================================
- Hits         57909    56849    -1060     
+ Misses       50946    50808     -138     

codecov[bot] avatar Sep 28 '24 12:09 codecov[bot]

If this needs to be merged, you can @ me.

ZHUI avatar Oct 23 '24 08:10 ZHUI

If this needs to be merged, you can @ me.

For some unknown reason, PaddleNLP-CI currently gets stuck at running P0case 2/4: albert

lvdongyi avatar Oct 23 '24 08:10 lvdongyi

OK, let's wait a while for CI. There is also a conflict that you can resolve.

ZHUI avatar Oct 23 '24 09:10 ZHUI

OK, let's wait a while for CI. There is also a conflict that you can resolve.

Resolved.

lvdongyi avatar Oct 23 '24 09:10 lvdongyi