
Whitespace tokenizer for training BERT from scratch

Open aqibsaeed opened this issue 5 years ago • 4 comments

Is there any example of using a whitespace tokenizer (one that splits text on whitespace only) for training BERT?

aqibsaeed avatar Apr 13 '20 21:04 aqibsaeed

See also https://github.com/huggingface/transformers/issues/3774

julien-c avatar Apr 15 '20 12:04 julien-c

transformers.BertTokenizerFast(vocab_file, do_lower_case=True, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', clean_text=True, tokenize_chinese_chars=True, strip_accents=True, wordpieces_prefix='##', **kwargs)

Maybe you can use this, by changing wordpieces_prefix to " ".

parmarsuraj99 avatar Apr 21 '20 16:04 parmarsuraj99
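
A minimal sketch of that suggestion, based on the signature quoted above. The vocab.txt path and the sample sentence are placeholders, and a WordPiece-style vocabulary file is assumed to already exist:

```python
from transformers import BertTokenizerFast

# Assumes a WordPiece vocabulary already exists at this placeholder path.
tokenizer = BertTokenizerFast(
    vocab_file="vocab.txt",
    do_lower_case=True,
    wordpieces_prefix="##",  # the suggestion above is to swap this for " "
)
print(tokenizer.tokenize("Training BERT from scratch"))
```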

Hi @aqibsaeed, please read my answer in another issue that should give you some direction on how to do this. Here is the answer in question: https://github.com/huggingface/tokenizers/issues/243#issuecomment-617860020

There is a WhitespaceSplit PreTokenizer that does split on whitespace only.

The default PreTokenizer we use in BertWordPieceTokenizer actually splits on whitespace and also on punctuation.

n1t0 avatar Apr 22 '20 16:04 n1t0
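
A minimal sketch of the approach described above, using the tokenizers Python API with WhitespaceSplit in place of the default BERT pre-tokenizer. The corpus path, vocab size, and special tokens are placeholders, and the call follows the current train(files, trainer) signature:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import WhitespaceSplit
from tokenizers.trainers import WordPieceTrainer

# WordPiece model, but pre-tokenization splits on whitespace only (no punctuation splitting).
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = WhitespaceSplit()

trainer = WordPieceTrainer(
    vocab_size=30_000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder training file
tokenizer.save("whitespace-wordpiece.json")
```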

Use WordLevel as the tokenizer model. See here: https://huggingface.co/docs/tokenizers/python/latest/components.html

moranbel avatar Jun 23 '21 18:06 moranbel
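
A sketch of the same idea with the WordLevel model instead of WordPiece, so the vocabulary holds whole words and no ## pieces. Again, corpus.txt and the special tokens are placeholders:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit
from tokenizers.trainers import WordLevelTrainer

# Whole-word vocabulary; anything unseen at training time maps to [UNK].
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = WhitespaceSplit()

trainer = WordLevelTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder training file
```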

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar May 28 '24 01:05 github-actions[bot]