Whitespace tokenizer for training BERT from scratch
Is there any example of using a whitespace tokenizer (one that splits text based only on whitespace) for training BERT?
See also https://github.com/huggingface/transformers/issues/3774
transformers.BertTokenizerFast(vocab_file, do_lower_case=True, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', clean_text=True, tokenize_chinese_chars=True, strip_accents=True, wordpieces_prefix='##', **kwargs)
Maybe you can use this, by changing wordpieces_prefix to " ".
Hi @aqibsaeed, please read my answer in another issue that should give you some direction on how to do this. Here is the answer in question: https://github.com/huggingface/tokenizers/issues/243#issuecomment-617860020
There is a WhitespaceSplit PreTokenizer that splits on whitespace only.
The default PreTokenizer we use in BertWordPieceTokenizer actually splits on whitespace and also on punctuation.
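For reference, a minimal sketch of that idea with the lower-level `tokenizers` API (not taken from the linked answer): a WordPiece model paired with WhitespaceSplit instead of the default BERT pre-tokenizer. The corpus file name, vocabulary size, and special-token list are placeholder assumptions.

```python
# Sketch only: swap the default BERT pre-tokenization (whitespace + punctuation)
# for WhitespaceSplit, which splits on whitespace alone.
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import WhitespaceSplit
from tokenizers.trainers import WordPieceTrainer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = WhitespaceSplit()  # whitespace only; punctuation stays attached to words

trainer = WordPieceTrainer(
    vocab_size=30_000,  # placeholder value
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # corpus.txt is a placeholder path
```

Note that WordPiece will still break rare words into ##-prefixed pieces; only the pre-tokenization step is restricted to whitespace here.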
Use WordLevel as the tokenizer model. See here: https://huggingface.co/docs/tokenizers/python/latest/components.html
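Putting that together for the original question, here is a hedged sketch of training a pure word-level tokenizer (one token per whitespace-separated word) and wrapping it for use with `transformers`. The file names and vocabulary size are illustrative assumptions, not values from this thread.

```python
# Sketch only: WordLevel model + WhitespaceSplit pre-tokenizer,
# then wrap the result so a BERT model from `transformers` can use it.
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit
from tokenizers.trainers import WordLevelTrainer
from tokenizers.processors import TemplateProcessing
from transformers import PreTrainedTokenizerFast

tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = WhitespaceSplit()  # split on whitespace only

trainer = WordLevelTrainer(
    vocab_size=30_000,  # placeholder value
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder corpus path

# Add the usual BERT [CLS] ... [SEP] framing so encodings are ready for pretraining.
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
tokenizer.save("whitespace-wordlevel.json")

# Load it back as a fast tokenizer usable with BertForMaskedLM, Trainer, etc.
bert_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="whitespace-wordlevel.json",
    unk_token="[UNK]", pad_token="[PAD]", cls_token="[CLS]",
    sep_token="[SEP]", mask_token="[MASK]",
)
```

The wrapped tokenizer can then be passed to a BERT pretraining script in place of the usual WordPiece tokenizer.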