BertTokenizers
Open source project for BERT Tokenizers in C#.
Relevant line: https://github.com/NMZivkovic/BertTokenizers/blob/150e40a178902bd258d4c9986dc1485c25c404b3/src/Helpers/VocabularyReader.cs#L18 The vocabulary at https://huggingface.co/p208p2002/zh-wiki-punctuation-restore/blob/main/vocab.txt happens to contain `"\u2028"`
I would expect the tokenizer's behavior to match the Python version; otherwise it will be hard to convert samples from Python to .NET. 1. `tokenizer.Encode` should stop when `sequenceLength` is reached instead...
The following code:

```csharp
var res = vocabulary.Tokenize("point™");
```

never returns if `™` cannot be matched in the vocabulary. The issue was introduced in this commit: https://github.com/NMZivkovic/BertTokenizers/commit/0f29cefd5bcdc3dcfdea5b9d0133ccbc1d0d5023#diff-82215a359c504385d48356d59d6635f3b968278cca935c73977e16cea13f4174 Specifically this line:...
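For illustration, here is a minimal sketch of a greedy longest-match WordPiece split that always terminates, because a word with no matching piece is mapped to an unknown token instead of being retried. This is not the library's implementation: `WordPieceSketch`, `SplitWord`, and the `[UNK]` default are hypothetical names chosen for this example, and the whole-word fallback follows the behavior commonly seen in the Python BERT tokenizer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of BertTokenizers.
public static class WordPieceSketch
{
    // Splits one word into WordPiece tokens using greedy longest-match-first.
    // Returns a single unknown token if any part of the word cannot be matched.
    public static List<string> SplitWord(
        string word, ISet<string> vocabulary, string unknownToken = "[UNK]")
    {
        var pieces = new List<string>();
        int start = 0;

        while (start < word.Length)
        {
            string piece = "";
            bool found = false;

            // Try the longest remaining substring first, then shrink from the right.
            for (int end = word.Length; end > start; end--)
            {
                string candidate = word.Substring(start, end - start);
                if (start > 0)
                {
                    candidate = "##" + candidate; // continuation-piece prefix
                }

                if (vocabulary.Contains(candidate))
                {
                    piece = candidate;
                    start = end; // strictly advances, so the outer loop makes progress
                    found = true;
                    break;
                }
            }

            if (!found)
            {
                // Nothing matched at this position: give up on the word
                // rather than looping on the same character forever.
                return new List<string> { unknownToken };
            }

            pieces.Add(piece);
        }

        return pieces;
    }
}
```

With a vocabulary that contains `point` but no entry for `™`, `WordPieceSketch.SplitWord("point™", vocabulary)` returns a single `[UNK]` token instead of hanging.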
The package can already easily support additional .NET versions. Please accept this commit so that we can reference the project as-is instead of creating a fork for this purpose....
See issue #22
Inside `BertUncasedCustomVocabulary.cs`, I would expect to see the declaration for a class extending `UncasedTokenizer`, however the class has the following signature `public class BertCasedCustomVocabulary : CasedTokenizer`. Something similar happens inside...
Fixes https://github.com/NMZivkovic/BertTokenizers/issues/24
This causes an infinite loop:

```cs
var _tokenizer = new BertUncasedBaseTokenizer();
var sentence = "Linux\nline\nendings";
var tokens = _tokenizer.Tokenize(sentence);
```

The problem is that the `TokenizeSentence` method doesn't have '\n'...
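One way to avoid depending on a hand-maintained separator list is to split on every character .NET treats as white space, which includes `'\n'`. The sketch below only illustrates that idea; `WhitespaceSplitSketch` and `SplitOnWhitespace` are hypothetical names, and this is not how `TokenizeSentence` is written.

```csharp
using System;

// Hypothetical helper, not part of BertTokenizers.
public static class WhitespaceSplitSketch
{
    // Passing a null separator array makes string.Split use all white-space
    // characters (space, tab, '\n', '\r', ...) as delimiters.
    public static string[] SplitOnWhitespace(string sentence)
    {
        return sentence.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

For the sentence from the report, `SplitOnWhitespace("Linux\nline\nendings")` yields the three words `Linux`, `line`, and `endings`.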
The Bert*Tokenizer classes look in `"./Vocabularies"`, which is resolved against the current working directory (usually the source folders, or at any rate wherever you have run it _from_) rather than the directory where the application is actually running. That...
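A minimal sketch of one possible fix, assuming the vocabulary files are deployed next to the application binaries: build the path from `AppContext.BaseDirectory` instead of the working directory. `VocabularyPathSketch`, `Resolve`, and the file name in the usage note are hypothetical and do not come from the library.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of BertTokenizers.
public static class VocabularyPathSketch
{
    // Resolves a vocabulary file relative to the directory the application
    // runs from, independent of the current working directory.
    public static string Resolve(string fileName)
    {
        return Path.Combine(AppContext.BaseDirectory, "Vocabularies", fileName);
    }
}
```

For example, `VocabularyPathSketch.Resolve("vocab.txt")` returns an absolute path ending in `Vocabularies/vocab.txt` (with the platform's directory separator), regardless of where the process was launched from.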